Geography Notebook

Part 1 - Geocaching

Your friend Kelly is coming to London on vacation and you want to help her plan out her journey- which spots to see and what routes to take to do it in the most efficient way.

The data set that we are going to use is the London tube/overgound locations. To save time, we have already pre-downloaded and cleaned the data set for you. If you are interested in doing this yourself, you can register at the TFL website, get an api key, download the file, and clean it by removing all \t, \n and converting dos line endings to unix.

In [1]:
%pylab inline

import pandas as pd
stations = pd.read_hdf('./datasets/london_stations.h5')

# Let's examine the dataset
stations.head()
Populating the interactive namespace from numpy and matplotlib
Out[1]:
coordinate-x coordinate-y station_name
0 -0.280251 51.502750 Acton Town Station
1 -0.075614 51.514272 Aldgate Station
2 -0.072287 51.515233 Aldgate East Station
3 -0.299487 51.540695 Alperton Station
4 -0.607479 51.674150 Amersham Station

You can do some pretty cool stuff with a pandas dataframe, including filtering rows using conditionals like so:

In [2]:
# Only look at the rows of the stations dataframe where the station_name column is 'Aldgate Station'
stations[stations.station_name == 'Aldgate Station']
Out[2]:
coordinate-x coordinate-y station_name
1 -0.075614 51.514272 Aldgate Station
In [3]:
# Only look at the rows of the stations dataframe where the coordinate-x column is greater than 0
stations[stations['coordinate-x'] > 0].head()
Out[3]:
coordinate-x coordinate-y station_name
13 0.080863 51.539451 Barking Station
14 0.088511 51.585786 Barkingside Station
17 0.127400 51.540289 Becontree Station
30 0.046744 51.626517 Buckhurst Hill Station
36 0.008171 51.513837 Canning Town Station

Hold on...

Notice how we use stations.station_name in the first filter but stations['coordinate-x'] > 0 in the second filter?

That is because pandas allows you to access column names both as dictionary keys and as attributes!

Here's a trick question- why doesn't this work?

stations[stations.coordinate-x > 0]
In [4]:
# Let's rename the column names so that there is no `-` in them
stations = stations.rename(columns={'coordinate-x': 'x', 'coordinate-y': 'y', 'station_name': 'name'})

# Sanity check. Also, here's a new function called tail(). Guess what it does?
stations[stations.y > 51.5].tail()
Out[4]:
x y name
296 -0.074531 51.510567 Tower Gateway Station
297 -0.296859 51.552330 Wembley Central Station
298 -0.020148 51.506638 West India Quay Station
299 -0.025792 51.509432 Westferry Station
300 -0.244289 51.532191 Willesden Junction Station

Geo-caching

Let's take a quick break from Pandas, and talk about where you are going to tell your friend to go.

We could just tell him/her to visit popular landmarks in London, but what's the fun in that? Geo-caching is all the rage these days, and you want to introduce your friend to it. We are going to generate a list of longitude/latitude coordinates for your friend to visit.

In [5]:
from collections import namedtuple
from random import uniform

# Here is a cool python feature.
# (We could have also used pygeoif.geometry.Point, or just had an un-named tuple/list)
Coordinates = namedtuple('Coord', ['x', 'y'])

# Our London playground is going to be between -1° to 1° longitude and 52° to 51° latitude
# We are going to generate 200 random locations
targets_list = [
    Coordinates(x=uniform(-1, 1), y=uniform(51,52)) 
    for ii in range(200)
]

# Let's list first 5 of them
targets_list[:5]
Out[5]:
[Coord(x=-0.8241136591946097, y=51.44460462769485),
 Coord(x=-0.331782046670529, y=51.9043471817114),
 Coord(x=-0.0680554888506224, y=51.19204656752541),
 Coord(x=-0.908544117957282, y=51.672528229293675),
 Coord(x=0.6477197306524864, y=51.637751531692274)]
In [6]:
# We can transform our list of tuples into pandas Series
targets = pd.Series(targets_list)

targets.head()
Out[6]:
0    (-0.8241136591946097, 51.44460462769485)
1      (-0.331782046670529, 51.9043471817114)
2    (-0.0680554888506224, 51.19204656752541)
3    (-0.908544117957282, 51.672528229293675)
4    (0.6477197306524864, 51.637751531692274)
dtype: object
In [7]:
# Remember, that you can also access Series and Dataframes using their index
targets.iloc[0]
Out[7]:
Coord(x=-0.8241136591946097, y=51.44460462769485)
In [8]:
# Let's separate out x and y into two different columns.
# The apply function maps a function over a pandas Series or Dataframe.
targets = pd.DataFrame({
    'x': targets.apply(lambda p: p.x),
    'y': targets.apply(lambda p: p.y)
})

targets.head()
Out[8]:
x y
0 -0.824114 51.444605
1 -0.331782 51.904347
2 -0.068055 51.192047
3 -0.908544 51.672528
4 0.647720 51.637752

Data cleaning and sanity checks

In [9]:
# Let's make sure that our data is squeeky clean
stations.isnull().any()
Out[9]:
x       False
y       False
name    False
dtype: bool

Let the computations begin!

Let's assume the world is flat for the moment...

We will use a quick nearest-neighbor lookup to determine the closest location. See this for more details about the KD-tree algorithm we are using.

In [10]:
from scipy.spatial import KDTree

stationsTree = KDTree(stations[['x', 'y']])
targetsTree = KDTree(targets[['x', 'y']])

What is a good way to check how this works?

Well if you have a small enough neighborhood radius, no stations should be neighbors with anybody but themselves Thus, we should expact that the number of neighbours points to be equal to the number of stations.

In [11]:
print(stationsTree.count_neighbors(stationsTree, r=0.001))
print(len(stations))
306
302

Huh? Not equal?

Maybe the radius was too large?

Let's do some quick checks:

0.001° of Latitude correspond to about 111 meters, and 0.001° of Longitude (for London) correspond to about 71.6 meters.

It is certainly possible, that London has some tube stations less than 100 meters away.

Let's try a 10 times smaller radius (so 11 meters North/East or 7 meters West/East).

In [12]:
print(stationsTree.count_neighbors(stationsTree, r=0.0001))
306
In [13]:
# Huh? How about an even smaller radius?
print(stationsTree.count_neighbors(stationsTree, r=0.00001))
304

What does it mean? Are there stations that are less than 11 meters apart?

Let's list them:

In [14]:
stationsTree.query_pairs(r=0.001)
Out[14]:
{(11, 132), (288, 289)}
In [15]:
stations.iloc[[11, 132, 288, 289]]
Out[15]:
x y name
11 -0.088916 51.513302 Bank Station
132 -0.088943 51.513348 Monument Station
288 -0.226305 51.505561 Shepherd's Bush Central Station
289 -0.226305 51.505561 Shepherd's Bush Hammersmith & City Station

Aha! It turns out that Shepherd's Bush Central Station is the same station as Shepherd's Bush Hammersmith & City Station.

Definitely a reminder that we need to be careful and should take time to explore our data before we do any serious analysis.

Let's say we are lazy and don't want to walk far distances. Let's try to find the GeoCaching targets that are close to subway stations.

What targets are within 1000m of a subway station? Let's assume that every 0.0001° is 7m.

In [16]:
radius = 1000.0 / 7 * 0.0001

for idx, target in targets.iterrows():
    close_by_stations_idx = stationsTree.query_ball_point([target.x, target.y], r=radius)
    if close_by_stations_idx:
        stations_name = stations.iloc[close_by_stations_idx]['name'].tolist()
        # string formatting!
        print('Target {} at ({:.2f}°, {:.2f}°) is close to stations {}'.format(
            idx, target.x, target.y, stations_name) )
Target 18 at (-0.35°, 51.56°) is close to stations ['South Harrow Station', 'Sudbury Hill Station']
Target 31 at (-0.28°, 51.50°) is close to stations ['Acton Town Station', 'Chiswick Park Station', 'Gunnersbury Station']
Target 54 at (-0.09°, 51.56°) is close to stations ['Manor House Station']
Target 65 at (-0.22°, 51.42°) is close to stations ['Wimbledon Station']
Target 102 at (-0.23°, 51.54°) is close to stations ['Willesden Green Station', 'Kensal Green Station']
Target 107 at (-0.02°, 51.49°) is close to stations ['Deptford Bridge Station', 'New Cross Station', 'South Quay Station', 'Crossharbour & London Arena Station', 'Mudchute Station']
Target 134 at (-0.09°, 51.56°) is close to stations ['Arsenal Station', 'Manor House Station']
Target 156 at (-0.06°, 51.53°) is close to stations ['Bethnal Green Station', 'Whitechapel Station']
Target 158 at (-0.16°, 51.60°) is close to stations ['East Finchley Station']
Target 195 at (-0.21°, 51.61°) is close to stations ['Mill Hill East Station']

Dig Deeper

Here are a couple suggestions to explore further:

  • Access the google maps api to get transportation times between different subway stations. Then try to figure out what is the quickest way to say visit all 10 targets!

  • There are a couple ways to do this, and it also depends on your requirements (e.g.: do you want to finish where you started). Take a look at the different algorithms for the shortest path problem, and the travelling salesman problem.

  • Try to implement a particular algorithm yourself! (maybe learn to set up the algorithm in a separate file, and import it here)

  • We calculated the shortest distances assuming that the world was flat- which is actually a pretty accurate approximation when considering distances within London. How about calculating the distance based on a globe-shaped geometry. It is pretty easy to use a different distance function, but then you wouldn't be able to use the KD-trees algorithm. If you wanted to keep using that, you want to turn all coordinates into a vector to represent where they are located in space. Then the standard cartesian distance function would still work and so you can keep using KD-trees. One library for this is proj4. Or for this use case, you could just write a quick function to set that up yourself as well.

  • Once you start getting into graph theory problems like TSP, it could be cool to try setting up a Neo4j database and visualize data that way! (definitely out of the scope of this tutorial though)

Part 2 - Visualization

For this part, if you are not doing this on PythonAnywhere and struggling with the basemap install, be sure to install the newest v1.10 version. It can be a bit tricky to install basemap but give it a try- there are many tutorials and resources on how to do it.

First off, let's try to make a map of the UK.

In [17]:
# For the explanation of the parameters, see https://matplotlib.org/basemap/api/basemap_api.html.

# We need to choose a projection. Here we will use 'aea' that means Albers Equal Area Projection,
# a projection, in which we preserve the area surface but not angels (not every right angle will
# be seen as a right).
from mpl_toolkits.basemap import Basemap

m = Basemap(
    llcrnrlon=-10.5, llcrnrlat=49.5, urcrnrlon=3.5, urcrnrlat=59.5,
    resolution='i', projection='aea', lon_0=-4.36, lat_0=54.7
)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(-40, 61, 1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-20, 21, 1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

plt.title("Great Britain (in Albers Equal Area Projection)")
plt.show()

Dig Deeper: Now try to use different projections and see how the map changes. Also try out different resolutions. Above we used i (intermediate). Try l (low) or h (high).

Cool. So tt worked with the UK. Let's try to make a map of London.

In [18]:
m = Basemap(
    llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
    resolution='i', projection='aea', lon_0=-0.09, lat_0=51.43
)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

plt.title("Londons (in Albers Equal Area Projection)")
plt.show()

Oh naoosss! It's all white space. Where is London?

We will have to add some administration borders. We can find the information here:
https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london

In [19]:
# Download the files and unzip
!rm -rf statistical-gis-boundaries-london*
!wget 'https://files.datapress.com/london/dataset/statistical-gis-boundary-files-london/2016-10-03T13:52:28/statistical-gis-boundaries-london.zip' 
!unzip -q statistical-gis-boundaries-london.zip
--2018-04-25 22:16:58--  https://files.datapress.com/london/dataset/statistical-gis-boundary-files-london/2016-10-03T13:52:28/statistical-gis-boundaries-london.zip
Resolving files.datapress.com (files.datapress.com)... 52.85.131.43, 52.85.131.12, 52.85.131.246, ...
Connecting to files.datapress.com (files.datapress.com)|52.85.131.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28666674 (27M) [application/zip]
Saving to: ‘statistical-gis-boundaries-london.zip’

100%[======================================>] 28,666,674  25.9MB/s   in 1.1s   

2018-04-25 22:17:00 (25.9 MB/s) - ‘statistical-gis-boundaries-london.zip’ saved [28666674/28666674]

Let's see what we've got. We can load the shapefiles with the administration borders using geopandas. It has very similar syntax to the normal pandas. For more, see this.

In [20]:
import geopandas as gpd

# Let's use Borough level data
borders = gpd.read_file('./statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shp')

# Display the first few rows
borders.head()
Out[20]:
NAME GSS_CODE HECTARES NONLD_AREA ONS_INNER SUB_2009 SUB_2006 geometry
0 Kingston upon Thames E09000021 3726.117 0.000 F POLYGON ((516401.6 160201.8, 516407.3 160210.5...
1 Croydon E09000008 8649.441 0.000 F POLYGON ((535009.2 159504.7, 535005.5 159502, ...
2 Bromley E09000006 15013.487 0.000 F POLYGON ((540373.6 157530.4, 540361.2 157551.9...
3 Hounslow E09000018 5658.541 60.755 F POLYGON ((521975.8 178100, 521967.7 178096.8, ...
4 Ealing E09000009 5554.428 0.000 F POLYGON ((510253.5 182881.6, 510249.9 182886, ...

What we need is the geometry part, i.e. the corners of the polygons that describe the borders of the MSOA regions.

In [21]:
# let's see which kind of projection we are using
borders.crs
Out[21]:
{'proj': 'tmerc',
 'lat_0': 49,
 'lon_0': -2,
 'k': 0.999601272,
 'x_0': 400000,
 'y_0': -100000,
 'datum': 'OSGB36',
 'units': 'm',
 'no_defs': True}

We are using tmerc, ie. Transverse Mercator Projection
(see https://matplotlib.org/basemap/users/tmerc.html)

It will be more useful to transform it into the lattitude/longitude coordinates (EPSG:4326)
(see https://en.wikipedia.org/wiki/World_Geodetic_System)

In [22]:
borders = borders.to_crs(epsg='4326')
borders.head()
Out[22]:
NAME GSS_CODE HECTARES NONLD_AREA ONS_INNER SUB_2009 SUB_2006 geometry
0 Kingston upon Thames E09000021 3726.117 0.000 F POLYGON ((-0.3306790629424527 51.3290110106029...
1 Croydon E09000008 8649.441 0.000 F POLYGON ((-0.06402123962011302 51.318637659874...
2 Bromley E09000006 15013.487 0.000 F POLYGON ((0.0121309385091763 51.29959905965642...
3 Hounslow E09000018 5658.541 60.755 F POLYGON ((-0.2445623945250543 51.4887021763399...
4 Ealing E09000009 5554.428 0.000 F POLYGON ((-0.4118326897314672 51.5340838625760...
In [23]:
# Perfect, it works!
# Let's not forget to save the transformed shapes.
borders.to_file("./borders_epsg4326.shp")
In [24]:
# Now we can finally make our map
m = Basemap(
    llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
    resolution='i', projection='aea', lon_0=-0.09, lat_0=51.43
)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

# Load the shapefiles again. There must have in lat/lon coordinates.
# Note, that we should not specify shp extension below - it will be added automatically
m.readshapefile('./borders_epsg4326', 'boroughs')

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

plt.title("Londons with boroughs borders")
plt.show()

Let's add now the stations and geo-caching targets that we want to visit.

In [25]:
m = Basemap(
        llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
        resolution='h', projection='aea', lon_0=-0.09, lat_0=51.43)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

m.readshapefile('./borders_epsg4326', 'districts')

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

# add points
for i, row in stations.iterrows():
    x, y = m(row.x, row.y)
    m.plot(x, y, 'bo', markersize=8)

for i, row in targets.iterrows():
    x, y = m(row.x, row.y)
    m.plot(x, y, 'ro', markersize=8)
    
plt.title("London - stations and targets")
plt.show()

Dig Deeper

Play around with the maps- add a legend, label the dots, show only the dots that are close to stations. If you dug deeper previously and solved the TSP problem, try connecting the dots on the map- that would be a cool visualization of the TSP solution.

Lastly, here are some really cool map visualizations. Read, be inspire, and make some cool maps!