Geography Notebook¶

Part 1 - Geocaching¶

Your friend Kelly is coming to London on vacation and you want to help her plan out her journey- which spots to see and what routes to take to do it in the most efficient way.

The data set that we are going to use is the London tube/overgound locations. To save time, we have already pre-downloaded and cleaned the data set for you. If you are interested in doing this yourself, you can register at the TFL website, get an api key, download the file, and clean it by removing all \t, \n and converting dos line endings to unix.

%pylab inline

import pandas as pd
stations = pd.read_hdf('./datasets/london_stations.h5')

# Let's examine the dataset
stations.head()

Populating the interactive namespace from numpy and matplotlib

You can do some pretty cool stuff with a pandas dataframe, including filtering rows using conditionals like so:

# Only look at the rows of the stations dataframe where the station_name column is 'Aldgate Station'
stations[stations.station_name == 'Aldgate Station']

# Only look at the rows of the stations dataframe where the coordinate-x column is greater than 0
stations[stations['coordinate-x'] > 0].head()

Hold on...

Notice how we use stations.station_name in the first filter but stations['coordinate-x'] > 0 in the second filter?

That is because pandas allows you to access column names both as dictionary keys and as attributes!

Here's a trick question- why doesn't this work?

stations[stations.coordinate-x > 0]

# Let's rename the column names so that there is no `-` in them
stations = stations.rename(columns={'coordinate-x': 'x', 'coordinate-y': 'y', 'station_name': 'name'})

# Sanity check. Also, here's a new function called tail(). Guess what it does?
stations[stations.y > 51.5].tail()

Geo-caching¶

Let's take a quick break from Pandas, and talk about where you are going to tell your friend to go.

We could just tell him/her to visit popular landmarks in London, but what's the fun in that? Geo-caching is all the rage these days, and you want to introduce your friend to it. We are going to generate a list of longitude/latitude coordinates for your friend to visit.

from collections import namedtuple
from random import uniform

# Here is a cool python feature.
# (We could have also used pygeoif.geometry.Point, or just had an un-named tuple/list)
Coordinates = namedtuple('Coord', ['x', 'y'])

# Our London playground is going to be between -1° to 1° longitude and 52° to 51° latitude
# We are going to generate 200 random locations
targets_list = [
    Coordinates(x=uniform(-1, 1), y=uniform(51,52)) 
    for ii in range(200)
]

# Let's list first 5 of them
targets_list[:5]

[Coord(x=-0.8241136591946097, y=51.44460462769485),
 Coord(x=-0.331782046670529, y=51.9043471817114),
 Coord(x=-0.0680554888506224, y=51.19204656752541),
 Coord(x=-0.908544117957282, y=51.672528229293675),
 Coord(x=0.6477197306524864, y=51.637751531692274)]

# We can transform our list of tuples into pandas Series
targets = pd.Series(targets_list)

targets.head()

0    (-0.8241136591946097, 51.44460462769485)
1      (-0.331782046670529, 51.9043471817114)
2    (-0.0680554888506224, 51.19204656752541)
3    (-0.908544117957282, 51.672528229293675)
4    (0.6477197306524864, 51.637751531692274)
dtype: object

# Remember, that you can also access Series and Dataframes using their index
targets.iloc[0]

Coord(x=-0.8241136591946097, y=51.44460462769485)

# Let's separate out x and y into two different columns.
# The apply function maps a function over a pandas Series or Dataframe.
targets = pd.DataFrame({
    'x': targets.apply(lambda p: p.x),
    'y': targets.apply(lambda p: p.y)
})

targets.head()

Data cleaning and sanity checks¶

# Let's make sure that our data is squeeky clean
stations.isnull().any()

x       False
y       False
name    False
dtype: bool

Let the computations begin!¶

Let's assume the world is flat for the moment...

We will use a quick nearest-neighbor lookup to determine the closest location. See this for more details about the KD-tree algorithm we are using.

from scipy.spatial import KDTree

stationsTree = KDTree(stations[['x', 'y']])
targetsTree = KDTree(targets[['x', 'y']])

What is a good way to check how this works?

Well if you have a small enough neighborhood radius, no stations should be neighbors with anybody but themselves Thus, we should expact that the number of neighbours points to be equal to the number of stations.

print(stationsTree.count_neighbors(stationsTree, r=0.001))
print(len(stations))

306
302

Huh? Not equal?

Maybe the radius was too large?

Let's do some quick checks:

0.001° of Latitude correspond to about 111 meters, and 0.001° of Longitude (for London) correspond to about 71.6 meters.

It is certainly possible, that London has some tube stations less than 100 meters away.

Let's try a 10 times smaller radius (so 11 meters North/East or 7 meters West/East).

print(stationsTree.count_neighbors(stationsTree, r=0.0001))

306

# Huh? How about an even smaller radius?
print(stationsTree.count_neighbors(stationsTree, r=0.00001))

304

What does it mean? Are there stations that are less than 11 meters apart?

Let's list them:

stationsTree.query_pairs(r=0.001)

{(11, 132), (288, 289)}

stations.iloc[[11, 132, 288, 289]]

Aha! It turns out that Shepherd's Bush Central Station is the same station as Shepherd's Bush Hammersmith & City Station.

Definitely a reminder that we need to be careful and should take time to explore our data before we do any serious analysis.

Let's say we are lazy and don't want to walk far distances. Let's try to find the GeoCaching targets that are close to subway stations.

What targets are within 1000m of a subway station? Let's assume that every 0.0001° is 7m.

radius = 1000.0 / 7 * 0.0001

for idx, target in targets.iterrows():
    close_by_stations_idx = stationsTree.query_ball_point([target.x, target.y], r=radius)
    if close_by_stations_idx:
        stations_name = stations.iloc[close_by_stations_idx]['name'].tolist()
        # string formatting!
        print('Target {} at ({:.2f}°, {:.2f}°) is close to stations {}'.format(
            idx, target.x, target.y, stations_name) )

Target 18 at (-0.35°, 51.56°) is close to stations ['South Harrow Station', 'Sudbury Hill Station']
Target 31 at (-0.28°, 51.50°) is close to stations ['Acton Town Station', 'Chiswick Park Station', 'Gunnersbury Station']
Target 54 at (-0.09°, 51.56°) is close to stations ['Manor House Station']
Target 65 at (-0.22°, 51.42°) is close to stations ['Wimbledon Station']
Target 102 at (-0.23°, 51.54°) is close to stations ['Willesden Green Station', 'Kensal Green Station']
Target 107 at (-0.02°, 51.49°) is close to stations ['Deptford Bridge Station', 'New Cross Station', 'South Quay Station', 'Crossharbour & London Arena Station', 'Mudchute Station']
Target 134 at (-0.09°, 51.56°) is close to stations ['Arsenal Station', 'Manor House Station']
Target 156 at (-0.06°, 51.53°) is close to stations ['Bethnal Green Station', 'Whitechapel Station']
Target 158 at (-0.16°, 51.60°) is close to stations ['East Finchley Station']
Target 195 at (-0.21°, 51.61°) is close to stations ['Mill Hill East Station']

Dig Deeper¶

Here are a couple suggestions to explore further:

Access the google maps api to get transportation times between different subway stations. Then try to figure out what is the quickest way to say visit all 10 targets!
There are a couple ways to do this, and it also depends on your requirements (e.g.: do you want to finish where you started). Take a look at the different algorithms for the shortest path problem, and the travelling salesman problem.
Try to implement a particular algorithm yourself! (maybe learn to set up the algorithm in a separate file, and import it here)
We calculated the shortest distances assuming that the world was flat- which is actually a pretty accurate approximation when considering distances within London. How about calculating the distance based on a globe-shaped geometry. It is pretty easy to use a different distance function, but then you wouldn't be able to use the KD-trees algorithm. If you wanted to keep using that, you want to turn all coordinates into a vector to represent where they are located in space. Then the standard cartesian distance function would still work and so you can keep using KD-trees. One library for this is proj4. Or for this use case, you could just write a quick function to set that up yourself as well.
Once you start getting into graph theory problems like TSP, it could be cool to try setting up a Neo4j database and visualize data that way! (definitely out of the scope of this tutorial though)

Part 2 - Visualization¶

For this part, if you are not doing this on PythonAnywhere and struggling with the basemap install, be sure to install the newest v1.10 version. It can be a bit tricky to install basemap but give it a try- there are many tutorials and resources on how to do it.

First off, let's try to make a map of the UK.

# For the explanation of the parameters, see https://matplotlib.org/basemap/api/basemap_api.html.

# We need to choose a projection. Here we will use 'aea' that means Albers Equal Area Projection,
# a projection, in which we preserve the area surface but not angels (not every right angle will
# be seen as a right).
from mpl_toolkits.basemap import Basemap

m = Basemap(
    llcrnrlon=-10.5, llcrnrlat=49.5, urcrnrlon=3.5, urcrnrlat=59.5,
    resolution='i', projection='aea', lon_0=-4.36, lat_0=54.7
)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(-40, 61, 1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-20, 21, 1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

plt.title("Great Britain (in Albers Equal Area Projection)")
plt.show()

Dig Deeper: Now try to use different projections and see how the map changes. Also try out different resolutions. Above we used i (intermediate). Try l (low) or h (high).

Cool. So tt worked with the UK. Let's try to make a map of London.

m = Basemap(
    llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
    resolution='i', projection='aea', lon_0=-0.09, lat_0=51.43
)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

plt.title("Londons (in Albers Equal Area Projection)")
plt.show()

Oh naoosss! It's all white space. Where is London?

We will have to add some administration borders. We can find the information here:
https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london

# Download the files and unzip
!rm -rf statistical-gis-boundaries-london*
!wget 'https://files.datapress.com/london/dataset/statistical-gis-boundary-files-london/2016-10-03T13:52:28/statistical-gis-boundaries-london.zip' 
!unzip -q statistical-gis-boundaries-london.zip

--2018-04-25 22:16:58--  https://files.datapress.com/london/dataset/statistical-gis-boundary-files-london/2016-10-03T13:52:28/statistical-gis-boundaries-london.zip
Resolving files.datapress.com (files.datapress.com)... 52.85.131.43, 52.85.131.12, 52.85.131.246, ...
Connecting to files.datapress.com (files.datapress.com)|52.85.131.43|:443... connected.
HTTP request sent, awaiting response... 200 OK
Length: 28666674 (27M) [application/zip]
Saving to: ‘statistical-gis-boundaries-london.zip’

100%[======================================>] 28,666,674  25.9MB/s   in 1.1s   

2018-04-25 22:17:00 (25.9 MB/s) - ‘statistical-gis-boundaries-london.zip’ saved [28666674/28666674]

Let's see what we've got. We can load the shapefiles with the administration borders using geopandas. It has very similar syntax to the normal pandas. For more, see this.

import geopandas as gpd

# Let's use Borough level data
borders = gpd.read_file('./statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shp')

# Display the first few rows
borders.head()

What we need is the geometry part, i.e. the corners of the polygons that describe the borders of the MSOA regions.

# let's see which kind of projection we are using
borders.crs

{'proj': 'tmerc',
 'lat_0': 49,
 'lon_0': -2,
 'k': 0.999601272,
 'x_0': 400000,
 'y_0': -100000,
 'datum': 'OSGB36',
 'units': 'm',
 'no_defs': True}

We are using tmerc, ie. Transverse Mercator Projection
(see https://matplotlib.org/basemap/users/tmerc.html)

It will be more useful to transform it into the lattitude/longitude coordinates (EPSG:4326)
(see https://en.wikipedia.org/wiki/World_Geodetic_System)

borders = borders.to_crs(epsg='4326')
borders.head()

# Perfect, it works!
# Let's not forget to save the transformed shapes.
borders.to_file("./borders_epsg4326.shp")

# Now we can finally make our map
m = Basemap(
    llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
    resolution='i', projection='aea', lon_0=-0.09, lat_0=51.43
)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

# Load the shapefiles again. There must have in lat/lon coordinates.
# Note, that we should not specify shp extension below - it will be added automatically
m.readshapefile('./borders_epsg4326', 'boroughs')

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

plt.title("Londons with boroughs borders")
plt.show()

Let's add now the stations and geo-caching targets that we want to visit.

m = Basemap(
        llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
        resolution='h', projection='aea', lon_0=-0.09, lat_0=51.43)

fig, ax = plt.subplots(figsize=(15,15), dpi=80)

m.drawcoastlines()
m.drawcountries()

m.readshapefile('./borders_epsg4326', 'districts')

m.fillcontinents(color='white',lake_color='aqua')

# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')

# add points
for i, row in stations.iterrows():
    x, y = m(row.x, row.y)
    m.plot(x, y, 'bo', markersize=8)

for i, row in targets.iterrows():
    x, y = m(row.x, row.y)
    m.plot(x, y, 'ro', markersize=8)
    
plt.title("London - stations and targets")
plt.show()

Dig Deeper¶

Play around with the maps- add a legend, label the dots, show only the dots that are close to stations. If you dug deeper previously and solved the TSP problem, try connecting the dots on the map- that would be a cool visualization of the TSP solution.

Lastly, here are some really cool map visualizations. Read, be inspire, and make some cool maps!

	coordinate-x	coordinate-y	station_name
0	-0.280251	51.502750	Acton Town Station
1	-0.075614	51.514272	Aldgate Station
2	-0.072287	51.515233	Aldgate East Station
3	-0.299487	51.540695	Alperton Station
4	-0.607479	51.674150	Amersham Station

	coordinate-x	coordinate-y	station_name
13	0.080863	51.539451	Barking Station
14	0.088511	51.585786	Barkingside Station
17	0.127400	51.540289	Becontree Station
30	0.046744	51.626517	Buckhurst Hill Station
36	0.008171	51.513837	Canning Town Station

	x	y	name
296	-0.074531	51.510567	Tower Gateway Station
297	-0.296859	51.552330	Wembley Central Station
298	-0.020148	51.506638	West India Quay Station
299	-0.025792	51.509432	Westferry Station
300	-0.244289	51.532191	Willesden Junction Station

	x	y	name
11	-0.088916	51.513302	Bank Station
132	-0.088943	51.513348	Monument Station
288	-0.226305	51.505561	Shepherd's Bush Central Station
289	-0.226305	51.505561	Shepherd's Bush Hammersmith & City Station

	NAME	GSS_CODE	HECTARES	NONLD_AREA	ONS_INNER	geometry
0	Kingston upon Thames	E09000021	3726.117	0.000	F	POLYGON ((516401.6 160201.8, 516407.3 160210.5...
1	Croydon	E09000008	8649.441	0.000	F	POLYGON ((535009.2 159504.7, 535005.5 159502, ...
2	Bromley	E09000006	15013.487	0.000	F	POLYGON ((540373.6 157530.4, 540361.2 157551.9...
3	Hounslow	E09000018	5658.541	60.755	F	POLYGON ((521975.8 178100, 521967.7 178096.8, ...
4	Ealing	E09000009	5554.428	0.000	F	POLYGON ((510253.5 182881.6, 510249.9 182886, ...

	x	y
0	-0.824114	51.444605
1	-0.331782	51.904347
2	-0.068055	51.192047
3	-0.908544	51.672528
4	0.647720	51.637752