Your friend Kelly is coming to London on vacation and you want to help her plan out her journey- which spots to see and what routes to take to do it in the most efficient way.
The data set that we are going to use is the London tube/overgound locations. To save time, we have already pre-downloaded and cleaned the data set for you. If you are interested in doing this yourself, you can register at the TFL website, get an api key, download the file, and clean it by removing all \t
, \n
and converting dos line endings to unix.
%pylab inline
import pandas as pd
stations = pd.read_hdf('./datasets/london_stations.h5')
# Let's examine the dataset
stations.head()
You can do some pretty cool stuff with a pandas dataframe, including filtering rows using conditionals like so:
# Only look at the rows of the stations dataframe where the station_name column is 'Aldgate Station'
stations[stations.station_name == 'Aldgate Station']
# Only look at the rows of the stations dataframe where the coordinate-x column is greater than 0
stations[stations['coordinate-x'] > 0].head()
Hold on...
Notice how we use stations.station_name
in the first filter but stations['coordinate-x'] > 0
in the second filter?
That is because pandas allows you to access column names both as dictionary keys and as attributes!
Here's a trick question- why doesn't this work?
stations[stations.coordinate-x > 0]
# Let's rename the column names so that there is no `-` in them
stations = stations.rename(columns={'coordinate-x': 'x', 'coordinate-y': 'y', 'station_name': 'name'})
# Sanity check. Also, here's a new function called tail(). Guess what it does?
stations[stations.y > 51.5].tail()
Let's take a quick break from Pandas, and talk about where you are going to tell your friend to go.
We could just tell him/her to visit popular landmarks in London, but what's the fun in that? Geo-caching is all the rage these days, and you want to introduce your friend to it. We are going to generate a list of longitude/latitude coordinates for your friend to visit.
from collections import namedtuple
from random import uniform
# Here is a cool python feature.
# (We could have also used pygeoif.geometry.Point, or just had an un-named tuple/list)
Coordinates = namedtuple('Coord', ['x', 'y'])
# Our London playground is going to be between -1° to 1° longitude and 52° to 51° latitude
# We are going to generate 200 random locations
targets_list = [
Coordinates(x=uniform(-1, 1), y=uniform(51,52))
for ii in range(200)
]
# Let's list first 5 of them
targets_list[:5]
# We can transform our list of tuples into pandas Series
targets = pd.Series(targets_list)
targets.head()
# Remember, that you can also access Series and Dataframes using their index
targets.iloc[0]
# Let's separate out x and y into two different columns.
# The apply function maps a function over a pandas Series or Dataframe.
targets = pd.DataFrame({
'x': targets.apply(lambda p: p.x),
'y': targets.apply(lambda p: p.y)
})
targets.head()
# Let's make sure that our data is squeeky clean
stations.isnull().any()
Let's assume the world is flat for the moment...
We will use a quick nearest-neighbor lookup to determine the closest location. See this for more details about the KD-tree algorithm we are using.
from scipy.spatial import KDTree
stationsTree = KDTree(stations[['x', 'y']])
targetsTree = KDTree(targets[['x', 'y']])
What is a good way to check how this works?
Well if you have a small enough neighborhood radius, no stations should be neighbors with anybody but themselves Thus, we should expact that the number of neighbours points to be equal to the number of stations.
print(stationsTree.count_neighbors(stationsTree, r=0.001))
print(len(stations))
Huh? Not equal?
Maybe the radius was too large?
Let's do some quick checks:
0.001° of Latitude correspond to about 111 meters, and 0.001° of Longitude (for London) correspond to about 71.6 meters.
It is certainly possible, that London has some tube stations less than 100 meters away.
Let's try a 10 times smaller radius (so 11 meters North/East or 7 meters West/East).
print(stationsTree.count_neighbors(stationsTree, r=0.0001))
# Huh? How about an even smaller radius?
print(stationsTree.count_neighbors(stationsTree, r=0.00001))
What does it mean? Are there stations that are less than 11 meters apart?
Let's list them:
stationsTree.query_pairs(r=0.001)
stations.iloc[[11, 132, 288, 289]]
Aha! It turns out that Shepherd's Bush Central Station
is the same station as Shepherd's Bush Hammersmith & City Station
.
Definitely a reminder that we need to be careful and should take time to explore our data before we do any serious analysis.
Let's say we are lazy and don't want to walk far distances. Let's try to find the GeoCaching targets that are close to subway stations.
What targets are within 1000m of a subway station? Let's assume that every 0.0001° is 7m.
radius = 1000.0 / 7 * 0.0001
for idx, target in targets.iterrows():
close_by_stations_idx = stationsTree.query_ball_point([target.x, target.y], r=radius)
if close_by_stations_idx:
stations_name = stations.iloc[close_by_stations_idx]['name'].tolist()
# string formatting!
print('Target {} at ({:.2f}°, {:.2f}°) is close to stations {}'.format(
idx, target.x, target.y, stations_name) )
Here are a couple suggestions to explore further:
Access the google maps api to get transportation times between different subway stations. Then try to figure out what is the quickest way to say visit all 10 targets!
There are a couple ways to do this, and it also depends on your requirements (e.g.: do you want to finish where you started). Take a look at the different algorithms for the shortest path problem, and the travelling salesman problem.
Try to implement a particular algorithm yourself! (maybe learn to set up the algorithm in a separate file, and import it here)
We calculated the shortest distances assuming that the world was flat- which is actually a pretty accurate approximation when considering distances within London. How about calculating the distance based on a globe-shaped geometry. It is pretty easy to use a different distance function, but then you wouldn't be able to use the KD-trees algorithm. If you wanted to keep using that, you want to turn all coordinates into a vector to represent where they are located in space. Then the standard cartesian distance function would still work and so you can keep using KD-trees. One library for this is proj4. Or for this use case, you could just write a quick function to set that up yourself as well.
Once you start getting into graph theory problems like TSP, it could be cool to try setting up a Neo4j database and visualize data that way! (definitely out of the scope of this tutorial though)
First off, let's try to make a map of the UK.
# For the explanation of the parameters, see https://matplotlib.org/basemap/api/basemap_api.html.
# We need to choose a projection. Here we will use 'aea' that means Albers Equal Area Projection,
# a projection, in which we preserve the area surface but not angels (not every right angle will
# be seen as a right).
from mpl_toolkits.basemap import Basemap
m = Basemap(
llcrnrlon=-10.5, llcrnrlat=49.5, urcrnrlon=3.5, urcrnrlat=59.5,
resolution='i', projection='aea', lon_0=-4.36, lat_0=54.7
)
fig, ax = plt.subplots(figsize=(15,15), dpi=80)
m.drawcoastlines()
m.drawcountries()
m.fillcontinents(color='white',lake_color='aqua')
# draw parallels and meridians.
m.drawparallels(np.arange(-40, 61, 1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-20, 21, 1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')
plt.title("Great Britain (in Albers Equal Area Projection)")
plt.show()
Dig Deeper: Now try to use different projections and see how the map changes. Also try out different resolutions. Above we used i
(intermediate). Try l
(low) or h
(high).
Cool. So tt worked with the UK. Let's try to make a map of London.
m = Basemap(
llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
resolution='i', projection='aea', lon_0=-0.09, lat_0=51.43
)
fig, ax = plt.subplots(figsize=(15,15), dpi=80)
m.drawcoastlines()
m.drawcountries()
m.fillcontinents(color='white',lake_color='aqua')
# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')
plt.title("Londons (in Albers Equal Area Projection)")
plt.show()
Oh naoosss! It's all white space. Where is London?
We will have to add some administration borders. We can find the information here:
https://data.london.gov.uk/dataset/statistical-gis-boundary-files-london
# Download the files and unzip
!rm -rf statistical-gis-boundaries-london*
!wget 'https://files.datapress.com/london/dataset/statistical-gis-boundary-files-london/2016-10-03T13:52:28/statistical-gis-boundaries-london.zip'
!unzip -q statistical-gis-boundaries-london.zip
Let's see what we've got. We can load the shapefiles with the administration borders using geopandas. It has very similar syntax to the normal pandas. For more, see this.
import geopandas as gpd
# Let's use Borough level data
borders = gpd.read_file('./statistical-gis-boundaries-london/ESRI/London_Borough_Excluding_MHW.shp')
# Display the first few rows
borders.head()
What we need is the geometry part, i.e. the corners of the polygons that describe the borders of the MSOA regions.
# let's see which kind of projection we are using
borders.crs
We are using tmerc
, ie. Transverse Mercator Projection
(see https://matplotlib.org/basemap/users/tmerc.html)
It will be more useful to transform it into the lattitude/longitude coordinates (EPSG:4326)
(see https://en.wikipedia.org/wiki/World_Geodetic_System)
borders = borders.to_crs(epsg='4326')
borders.head()
# Perfect, it works!
# Let's not forget to save the transformed shapes.
borders.to_file("./borders_epsg4326.shp")
# Now we can finally make our map
m = Basemap(
llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
resolution='i', projection='aea', lon_0=-0.09, lat_0=51.43
)
fig, ax = plt.subplots(figsize=(15,15), dpi=80)
m.drawcoastlines()
m.drawcountries()
# Load the shapefiles again. There must have in lat/lon coordinates.
# Note, that we should not specify shp extension below - it will be added automatically
m.readshapefile('./borders_epsg4326', 'boroughs')
m.fillcontinents(color='white',lake_color='aqua')
# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')
plt.title("Londons with boroughs borders")
plt.show()
Let's add now the stations and geo-caching targets that we want to visit.
m = Basemap(
llcrnrlon=-0.59, llcrnrlat=51.23, urcrnrlon=0.41, urcrnrlat=51.73,
resolution='h', projection='aea', lon_0=-0.09, lat_0=51.43)
fig, ax = plt.subplots(figsize=(15,15), dpi=80)
m.drawcoastlines()
m.drawcountries()
m.readshapefile('./borders_epsg4326', 'districts')
m.fillcontinents(color='white',lake_color='aqua')
# draw parallels and meridians.
m.drawparallels(np.arange(50, 53, 0.1), labels=[False,True,True,False])
m.drawmeridians(np.arange(-1, 2, 0.1), labels=[True,False,False,True])
m.drawmapboundary(fill_color='aqua')
# add points
for i, row in stations.iterrows():
x, y = m(row.x, row.y)
m.plot(x, y, 'bo', markersize=8)
for i, row in targets.iterrows():
x, y = m(row.x, row.y)
m.plot(x, y, 'ro', markersize=8)
plt.title("London - stations and targets")
plt.show()
Play around with the maps- add a legend, label the dots, show only the dots that are close to stations. If you dug deeper previously and solved the TSP problem, try connecting the dots on the map- that would be a cool visualization of the TSP solution.
Lastly, here are some really cool map visualizations. Read, be inspire, and make some cool maps!