We Need to Talk About Geo…

Over the last couple of weeks, I’ve spent a bit of time playing with geodata. Maps are a really powerful visualisation technique, but how do you go about creating them?

One way is to use a bespoke GIS (Geographic Information System) application, such as the open source, cross-platform desktop application QGIS, which lets you “create, edit, visualise, analyse and publish geospatial information”.

Another is to take the “small pieces, loosely joined” approach of grabbing the functionality you need from different programming packages and wiring them together.

This “wiring together” takes two forms. In the first, standardised file formats let us open, save and transfer data between applications. In the second, programming languages often have libraries that become the de facto way of working with particular sorts of data, and that are then used within yet other packages. In Python, pandas is widely used for manipulating tabular data, and shapely is widely used for representing geospatial data (point locations, lines, or closed shapes). These are combined in geopandas, and we then see tools and solutions built upon that format as the ecosystem builds out further. In the R world, the Tidyverse provides a range of packages designed to work together, and again, an ecosystem of interoperable tools and workflows results.
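As a rough illustration of the “standardised file formats” route, here is a minimal sketch of a GeoJSON document (the format used by the election data file later in this post), built and read back using nothing more than Python’s json module. The place name, co-ordinates and majority are all invented for the example.

```python
import json

# A made-up feature: a square "constituency" with some attribute data attached
feature = {
    "type": "Feature",
    "geometry": {
        "type": "Polygon",
        # One ring of [longitude, latitude] pairs; first and last points match to close the ring
        "coordinates": [[[-1.0, 50.0], [-0.9, 50.0], [-0.9, 50.1], [-1.0, 50.1], [-1.0, 50.0]]],
    },
    "properties": {"name": "Exampleton", "majority": 1234},
}
collection = {"type": "FeatureCollection", "features": [feature]}

# Serialise to text (what one tool would save) and parse it back (what another would open)
geojson_text = json.dumps(collection)
reloaded = json.loads(geojson_text)

first = reloaded["features"][0]
print(first["properties"]["name"], first["geometry"]["type"])
```

The shape and the data about it travel together in one record, which is what lets a package like geopandas load such a file straight into a table with a geometry column.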

Having robust building blocks allows higher level tools, designed to perform specific functions, to be built on top of them. Working through some simple self-directed (and self-created) problems (for which read: things I wanted to try to do, or build, or wondered how to do), it strikes me once again that quite ambitious-sounding tasks can be completed quite straightforwardly if you can imagine a way of decomposing a problem into separate, discrete parts, looking for ways of solving those parts, and then joining the pieces back together again.

For example, here’s a map of the UK showing Westminster constituencies coloured by the party of the MP as voted for at the last general election:

How would we go about creating such a map?

The answer is quite straightforward if we make use of a geodataset that combines shape information (the boundary lines that make up each constituency, suitably represented) with information about the election result. Data such as that made available by Alasdair Rae, for example.

First things first, we need to obtain the data:

#Define the URL that points to the data file
electiondata_url = 'http://ajrae.staff.shef.ac.uk/wpc/geojson/uk_wpc_2018_with_data.geojson'

#Import the geopandas package for working with tabular and spatial data combined
import geopandas

#Enable inline plotting in Jupyter notebooks
#(Some notebook installations automatically enable this)
%matplotlib inline

#Load the data from the URL
gdf = geopandas.read_file(electiondata_url)

#Optionally preview the first few rows of the data
gdf.head()

That wasn’t too hard to understand, or demonstrate to students, was it?

  • make sure the environment is set up correctly for plotting things
  • import a package that helps you work with a particular sort of data
  • specify the location of a data file
  • automatically download the data into a form you can work with
  • preview the data.

So what’s next?

To generate a choropleth map that shows the majority in a particular constituency, we just need to check the dataframe for the column name that contains the majority values, and then plot the map:

gdf.plot(column='majority')

To control the size of the rendered map, I need to do a little bit more work and create the matplotlib figure and axes myself, which also gives me an axes object I can tidy up afterwards (for a simple resize, the geopandas .plot() method also accepts a figsize argument):

#Create the figure and axes at the size we want the map drawn
from matplotlib import pyplot as plt
fig, ax = plt.subplots(1, figsize=(12, 12))

ax = gdf.plot(column='majority', ax=ax)

#Switch off the bounding box drawn round the map so it looks a bit tidier
ax.axis('off');

To plot the map coloured by party, I just need to change the column used as the basis for colouring the map.

fig, ax = plt.subplots(1, figsize=(12, 12))

ax = gdf.plot(column='Party' , ax=ax)
ax.axis('off');

You should be able to see how the code is more or less exactly the same as the previous bit of code except that I don’t need to import the pyplot package (it’s already loaded) and all I need to change is the column name.

The colours are wrong though: they’re set by default rather than relating to colours we might naturally associate with the parties.

So this is the next problem solving step — how do I associate a colour with a party name?

At the moment this is a bit fiddly (again, geopandas could make this easier), but once I have a recipe I should be able to reuse it to colour other columns using other column-value-to-colour mappings.

from matplotlib.colors import ListedColormap

#Set up color maps by party
partycolors = {'Conservative':'blue',
               'Labour':'red',
               'Independent':'black',
               'Liberal Democrat':'orange',
               'Labour/Co-operative':'red',
               'Green':'green' ,
               'Speaker':'black',
               'DUP':'pink',
               'Sinn Féin':'darkgreen',
               'Scottish National Party':'yellow',
               'Plaid Cymru':'brown'}

#The dataframe seems to assign items to categories based on the selected column sort order
#We can define a color map with a similar sorting
#(this assumes the dictionary keys exactly match the set of values found in the column)
colors = [partycolors[k] for k in sorted(partycolors.keys())]

fig, ax = plt.subplots(1, figsize=(12, 12))

ax = gdf.plot(column='Party', cmap = ListedColormap(colors), ax=ax)
ax.axis('off');

In this case, I load in another helpful package, define a set of party-name-to-colour mappings, use that to generate a list of colour names in the correct order, and then build and use a cmap object within the plot function.

If I wanted to do a similar thing based on another column, all I would have to do is change the partycolors = {} definition and the column name in the plot command: the rest of the code would be reusable.
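One fragile point in the recipe is that the colour list is sorted by the dictionary keys, so it only lines up if the dictionary contains exactly the values that appear in the column. Here is a minimal sketch of a helper that builds the list from the values actually present instead. The function name and the fallback colour are my own inventions, and it relies on the same assumption as the code comment above, that categories are assigned in sorted order.

```python
def colours_in_category_order(values, colour_lookup, default='lightgrey'):
    """List one colour per distinct value, in sorted order of the values actually present."""
    categories = sorted(set(values))
    # Fall back to a neutral colour for any value missing from the lookup
    return [colour_lookup.get(category, default) for category in categories]

# A cut-down lookup and some made-up column values for illustration
partycolors = {'Conservative': 'blue', 'Labour': 'red',
               'Green': 'green', 'Independent': 'black'}
party_column = ['Labour', 'Conservative', 'Labour', 'Plaid Cymru']

# Sorting the dictionary keys gives four colours for only three categories
naive = [partycolors[k] for k in sorted(partycolors.keys())]
# Sorting the values in the column gives one colour per category
aligned = colours_in_category_order(party_column, partycolors)

print(naive)
print(aligned)
```

In the map code this would be used along the lines of ListedColormap(colours_in_category_order(gdf['Party'].dropna(), partycolors)).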

When you have a piece of code that works, you can wrap it in a function and reuse it, or share it with other people. For example, here’s how I use a function I created for displaying a choropleth map of a particular deprivation index measure for a local authority district and its neighbours (I’ll give the function code later on in the post):

plotNeighbours(gdf,
               'Portsmouth',
               'Education, Skills and Training - Rank of average rank')

Using pandas and geopandas we can easily add data from another source, such as an Excel spreadsheet file, to a geopandas dataset. For example, let’s download some local authority boundary files from the ONS and some deprivation data:

import geopandas

#From the downloads area of the page, grab the link for the shapefile download
url='https://opendata.arcgis.com/datasets/7ff28788e1e640de8150fb8f35703f6e_2.zip?outSR=%7B%22wkid%22%3A27700%2C%22latestWkid%22%3A27700%7D'
gdf = geopandas.read_file(url)

#Import pandas package
import pandas as pd

#https://www.gov.uk/government/statistics/english-indices-of-deprivation-2015
#File 10: local authority district summaries
data_url = 'https://assets.publishing.service.gov.uk/government/uploads/system/uploads/attachment_data/file/464464/File_10_ID2015_Local_Authority_District_Summaries.xlsx'

#Download and read in the deprivation data Excel file
#(sheet_name=None loads every sheet, returning a dictionary of dataframes keyed by sheet name)
df = pd.read_excel(data_url, sheet_name=None)

#Preview the names of the sheets in the data loaded from the Excel file
df.keys()

We can merge the two data files based on a common column, the local authority district codes:

#Merge in data
gdf = pd.merge(gdf, df['Education'],
               how='inner',  #The type of join (what happens if data is in one dataset and not the other)
               left_on='lad16cd', #Column we're merging on in left dataframe
               right_on='Local Authority District code (2013)'#Column we're merging on in right dataframe
              )
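One thing to be aware of with how='inner' is that rows whose code appears in only one of the two datasets are dropped without any warning. The deprivation data here is for England, and its codes are labelled 2013 whereas the boundary column name suggests 2016, so some mismatch would not be surprising. A minimal sketch of a check you could run before merging is shown below; the helper name and the codes are made up for illustration.

```python
def compare_keys(left_keys, right_keys):
    """Report which join keys appear in both inputs, only the left, or only the right."""
    left, right = set(left_keys), set(right_keys)
    return {
        'both': sorted(left & right),
        'left_only': sorted(left - right),
        'right_only': sorted(right - left),
    }

# Placeholder district codes standing in for the two columns being merged on
boundary_codes = ['E001', 'E002', 'W001']
deprivation_codes = ['E001', 'E002', 'E003']

report = compare_keys(boundary_codes, deprivation_codes)
print(report['left_only'])   # boundaries with no deprivation data
print(report['right_only'])  # deprivation rows with no boundary
```

Because the helper accepts any iterable, the real columns could be passed in directly, for example compare_keys(gdf['lad16cd'], df['Education']['Local Authority District code (2013)']).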

And plot a choropleth map of one of the deprivation indicators:

ax = gdf.plot(column='Education, Skills and Training - Average rank')
ax.axis('off');

Just by the by, plotting interactive Google-style maps is just as easy as plotting static ones. The folium package helps with that, for example:

import folium

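#folium works in latitude/longitude (WGS84, EPSG:4326) co-ordinates,
#so reproject the boundaries in case they are not already in that form
gdf_latlon = gdf.to_crs(epsg=4326)

m = folium.Map(max_zoom=9, location=[54.5, -0.8])
#Note: .head() passes only the first five areas; pass the whole dataframe to map them all
folium.Choropleth(gdf_latlon.head(), key_on='feature.properties.lad16cd',
                  data=df['Education'],
                  columns=['Local Authority District code (2013)',
                           'Education, Skills and Training - Rank of average rank'],
                  fill_color='YlOrBr').add_to(m)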
m

I also created some magic some time ago to try to make folium maps even easier to create: ipython_magic_folium.

To plot a choropleth of a specified local authority and its neighbours, here’s the code behind the function I showed previously:

#Via https://gis.stackexchange.com/a/300262/119781

def plotNeighbours(gdf, region='Milton Keynes',
                   indicator='Education, Skills and Training - Rank of average rank',
                   cmap='OrRd'):
    ''' Plot choropleth for an indicator relative to a specified region and its neighbours. '''

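    #Get the boundary shape of the target region
    targetBoundary = gdf[gdf['lad16nm']==region]['geometry'].values[0]
    #Flag the areas whose boundary touches the target boundary, along with the target area itself
    #(a shape does not "touch" itself, so it has to be included explicitly)
    neighbours = gdf['geometry'].touches(targetBoundary) | (gdf['lad16nm']==region)

    #Show the data for the selected area and its neighbours
    #(display() is available by default in Jupyter notebooks)
    display(gdf[neighbours][['lad16nm',indicator]].set_index('lad16nm'))

    #Generate choropleth
    ax = gdf[neighbours].plot(column=indicator, cmap=cmap)
    ax.axis('off')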

One thing this bit of code does is look for boundaries that touch on the specified boundary. By representing the boundaries as geographical objects, we can use geopandas to manipulate them in a spatially meaningful way.
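To see what “touches” means in isolation, here is a minimal sketch using shapely directly on four made-up square “districts” (they are not real boundaries). It also shows why the target area has to be added back in explicitly: a shape overlaps itself rather than touching itself.

```python
from shapely.geometry import box

# Four hypothetical square districts laid out on a grid: box(minx, miny, maxx, maxy)
districts = {
    'A': box(0, 0, 1, 1),
    'B': box(1, 0, 2, 1),   # shares an edge with A
    'C': box(1, 1, 2, 2),   # shares only a corner with A
    'D': box(3, 3, 4, 4),   # nowhere near A
}

def neighbours_of(name, shapes):
    """Return the sorted names of shapes that touch the named shape, plus the shape itself."""
    target = shapes[name]
    return sorted(key for key, shape in shapes.items()
                  if key == name or shape.touches(target))

result = neighbours_of('A', districts)
print(result)
```

Note that sharing a single corner point counts as touching, which may or may not be what you want from a “neighbours” map.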

If you want to try a notebook containing some of these demos, you can launch one on MyBinder here.

So what other ways can we manipulate geographical objects? In the notebook Police API Demo.ipynb I show how we can use the osmnx package to find a walking route between two pubs, convert that route (which is a geographical line object) to a buffered area around the route (for example defining an area that lies within 100m of the route) and then make a call to the Police API to look up crimes in that area in a specified period.
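The last step of that recipe, turning an area into a Police API request, can be sketched with the standard library alone. This is a sketch under assumptions rather than the notebook’s actual code: the function name is mine, and the endpoint and the poly=lat,lng:lat,lng format reflect my understanding of the Police API documentation, so check them against the docs before relying on them. One thing to watch is that shapely-style co-ordinates are (longitude, latitude), whereas the request wants latitude first.

```python
from urllib.parse import urlencode

def police_poly_url(lonlat_points, month):
    """Build a street-level crime query URL for a polygon given as (lon, lat) pairs."""
    # Swap each pair into lat,lng order and join the points with colons
    poly = ':'.join(f'{lat:.5f},{lon:.5f}' for lon, lat in lonlat_points)
    # Leave the colons and commas unescaped so the polygon stays readable
    query = urlencode({'poly': poly, 'date': month}, safe=':,')
    return 'https://data.police.uk/api/crimes-street/all-crime?' + query

# A made-up triangle, in (longitude, latitude) order
triangle = [(-1.0, 50.0), (-0.9, 50.0), (-0.9, 50.1)]
url = police_poly_url(triangle, '2019-01')
print(url)
```

With a buffered route, the points would come from the outline of the buffer polygon, converted back to latitude and longitude if the buffering was done in a metre-based projection.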

The same notebook also shows how to create a Voronoi diagram based on a series of points that lie within a specified region; specifically, the points were registered crime location points within a particular neighbourhood area, and the Voronoi diagram then automatically creates boundaried areas around those points so they can be coloured as in a choropleth map.

The ‘crimes within an area along a route’ lookup and the Voronoi mapping, which are both incredibly powerful ideas and incredibly powerful techniques, can be achieved with only a few lines of code. And once the code recipe has been discovered, it can often be turned into a function and called with a single line of code.

One of the issues with things like geopandas is that the dataframe resides in computer memory. Shapefiles can be quite large, so loading them may have an adverse effect on your computer’s performance. But tools such as SpatiaLite allow you to commit large geodata files to a simple file-based SQLite database (no installation or running servers required) and do geo operations on it directly, such as looking for points within a particular boundaried area.

At the moment, SpatiaLite docs leave something to be desired, and finding handy recipes to reuse or work from can be challenging, but there are some out there. And I’ve also started to come up with my own demos. For example, check out the notebook LSOA Sketches.ipynb, which includes examples of how to look up an LSOA code from latitude and longitude co-ordinates. The notebook also shows how to download a database of postcodes into the same database as the shapefiles and then use postcode centroid locations to find which LSOA boundary contains the (centroid) location of a specified postcode.
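To give a feel for what a “which boundary contains this point?” lookup involves, here is a pure Python sketch of the classic ray casting test. This is not SpatiaLite code, and a real spatial database will handle holes, edge cases and indexing far better; the area codes and square boundaries are invented for the example.

```python
def point_in_polygon(x, y, ring):
    """Ray casting: count how many polygon edges a ray heading right from (x, y) crosses."""
    inside = False
    n = len(ring)
    for i in range(n):
        x1, y1 = ring[i]
        x2, y2 = ring[(i + 1) % n]
        # Does this edge straddle the horizontal line through the point?
        if (y1 > y) != (y2 > y):
            # x co-ordinate at which the edge crosses that line
            x_cross = x1 + (y - y1) * (x2 - x1) / (y2 - y1)
            if x < x_cross:
                # Each crossing to the right of the point flips inside/outside
                inside = not inside
    return inside

def find_area(x, y, areas):
    """Return the code of the first area whose boundary contains the point, or None."""
    for code, ring in areas.items():
        if point_in_polygon(x, y, ring):
            return code
    return None

# Two made-up adjacent square areas, each a list of (x, y) corner points
areas = {
    'AREA_1': [(0, 0), (2, 0), (2, 2), (0, 2)],
    'AREA_2': [(2, 0), (4, 0), (4, 2), (2, 2)],
}

print(find_area(1, 1, areas))
print(find_area(3, 1, areas))
print(find_area(5, 5, areas))
```

Checking every boundary in turn like this gets slow with tens of thousands of areas, which is where a database with a spatial index earns its keep.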

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...