Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine

As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?

Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.

libdems million jobs

If we view source on the page, we can see where the map – and maybe the data being used to power it, comes from…

libdems jobs view source

Ah ha – it’s an embedded map from a Google Fusion Table…

https://www.google.com/fusiontables/embedviz?q=select+col0+from+1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ&viz=MAP&h=false&lat=52.5656923458786&lng=-1.0353351498047232&t=1&z=7&l=col0&y=2&tmplt=3

We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and poppiing it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:

https://www.google.com/fusiontables/DataSource?docid=1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ

Lib dems jobs Fusion tables

The description data is curtailed, but we can see the full description on the card view:

Lib dems fusion tables card

Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.

lib dems jobs data in OpenRefine

This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.

Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:

OpenRefine transpose cell rows to cols2

And as if by magic, we have recreated a workable table:-)

Openrefine rows transposed to cols

If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?

openrefinelibdem jobs text facet

If we peer a bit more closely, we see that some of the numbers relating to job site locations as referred to in the description don’t seem to tally with the number of markers? So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?

counting job sites

What this “example” of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself, (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as we all question that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?

Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:

libdems - have given text

Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:

import re
tmp=value
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
if value==tmp: tmp=''
tmp = tmp.replace(',','')
return tmp

libdems have given amount

Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of thes column creation.

If we cast these values to a number:

openrefine convert given to number

we can then use a numeric facet to help us explore the amounts.

libdems value numeric facet

In this case, we notice that there weren’t that many distinct factors containing the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:

  • Summary: Lib Dems in Government have secured a £73,000 grant for …
  • Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
  • Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
  • Summary: Lib Dems in Government have given £982,000 to …

Here’s an improved regular expression for parsing out some more of these amounts:

import re
tmp=value
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have secured a ([0-9,\.]*).*',r'\1',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)

if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

So now we can start to identify some of the bigger grants…

libdems jobs big amounts

More to add? eg around:
...have secured a £150,000 grant...
Summary: Lib Dems have given a £1,571,000 grant...
Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
– Here’s another for a ‘possible spend’ column? Summary: Lib Dems in Government have allocated £300,000 to...

Note: once you start poking around at these descriptions, you find a wealth of things like: “Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone , helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes.” Leading to ask the question: how many of the “one million jobs” arise from improvements to road junctions…?

how many jobs from road junction improvements?

In order to address this question, we might to start have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:

import re
tmp=value
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

Lib dems jobs created

If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:

  • £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
  • £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant

But that’s for another day…

If you want to play with the data yourself, you can find it here.

Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language

There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structure queries languages… In the post Asking Questions of Data – Some Simple One-Liners (or Asking Questions of Data – Garment Factories Data Expedition), you can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki [no longer there:-(] that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.

To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).

IW Poll spreadsheet

The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog that was shaped and cleaned using OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.

To query the data, I’ve popped up a simple query form on Scraperwikia Github site: Google Spreadsheet Explorer

Google spreadsheet explorer

To use the explorer, you need to:

  1. provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
  2. preview the table headings;
  3. construct a query using the column letters;
  4. select the output format;
  5. run the query.

So what sort of questions might we want to ask of the data? Let’s build some up.

We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F

SImple query

We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC

Results in order

Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').

Papers spoilt included

How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC

Postal Turnout

Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates (which meant people voting were each allowed to cast up to an including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the number for the other electoral divisions though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results ( FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'

Good vote turnout

Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.

How about the three electoral divisions with the lowest turn out? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…

Worst 3 by turnout

The actual turn out (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A

Summing rows in a group

In terms of popularity, who were the top 5 candidates in terms of people receiving the largest number of votes? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5

Top 5 by votes cast

How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5

TOp 5 as percentage on roll

Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC

VOtes by party

How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC

Average votes per candidate

To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.

Simple Map Making With Google Fusion Tables

A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…

Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:

data in a fusion table

Now let’s map it:-)

First of all we need to tell the application which column contains the data we want to geocode – that is, the addrerss we want Fusion Tables to find the latitude and longitude co-ordinates for…

tweak the column

Then we say we want the column to be recognised as a column type:

change name make location

Computer says yes, highlighting the location type cells with a yellow background:

fusion table.. yellow...

As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as apart of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)

map tab...

Once the points have been geocoded, they’re placed onto a map:

mapped

We can now publish the map in preparation for sharing it with the world…

publish map

We need to change the visibility of the map to something folk can see!

privacy and link

Public on the web, or just via a shared link – your choice:

make seeable

Here’s my map:-)

The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.

A Quick Peek at Some Charities Data…

As part of the Maximising Your Data Impact launch event of the Civil Society Data Network (reviewed in part by the Guardian here: Open data and the charity sector: a perfect fit), a series of mini-data expeditions provided participants with the opportunity to explore a range of data related questions relevant to the third sector. This was my first data exploration, and rather than deep dive into the data (which I probably should have done!) I was rather more interested in getting a feel for how potential data users actually discussed data related questions.

The topic my group was exploring was how to decide whether it made more sense to make a donation to a small or large charity. Unpacking this question a little forces a consideration of what the donor perceives to be a good use of their donation. To try to make the question more tractable, we also focussed on a particular scenario: how might a donor wishing to contribute to a charity related in some way to the hospice movement draw comparisons between them in order to inform their decision.

A scan of the Charity Commission website reveals that information is available for many charities relating to the size of the organisation as given by the number of staff or number of volunteers, as well as a certain amount of financial reporting:

charity commission comparison

Unfortunately, there doesn’t seem to be a way of comparing charities across a sector, nor does the data appear to be available as such. However, OpenCharities does make the data available in a rather more convenient form (that is, in a machine readable form) at web addresses/URLs of the form http://opencharities.org/charities/OPENCHARITIESID.json.

opencharities json

OpenCharities also publish a CSV file (a simple text based tabular data file) that contains crude summary information about UK registered charities, including such things as the charity name, its OpenCharities ID number, a summary description of each charity’s major activities, its registered address, and its social media profile information. If we grab this data, and pull from it the charities that are of interest to us, we can then use the OpenCharities IDs to create URLs from which we can pull the more detailed data.

Grabbing the CSV data file as charities.csv, we can filter out the rows containing items relating to hospices. From a Linux/Mac terminal command line, we can use the grep tool to grab rows that mention “hospice” somewhere in the line, and then create a new file using those rows appended to the column header row from the original file, putting the result into the file hospices.csv:

grep hospice charities.csv > tmp.csv
head -1 charities.csv | cat tmp.csv > hospice_charities.csv

and then upload the resulting filtered file to a Google spreadsheet.

Alternatively, we could load the whole OpenCharities CSV file into a tool such as OpenRefine and then filter the rows using the text filter on an appropriate column to select just rows mentioning “hospice” within a given column. Using the Custom Tabular Export, we could then upload the data directly to Google Spreadsheet.

Having got a list of hospice related charities into a Google Spreadsheet, we can Share the spreadsheet and also publish it (form the File menu). Publishing Google Spreadsheets makes the data available in a variety of formats, such as CSV data, via a web location/URL.

Our recipe so far is as follows:

– get CSV file of charities on OpenCharities;
– filter the file to get charities associated with hospices;
– upload the filtered file to Google Spreadsheets;
– publish the spreadsheet so we have it available as CSV at known URL.

The next step is to grab the more detailed data down from OpenCharities using the OpenCharities ID to construct the web addresses for where we can find that data for each hospice. We could use a tool such as OpenRefine to do this, but instead I’m going to write a short programme on Scraperwiki (OpenCharities Scraper) using the Python programming language to do the task.

#We need to load in some programme libraries to make life easier
import scraperwiki,csv,urllib2,json

#Here's where we load in the spreadsheet data published on Google Spreadsheets as a CSV file
charities = csv.DictReader((urllib2.urlopen('https://docs.google.com/spreadsheet/pub?key=0AirrQecc6H_vdFVlV0pyd3RVTktuR0xmTTlKY1gwZ3c&single=true&gid=1&output=csv')))

#This function will grab data about a charity from OpenCharities given its OpenCharities ID
def opencharitiesLookup(id):
    url = 'http://opencharities.org/charities/'+id+'.json'
    jsondata = json.load(urllib2.urlopen(url))
    return jsondata

#This routine cycles through each row of data/charity pulled from the spreadsheet
for charity in charities:
    #print charity

    #For each charity, I'm going to pull out several data fields that we'll then save to a database of our own
    #We start with some info from the spreadsheet - charity ID, name and activity summary
    data={'cid':charity['charity_number']}
    for tmp in ['title','activities']:
        data[tmp]=charity[tmp]

    #Here's where we pull in the more detailed data for each charity
    jdata = opencharitiesLookup(charity['charity_number'])

    #Then we start to go hunting for the data...
    chdata = jdata['charity']
    fdata = chdata['financial_breakdown']

    #The data will include employee and volunteer numbers..
    for tmp in ['volunteers','employees']:
        data[tmp] = chdata[tmp]

    #...as well as financial information
    for tmp in ['assets','spending','income']:
        if fdata != None and tmp in fdata:
            for tmp2 in fdata[tmp]:
                data[tmp+'_'+tmp2] = fdata[tmp][tmp2]
    #print data

    #Here's where we save all the data for a charity to a database
    scraperwiki.sqlite.save(unique_keys=['cid'], table_name='hospices', data=data)

When we run the scraper, it looks up the hospice data in the Google Spreadsheet, gets the richer data from OpenCharities, and pops it into a local database on Scraperwiki from where we can download the data:

scraperwiki opencharities

We can now download this enriched data as a CSV file and then, from Google Drive:

open a fusion table from drive

upload it to Google Fusion tables.

The recipe has now been extended as follows:

– pull the list of hospices into Scraperwiki from the Google Spreadsheet
– for each hospice, create an OpenCharities URL that points to the detailed data for that charity
– for each hospice, grab the corresponding data from that OpenCharities URL
– for each hospice, pull out the data elements we want and pop it into a Scraperwiki database
– download the database as a CSV file that now contains detailed data for each charity
– upload the detailed data CSV file to Google Fusion Tables

With the data in Google Fusion Tables, we can now start to analyse it and see what stories it might have to tell:

import csv to fusion table

Here it is…

data in fusion tables

We can then start to build charts around the data:

fusion tables add chart

Using the chart builder, you can specify which quantities to plot against each axis:

building up a fusion chart

(?I don’t know how to associate the title column (name of each charity) with points on the chart, so that when we hover over a point we can see which charity it relates to?)

We can also define out own calculations, much as you would in a spreadsheet.

fusion table add formula

For example, we might be interesting in knowing the ratio of income received from voluntary contributions versus the amount spent soliciting voluntary contributions. If this ratio is greater than 1, more money is spent soliciting these contributions than is received as a result. If the ratio is small (close to zero) then either the money spent soliciting voluntary contributions is effective, or voluntary income is being generated by other other means…

fusion table calc formula

We can then use one or more filters to explore which hospices meet different criteria. For example, how about organisations that seem to get a good voluntary income return versus spend on generating voluntary income, with the proviso that the voluntary income is less than £250,000:

build up filter

Along similar lines, we might want to calculate the ratio of volunteers to employees, and then then view the data for organisations with a certain number of employees (10 to 50, for example), or a certain number of volunteers (less than 500, say).

This is now the point at which story making – story generation, and data story telling – can start to kick in, based on the questions we want to ask of the data, or the stories we want to try to get it to tell us…