A Quick Peek at Some Charities Data…

As part of the Maximising Your Data Impact launch event of the Civil Society Data Network (reviewed in part by the Guardian here: Open data and the charity sector: a perfect fit), a series of mini-data expeditions provided participants with the opportunity to explore a range of data related questions relevant to the third sector. This was my first data exploration, and rather than deep dive into the data (which I probably should have done!) I was rather more interested in getting a feel for how potential data users actually discussed data related questions.

The topic my group was exploring was how to decide whether it made more sense to make a donation to a small or large charity. Unpacking this question a little forces a consideration of what the donor perceives to be a good use of their donation. To try to make the question more tractable, we also focussed on a particular scenario: how might a donor wishing to contribute to a charity related in some way to the hospice movement draw comparisons between such charities in order to inform their decision?

A scan of the Charity Commission website reveals that information is available for many charities relating to the size of the organisation as given by the number of staff or number of volunteers, as well as a certain amount of financial reporting:

charity commission comparison

Unfortunately, there doesn’t seem to be a way of comparing charities across a sector, nor does the data appear to be available as such. However, OpenCharities does make the data available in a rather more convenient form (that is, in a machine readable form) at web addresses/URLs of the form http://opencharities.org/charities/OPENCHARITIESID.json.

opencharities json
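Here’s a minimal sketch of pulling down and peeking inside one of those records from Python using the requests library; the charity ID is a placeholder, and the field names are the ones used in the scraper code later in this post:

import requests

#Placeholder - substitute a real OpenCharities ID / registered charity number
OPENCHARITIES_ID = 'OPENCHARITIESID'

#Fetch the machine readable record for this charity
url = 'http://opencharities.org/charities/' + OPENCHARITIES_ID + '.json'
record = requests.get(url).json()

#The interesting data sits inside a 'charity' element
charity = record['charity']
print(sorted(charity.keys()))

#Employee/volunteer counts and the financial breakdown are among the fields returned
print(charity.get('employees'), charity.get('volunteers'))
print(charity.get('financial_breakdown'))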

OpenCharities also publish a CSV file (a simple text based tabular data file) that contains crude summary information about UK registered charities, including such things as the charity name, its OpenCharities ID number, a summary description of each charity’s major activities, its registered address, and its social media profile information. If we grab this data, and pull from it the charities that are of interest to us, we can then use the OpenCharities IDs to create URLs from which we can pull the more detailed data.

Grabbing the CSV data file as charities.csv, we can filter out the rows containing items relating to hospices. From a Linux/Mac terminal command line, we can use the grep tool to grab rows that mention “hospice” somewhere in the line, and then create a new file by appending those rows to the column header row from the original file, putting the result into the file hospice_charities.csv:

grep hospice charities.csv > tmp.csv
head -1 charities.csv | cat - tmp.csv > hospice_charities.csv

and then upload the resulting filtered file to a Google spreadsheet.

Alternatively, we could load the whole OpenCharities CSV file into a tool such as OpenRefine and then filter the rows using the text filter on an appropriate column to select just rows mentioning “hospice” within a given column. Using the Custom Tabular Export, we could then upload the data directly to Google Spreadsheet.
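A third option, for those happy to do a little scripting, is to filter the file in Python using the pandas library. Here’s a minimal sketch, assuming the summary file has been saved locally as charities.csv and using the title and activities column names that appear in the scraper code later in this post:

import pandas as pd

#Load the OpenCharities summary file (downloaded locally as charities.csv)
charities = pd.read_csv('charities.csv', dtype=str)

#Keep rows that mention "hospice" in the title or activities columns
mask = (charities['title'].str.contains('hospice', case=False, na=False)
        | charities['activities'].str.contains('hospice', case=False, na=False))
hospices = charities[mask]

#Save the filtered rows ready for upload to a Google spreadsheet
hospices.to_csv('hospice_charities.csv', index=False)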

Having got a list of hospice related charities into a Google Spreadsheet, we can Share the spreadsheet and also publish it (from the File menu). Publishing Google Spreadsheets makes the data available in a variety of formats, such as CSV data, via a web location/URL.

Our recipe so far is as follows:

– get CSV file of charities on OpenCharities;
– filter the file to get charities associated with hospices;
– upload the filtered file to Google Spreadsheets;
– publish the spreadsheet so we have it available as CSV at known URL.

The next step is to grab the more detailed data down from OpenCharities using the OpenCharities ID to construct the web addresses for where we can find that data for each hospice. We could use a tool such as OpenRefine to do this, but instead I’m going to write a short programme on Scraperwiki (OpenCharities Scraper) using the Python programming language to do the task.

#We need to load in some programme libraries to make life easier
import scraperwiki,csv,urllib2,json

#Here's where we load in the spreadsheet data published on Google Spreadsheets as a CSV file
charities = csv.DictReader((urllib2.urlopen('https://docs.google.com/spreadsheet/pub?key=0AirrQecc6H_vdFVlV0pyd3RVTktuR0xmTTlKY1gwZ3c&single=true&gid=1&output=csv')))

#This function will grab data about a charity from OpenCharities given its OpenCharities ID
def opencharitiesLookup(id):
    url = 'http://opencharities.org/charities/'+id+'.json'
    jsondata = json.load(urllib2.urlopen(url))
    return jsondata

#This routine cycles through each row of data/charity pulled from the spreadsheet
for charity in charities:
    #print charity

    #For each charity, I'm going to pull out several data fields that we'll then save to a database of our own
    #We start with some info from the spreadsheet - charity ID, name and activity summary
    data={'cid':charity['charity_number']}
    for tmp in ['title','activities']:
        data[tmp]=charity[tmp]

    #Here's where we pull in the more detailed data for each charity
    jdata = opencharitiesLookup(charity['charity_number'])

    #Then we start to go hunting for the data...
    chdata = jdata['charity']
    fdata = chdata['financial_breakdown']

    #The data will include employee and volunteer numbers..
    for tmp in ['volunteers','employees']:
        data[tmp] = chdata[tmp]

    #...as well as financial information
    for tmp in ['assets','spending','income']:
        if fdata != None and tmp in fdata:
            for tmp2 in fdata[tmp]:
                data[tmp+'_'+tmp2] = fdata[tmp][tmp2]
    #print data

    #Here's where we save all the data for a charity to a database
    scraperwiki.sqlite.save(unique_keys=['cid'], table_name='hospices', data=data)

When we run the scraper, it looks up the hospice data in the Google Spreadsheet, gets the richer data from OpenCharities, and pops it into a local database on Scraperwiki from where we can download the data:

scraperwiki opencharities

We can now download this enriched data as a CSV file and then, from Google Drive:

open a fusion table from drive

upload it to Google Fusion tables.

The recipe has now been extended as follows:

– pull the list of hospices into Scraperwiki from the Google Spreadsheet
– for each hospice, create an OpenCharities URL that points to the detailed data for that charity
– for each hospice, grab the corresponding data from that OpenCharities URL
– for each hospice, pull out the data elements we want and pop it into a Scraperwiki database
– download the database as a CSV file that now contains detailed data for each charity
– upload the detailed data CSV file to Google Fusion Tables

With the data in Google Fusion Tables, we can now start to analyse it and see what stories it might have to tell:

import csv to fusion table

Here it is…

data in fusion tables

We can then start to build charts around the data:

fusion tables add chart

Using the chart builder, you can specify which quantities to plot against each axis:

building up a fusion chart

(I don’t know how to associate the title column (name of each charity) with points on the chart, so that when we hover over a point we can see which charity it relates to.)

We can also define our own calculations, much as you would in a spreadsheet.

fusion table add formula

For example, we might be interested in knowing the ratio of the amount spent soliciting voluntary contributions to the income received from those contributions. If this ratio is greater than 1, more money is spent soliciting these contributions than is received as a result. If the ratio is small (close to zero), then either the money spent soliciting voluntary contributions is effective, or voluntary income is being generated by other means…

fusion table calc formula

We can then use one or more filters to explore which hospices meet different criteria. For example, how about organisations that seem to get a good voluntary income return versus spend on generating voluntary income, with the proviso that the voluntary income is less than £250,000:

build up filter
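(If you prefer to sanity check this sort of calculation outside Fusion Tables, here’s a rough pandas sketch run over the enriched CSV file we downloaded from Scraperwiki. The file name and the income_voluntary/spending_voluntary column names are guesses at how the voluntary income and fundraising spend columns come out of the scraper, so adjust them to match your own download; the 0.1 ratio threshold is just an arbitrary illustration.)

import pandas as pd

#The enriched data downloaded from Scraperwiki (file name is a placeholder)
hospices = pd.read_csv('hospices_enriched.csv')

#Ratio of fundraising spend to voluntary income (column names are guesses - check your file)
hospices['voluntary_spend_ratio'] = hospices['spending_voluntary'] / hospices['income_voluntary']

#Organisations with a good return on fundraising spend and voluntary income under £250,000
shortlist = hospices[(hospices['voluntary_spend_ratio'] < 0.1)
                     & (hospices['income_voluntary'] < 250000)]
print(shortlist[['title', 'income_voluntary', 'voluntary_spend_ratio']])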

Along similar lines, we might want to calculate the ratio of volunteers to employees, and then view the data for organisations with a certain number of employees (10 to 50, for example), or a certain number of volunteers (less than 500, say).

This is now the point at which story making – story generation, and data story telling – can start to kick in, based on the questions we want to ask of the data, or the stories we want to try to get it to tell us…

Simple Map Making With Google Fusion Tables

A quicker than quick recipe to make a map from a list of addresses in a simple text file using Google Fusion tables…

Here’s some data (grabbed from The Gravesend Reporter via this recipe) in a simple two column CSV format; the first column contains address data. Here’s what it looks like when I import it into Google Fusion Tables:

data in a fusion table

Now let’s map it:-)

First of all we need to tell the application which column contains the data we want to geocode – that is, the address we want Fusion Tables to find the latitude and longitude co-ordinates for…

tweak the column

Then we say we want the column to be recognised as a Location type column:

change name make location

Computer says yes, highlighting the location type cells with a yellow background:

fusion table.. yellow...

As if by magic a Map tab appears (though possibly not if you are using Google Fusion Tables as part of a Google Apps account…) The geocoder also accepts hints, so we can make life easier for it by providing one;-)

map tab...

Once the points have been geocoded, they’re placed onto a map:

mapped

We can now publish the map in preparation for sharing it with the world…

publish map

We need to change the visibility of the map to something folk can see!

privacy and link

Public on the web, or just via a shared link – your choice:

make seeable

Here’s my map:-)

The data used to generate this map was originally grabbed from the Gravesend Reporter: Find your polling station ahead of the Kent County Council elections. A walkthrough of how the data was prepared can be found here: A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page.

Asking Questions of Data Contained in a Google Spreadsheet Using a Basic Structured Query Language

There is an old saying along the lines of “give a man a fish and you can feed him for a day; teach a man to fish and you’ll feed him for a lifetime”. The same is true when you learn a little bit about structured query languages… In the post Asking Questions of Data – Some Simple One-Liners (or Asking Questions of Data – Garment Factories Data Expedition), you can see how the SQL query language could be used to ask questions of an election related dataset hosted on Scraperwiki [no longer there:-(] that had been compiled by scraping a “Notice of Poll” PDF document containing information about election candidates. In this post, we’ll see how a series of queries constructed along very similar lines can be applied to data contained within a Google spreadsheet using the Google Chart Tools Query Language.

To provide some sort of context, I’ll stick with the local election theme, although in this case the focus will be on election results data. If you want to follow along, the data can be found in this Google spreadsheet – Isle of Wight local election data results, May 2013 (the spreadsheet key is 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc).

IW Poll spreadsheet

The data was obtained from a dataset originally published by the OnTheWight hyperlocal blog that was shaped and cleaned using OpenRefine using a data wrangling recipe similar to the one described in A Wrangling Example With OpenRefine: Making “Oven Ready Data”.

To query the data, I’ve popped up a simple query form on a GitHub site: Google Spreadsheet Explorer

Google spreadsheet explorer

To use the explorer, you need to:

  1. provide a spreadsheet key value and optional sheet number (for example, 0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc);
  2. preview the table headings;
  3. construct a query using the column letters;
  4. select the output format;
  5. run the query.

So what sort of questions might we want to ask of the data? Let’s build some up.

We might start by just looking at the raw results as they come out of the spreadsheet-as-database: SELECT A,D,E,F

SImple query

We might then want to look at each electoral division seeing the results in rank order: SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC

Results in order

Let’s bring the spoiled vote count back in: SELECT A,D,E,F WHERE E != 'NA' OR D CONTAINS 'spoil' ORDER BY A,F DESC (we might equally have said OR D = 'Papers spoilt').

Papers spoilt included

How about doing some sums? How does the league table of postal ballot percentages look across each electoral division? SELECT A,100*F/B WHERE D CONTAINS 'Postal' ORDER BY 100*F/B DESC

Postal Turnout

Suppose we want to look at the turnout. The “NoONRoll” column B gives the number of people eligible to vote in each electoral division, which is a good start. Unfortunately, using the data in the spreadsheet we have, we can’t do this for all electoral divisions – the “votes cast” is not necessarily the number of people who voted because some electoral divisions (Brading, St Helens & Bembridge and Nettlestone & Seaview) returned two candidates (which meant people voting were each allowed to cast up to and including two votes; the number of people who voted was in the original OnTheWight dataset). If we bear this caveat in mind, we can run the numbers for the other electoral divisions though. The Total votes cast is actually the number of “good” votes cast – the turnout was actually the Total votes cast plus the Papers spoilt. Let’s start by calculating the “good vote turnout” for each ward, rank the electoral divisions by turnout (ORDER BY 100*F/B DESC), label the turnout column appropriately (LABEL 100*F/B 'Percentage') and format the results ( FORMAT 100*F/B '#,#0.0') using the query SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B DESC LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0'

Good vote turnout

Remember, the first two results are “nonsense” because electors in those electoral divisions may have cast two votes.

How about the three electoral divisions with the lowest turnout? SELECT A, 100*F/B WHERE D CONTAINS 'Total' ORDER BY 100*F/B ASC LIMIT 3 LABEL 100*F/B 'Percentage' FORMAT 100*F/B '#,#0.0' (Note that the order of the arguments – such as where to put the LIMIT – is important; the wrong order can prevent the query from running…)

Worst 3 by turnout

The actual turnout (again, with the caveat in mind!) is the total votes cast plus the spoilt papers. To calculate this percentage, we need to sum the total and spoilt contributions in each electoral division and divide by the size of the electoral roll. To do this, we need to SUM the corresponding quantities in each electoral division. Because multiple (two) rows are summed for each electoral division, we find the size of the electoral roll in each electoral division as SUM(B)/COUNT(B) – that is, we count it twice and divide by the number of times we counted it. The query (without tidying) starts off looking like this: SELECT A,SUM(F)*COUNT(B)/SUM(B) WHERE D CONTAINS 'Total' OR D CONTAINS 'spoil' GROUP BY A

Summing rows in a group

In terms of popularity, who were the top 5 candidates in terms of people receiving the largest number of votes? SELECT D,A, E, F WHERE E!='NA' ORDER BY F DESC LIMIT 5

Top 5 by votes cast

How about if we normalise these numbers by the number of people on the electoral roll in the corresponding areas – SELECT D,A, E, F/B WHERE E!='NA' ORDER BY F/B DESC LIMIT 5

TOp 5 as percentage on roll

Looking at the parties, how did the sum of their votes across all the electoral divisions compare? SELECT E,SUM(F) where E!='NA' GROUP BY E ORDER BY SUM(F) DESC

VOtes by party

How about if we bring in the number of candidates who stood for each party, and normalise by this to calculate the average “votes per candidate” by party? SELECT E,SUM(F),COUNT(F), SUM(F)/COUNT(F) where E!='NA' GROUP BY E ORDER BY SUM(F)/COUNT(F) DESC

Average votes per candidate

To summarise then, in this post, we have seen how we can use a structured query language to interrogate the data contained in a Google Spreadsheet, essentially treating the Google Spreadsheet as if it were a database. The query language can also be used to perform a series of simple calculations over the data to produce a derived dataset. Unfortunately, the query language does not allow us to nest SELECT statements in the same way we can nest SQL SELECT statements, which limits some of the queries we can run.
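As a final aside, the same sort of query can also be run programmatically rather than through the explorer form. Here’s a minimal Python sketch using the requests library; it assumes the spreadsheet exposes the Google Visualization API endpoint at the /gviz/tq path (which is the case for current Google Sheets – an old-style key like the one above may need the spreadsheet migrating to the new Sheets format first) and that a CSV response can be requested via tqx=out:csv:

import requests

#Spreadsheet key as used in the explorer above
KEY = '0AirrQecc6H_vdEZOZ21sNHpibnhmaEYxbW96dkNxZGc'

#One of the queries from above: candidates ranked by votes within each electoral division
query = "SELECT A,D,E,F WHERE E != 'NA' ORDER BY A,F DESC"

#The gviz endpoint runs the query server side and can return the result as CSV
url = 'https://docs.google.com/spreadsheets/d/{key}/gviz/tq'.format(key=KEY)
response = requests.get(url, params={'tq': query, 'tqx': 'out:csv'})

print(response.text[:500])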

Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine

As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?

Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.

libdems million jobs

If we view source on the page, we can see where the map – and maybe the data being used to power it – comes from…

libdems jobs view source

Ah ha – it’s an embedded map from a Google Fusion Table…

https://www.google.com/fusiontables/embedviz?q=select+col0+from+1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ&viz=MAP&h=false&lat=52.5656923458786&lng=-1.0353351498047232&t=1&z=7&l=col0&y=2&tmplt=3

We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and popping it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:

https://www.google.com/fusiontables/DataSource?docid=1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ

Lib dems jobs Fusion tables

The description data is curtailed, but we can see the full description on the card view:

Lib dems fusion tables card

Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.

lib dems jobs data in OpenRefine

This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.

Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:

OpenRefine transpose cell rows to cols2

And as if by magic, we have recreated a workable table:-)

Openrefine rows transposed to cols
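(For what it’s worth, the same reshaping can also be sketched in a few lines of Python/pandas, assuming the scraped column has been saved as a single-column CSV file – the file name here is a placeholder – and that every three consecutive rows belong to one record:)

import pandas as pd

#Single column scrape, with each record spread over three consecutive rows
raw = pd.read_csv('libdem_jobs.csv', header=None, names=['value'])

#Reshape: take the values three at a time and lay each group out as one row
#(assumes the number of rows is an exact multiple of three)
records = raw['value'].values.reshape(-1, 3)
tidy = pd.DataFrame(records, columns=['col1', 'col2', 'col3'])

print(tidy.head())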

If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?)

openrefinelibdem jobs text facet

If we peer a bit more closely, we see that some of the numbers relating to job site locations referred to in the descriptions don’t seem to tally with the number of markers. So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?

counting job sites

What this “example” of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as well as questions that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?

Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:

libdems - have given text

Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:

import re
tmp=value
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
if value==tmp: tmp=''
tmp = tmp.replace(',','')
return tmp

libdems have given amount

Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of this column creation.

If we cast these values to a number:

openrefine convert given to number

we can then use a numeric facet to help us explore the amounts.

libdems value numeric facet

In this case, we notice that there weren’t that many descriptions that matched the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:

  • Summary: Lib Dems in Government have secured a £73,000 grant for …
  • Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
  • Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
  • Summary: Lib Dems in Government have given £982,000 to …

Here’s an improved regular expression for parsing out some more of these amounts:

import re
tmp=value
#Amounts of the form "have given a £N..." or "have given £N..."
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
#Amounts of the form "have secured a £N grant..."
tmp=re.sub(r'Summary: Lib Dems in Government have secured a £([0-9,\.]*).*',r'\1',tmp)
#Amounts of the form "have given local business SO-AND-SO a £N grant..."
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)

#If nothing matched, leave the cell empty; otherwise strip the commas from the number
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

So now we can start to identify some of the bigger grants…

libdems jobs big amounts

More to add? eg around:

  • ...have secured a £150,000 grant...
  • Summary: Lib Dems have given a £1,571,000 grant...
  • Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
  • Summary: Lib Dems in Government have allocated £300,000 to... – here’s another, perhaps for a ‘possible spend’ column?

Note: once you start poking around at these descriptions, you find a wealth of things like: “Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone , helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes.” This leads us to ask: how many of the “one million jobs” arise from improvements to road junctions…?

how many jobs from road junction improvements?

In order to address this question, we might start to have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:

import re
tmp=value
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

Lib dems jobs created

If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:

  • £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
  • £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant

But that’s for another day…

If you want to play with the data yourself, you can find it here.

Using OpenRefine to Clean Multiple Documents in the Same Way

When working with data that is published on a monthly basis according to the same template, it is often the case that we need to apply the same data cleaning rules to the data each time a new file is released. This recipe shows how to use OpenRefine to create a reusable script for cleaning data files that get published month on month according to the same schedule.

To make things a little more concrete, consider this example. Under UK transparency regulations, local councils publish spending data for amounts over £500 on a monthly basis:

IW council transparency data

Getting to the actual download link for the CSV data files requires another click… (example CSV file, if you want to play along…;-)

IW council transparency data download link

If we want to get this data into a site such as OpenSpending.org, we need to do a little bit of tidying of the data to get it into the format that OpenSpending expects (see for example The OpenSpending data format). As each new file is released, we need to clean it as we have cleaned the files before. If you are a confident programmer, you could write a script to handle this process. But what can the rest of us do to try to automate this process and make life a little easier?

One way is to use OpenRefine’s ability to “replay” cleaning scripts that you have generated before. Here’s an example of how to do just that…

Let’s start by loading the data in from a CSV file on the Isle of Wight local council website – we need to copy the download link URL ourselves:

OPenRefine - import from CSV

The recent spending files are in a common format (that is, they have been published according to the same template), which is something we’ll be relying on, so we could load multiple files in at once into one big data file, but in this case I’m going to take each file separately.

OpenRefine makes a good guess at the file format. One minor tweak we might make is to ignore any blank lines (OpenSpending doesn’t like blank lines!).

openrefien import csv settings

Here’s what the data looks like once we import it:

preview the datai in openrefine

OpenSpending expects the data to be presented in a particular way, which is why we need to clean the data a little before we can upload it. For example, OpenSpending likes column names that are free of spaces and punctuation; it requires an amount column that just contains numbers (so no commas in the number to make it more readable!); it requires dates in the format 2013-07-15 (that is, the yyyy-mm-dd format) (and I think it needs this column to be called time?).
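(As an aside, for anyone who does fancy the scripted route mentioned above, the same tidying steps can be sketched in a few lines of pandas. The file name and the original column names in the rename step are purely illustrative – they would need replacing with the ones in the actual council file – but the steps mirror the OpenRefine recipe that follows: drop blank lines, rename columns, strip commas from the amount, and rewrite the dates as yyyy-mm-dd.)

import pandas as pd

#Blank lines are skipped on import (OpenSpending doesn't like blank lines)
spend = pd.read_csv('iw_transparency_month.csv', skip_blank_lines=True)

#Rename columns to space/punctuation free names (original names here are illustrative)
spend = spend.rename(columns={'Transaction Date': 'time', 'Amount (£)': 'amount'})

#Strip commas from the amount column and cast it to a number
spend['amount'] = spend['amount'].astype(str).str.replace(',', '').astype(float)

#Reformat dd.mm.yyyy dates into the yyyy-mm-dd form OpenSpending expects
spend['time'] = pd.to_datetime(spend['time'], format='%d.%m.%Y').dt.strftime('%Y-%m-%d')

spend.to_csv('iw_transparency_month_clean.csv', index=False)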

Here’s how we can rename the columns:

openrefien - rename column

Rename each column in turn, as required – for example, remove any punctuation, use camelCase (removing spaces and using capital letters to mark word boundaries), or replace spaces with underscores (_).

Let’s look at the amount column – if we select the numeric facet we can see there are lots of things not identified as numbers:

open refine - check amount as numeric

We can preview what the non-numeric values are so we can set about tidying them up…

Openrefine look for non-numerics

So commas appear to be the major issue – let’s tidy them up by transforming the cells in that column to remove the commas.

openrefine transform

We can do this by replacing a comma whenever we see one with nothing (that is, an empty character string) – value.replace(',','')

OpenRefine  - remove comma

Let’s see what effect that has on the numeric facet view:

Open Refine amount after cleaning

Well that seems to have worked – no non-numerics any more… (We might also use the sliders in the numeric facet to check the outlying values to see if they are plausible, or look like they may be errors.)

As far as the dates go, we have dates in the form 17.04.2013 rather than 2013-04-17 so let’s transform them into the required format. However, because there is more scope for things going wrong with this transformation, let’s put the transformed data into a new column:

open refine  - add column

Here’s how we define the contents of that column:

oen refine date transfrom

That is: value.toDate('dd.MM.yyyy').toString('yyyy-MM-dd')

The first part – value.toDate('dd.MM.yyyy') – interprets the string as a date presented in a given format (in these date patterns, MM means month and mm would mean minutes); the second part then transforms that date into the required output format: .toString('yyyy-MM-dd')

We can quickly preview that this step has worked by transforming the column to a date type:

open refine trasnform to date

and then preview it using a timeline facet to check that all seems in order.

OPenrefien timeline facet

(If there were non-date or error elements, we could select them and look at the original date and transformed date to see where the problem may lie.)

We don’t want the date formatting for our OpenSpending data file, so we can undo the step that transformed the data into the timeline viewable date format:

openrefine history

So now we have our cleaned data file. How can we apply the same steps to another month? If you look at the history tab, you will see it offers an “Extract” option – this provides a history of the change operations we applied to the dataset.

If we copy this history to a text file somewhere, we can then make use of it again and again.

open refine extract

To see how, open another OpenRefine project and import the data for another month (such as this one). When you have created the project, click on the Undo/Redo history tab and select Apply:

open refine  - appply previous transformations

Paste in the transformation script we grabbed from the previous project:

open refne - extract reuse

Here’s the script I used – https://gist.github.com/psychemedia/6087946

When we apply the script, the data is cleaned using the same operations as previously:

open refine cleanded as if by magic

That is, as long as the new data file is in the same format as the previous one, and only requires the same cleaning operations, we don’t really have much to do – we can just reuse the script we generated the first time we cleaned a file of this type. And each time a new file is published, we have a quick recipe for cleaning it up so we can get it into OpenSpending:-)

Finally, we can export the data for use as required…

open refine export data

Magic:-)

Negative Payments in Local Spending Data

In anticipation of a new R library from School of Data data diva @mihi_tr that will wrap the OpenSpending API and provide access to OpenSpending.org data directly from within R, I thought I’d start doodling around some ideas raised in Identifying Pieces in the Spending Data Jigsaw. In particular, common payment values, repayments/refunds and “balanced payments”, that is, multiple payments where the absolute value of a negative payment matches that of an above zero payment (so for example, -£269.72 and £269.72 would be balanced payments).

The data I’ve grabbed is Isle of Wight Council transparency (spending) data for the financial year 2012/2013. The data was pulled from the Isle of Wight Council website and cleaned using OpenRefine broadly according to the recipe described in Using OpenRefine to Clean Multiple Documents in the Same Way with a couple of additions: a new SupplierNameClustered column, originally created from the SupplierName column, but then cleaned to remove initials, (eg [SD]), direct debit prefixes (DD- etc) and then clustered using a variety of clustering algorithms; and a new unique index column, created with values '_'+row.index. You can find a copy of the data here.

To start with, let’s see how we can identify “popular” transaction amounts:

#We going to use some rearrangement routines...
library(plyr)

#I'm loading in the data from a locally downloaded copy
iw <- read.csv("~/code/Rcode/eduDemos1/caseStudies/IWspending/IWccl2012_13_TransparencyTest4.csv")

#Display most common payment values
commonAmounts=function(df,bcol='Amount',num=5){
  head(arrange(data.frame(table(df[[bcol]])),-Freq),num)
}
ca=commonAmounts(iw)

ca
#     Var1 Freq
#1 1567.72 2177
#2 1930.32 1780
#3  1622.6 1347
#4 1998.08 1253
#5  1642.2 1227

This shows that the most common payment by value is for £1567.72. An obvious question to ask here is: does this correspond to some sort of “standard payment” or tariff? And if so, can we reconcile this against the extent of the delivery of a particular level of service, perhaps as disclosed elsewhere?

We can also generate a summary report to show what transactions correspond to this amount, or the most popular amounts.

#We can then get the rows corresponding to these common payments
commonPayments=function(df,bcol,commonAmounts){
  df[abs(df[[bcol]]) %in% commonAmounts,]
}
cp.df=commonPayments(iw,"Amount",ca$Var1)

More usefully, we might generate “pivot table” style summaries of how the popular payments break down with respect to expenses area, supplier, or some other combination of factors. So for example, we can learn that there were 1261 payments of £1567.72 booked to the EF Residential Care services area, and SOMERSET CARE LTD [SB] received 231 payments of £1567.72. (Reporting by the clustered supplier name is often more useful…)

#R does pivot tables, sort of...
#We can also run summary statistics over those common payment rows.
zz1=aggregate(index ~ ServiceArea + Amount, data =cp.df, FUN="length")
head(arrange(zz1,-index))

#                       ServiceArea  Amount index
#1              EF Residential Care 1567.72  1261
#2             EMI Residential Care 1642.20   659
#3             EMI Residential Care 1930.32   645
#4              EF Residential Care 1930.32   561
#5              EF Residential Care 1622.60   530
#6 Elderly Frail Residential Income 1622.60   36

#Another way of achieving a similar result is using the plyr count() function:
head(arrange(count(cp.df, c('ServiceArea','ExpensesType','Amount')),-freq))
#That is:
zz1.1=count(cp.df, c('ServiceArea','ExpensesType','Amount'))

zz2=aggregate(index ~ ServiceArea +ExpensesType+ Amount, data =cp.df, FUN="length")
head(arrange(zz2,-index))
#                       ServiceArea                       ExpensesType  Amount
#1              EF Residential Care                Chgs from Ind Provs 1567.72
#2             EMI Residential Care                Chgs from Ind Provs 1930.32
#3             EMI Residential Care                Chgs from Ind Provs 1642.20
#4              EF Residential Care Charges from Independent Providers 1622.60
#...

zz3=aggregate(index ~ SupplierName+ Amount, data =cp.df, FUN="length")
head(arrange(zz3,-index))
#                SupplierName  Amount index
#1     REDACTED PERSONAL DATA 1567.72   274
#2     SOMERSET CARE LTD [SB] 1567.72   231
#3 ISLAND HEALTHCARE LTD [SB] 1930.32   214
#...

zz4=aggregate(index ~ SupplierNameClustered+ Amount, data =cp.df, FUN="length")
head(arrange(zz4,-index))

If I was a suspicious type, I suppose I might look for suppliers who received one of these popular amounts only once…

Let’s have a search for items declared as overpayments or refunds, perhaps as preliminary work in an investigation about why overpayments are being made…:

#Let's look for overpayments
overpayments=function(df,fcol,bcol='Amount',num=5){
  df=subset(df, grepl('((overpay)|(refund))', df[[fcol]],ignore.case=T))
  df[ order(df[,bcol]), ]
}
rf=overpayments(iw,'ExpensesType')

#View the largest "refund" transactions
head(subset(arrange(rf,Amount),select=c('Date','ServiceArea','ExpensesType','Amount')))
#        Date                           ServiceArea         ExpensesType    Amount
#1 2013-03-28 Elderly Mentally Ill Residential Care     Provider Refunds -25094.16
#2 2012-07-18                   MH Residential Care Refund of Overpaymts -24599.12
#3 2013-03-25 Elderly Mentally Ill Residential Care     Provider Refunds -23163.84

#We can also generate a variety of other reports on the "refund" transactions
head(arrange(aggregate(index ~ SupplierName+ ExpensesType+Amount, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType   Amount index
#1 THE ORCHARD HOUSE CARE HOME Provider Refunds  -434.84     8
#2 THE ORCHARD HOUSE CARE HOME Provider Refunds  -729.91     3
#3         SCIO HEALTHCARE LTD Provider Refunds  -434.84     3
#...
##Which is to say: The Orchard House Care home made 8 refunds of £434.84 and 3 of £729.91

head(arrange(aggregate(index ~ SupplierName+ ExpensesType, data =rf, FUN="length"),-index))
#                 SupplierName     ExpensesType index
#1         SCIO HEALTHCARE LTD Provider Refunds    31
#2           SOMERSET CARE LTD Provider Refunds    31
#3 THE ORCHARD HOUSE CARE HOME Provider Refunds    22
#...

Another area of investigation might be “balanced” payments. Here’s one approach for finding those, based around first identifying negative payments. In this instance, let’s also narrow the search to balanced payments involving the supplier with the largest number of negative payments.

##Find which suppliers were involved in the largest number of negative payments
#Identify negative payments
nn=subset(iw,Amount<=0)
#Count the number of each unique supplier
nnd=data.frame(table(nn$SupplierNameClustered))
#Display the suppliers with the largest count of negative payments
head(arrange(nnd,-Freq))

#                    Var1 Freq
#1 REDACTED PERSONAL DATA 2773
#2      SOUTHERN ELECTRIC  313
#3   BRITISH GAS BUSINESS  102
#4      SOMERSET CARE LTD   75
#...

Let’s look for balanced payments around SOUTHERN ELECTRIC…

#Specific supplier search
#Limit transactions to just transactions involving this supplier
se=subset(iw, SupplierNameClustered=='SOUTHERN ELECTRIC')

##We can also run partial matching searches...
#sw=subset(iw,grepl('SOUTHERN', iw$SupplierNameClustered))

#Now let's search for balanced payments
balanced.items=function(df,bcol){
  #Find the positive amounts
  positems=df[ df[[bcol]]>0, ]
  #Find the negative amounts
  negitems=df[ df[[bcol]]<=0, ]
  
  #Find the absolute unique negative amounts
  uniqabsnegitems=-unique(negitems[[bcol]])
  #Find the unique positive amounts
  uniqpositems=unique(positems[[bcol]])
  
  #Find matching positive and negative amounts
  balitems=intersect(uniqabsnegitems,uniqpositems)
  #Subset the data based on balanced positive and negative amounts
  #bals=subset(se,abs(Amount) %in% balitems)
  bals=df[abs(df[[bcol]]) %in% balitems,]
  #Group the data by sorting, largest absolute amounts first
  bals[ order(-abs(bals[,bcol])), ]
}

dse=balanced.items(se,'Amount')

head(subset(dse,select=c('Directorate','Date','ServiceArea','ExpensesType','Amount')))
#                              Directorate       Date                ServiceArea ExpensesType   Amount
#20423               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity -2770.35
#20424               Economy & Environment 2012-07-11        Sandown Depot (East  Electricity  2770.35
#52004 Chief Executive, Schools & Learning 2013-01-30 Haylands - Playstreet Lane  Electricity  2511.20
#52008 Chief Executive, Schools & Learning 2013-01-31 Haylands - Playstreet Lane  Electricity -2511.20

We can also use graphical techniques to try to spot balanced payments.

#Crude example plot to highlight matched +/- amounts
se1a=subset(se,ServiceArea=='Ryde Town Hall' & Amount>0)
se1b=subset(se,ServiceArea=='Ryde Town Hall' & Amount<=0)

require(ggplot2)
g=ggplot() + geom_point(data=se1a, aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

In this first image, we see payments over time – the red markers plot the negative payments with their sign flipped (that is, minus the amount). Notice that in some cases balanced payments seem to appear on the same day. If the y-values of a red and a black marker are the same, they are balanced in value. The x-axis is time. Where there is a run of equally spaced marks along x with the same y-value, this may represent a regular scheduled payment.

ryde - energy

g=ggplot(se1a)+geom_point( aes(x=Date,y=Amount),pch=1,size=1)
g=g+geom_point(data=se1b, aes(x=Date,y=-Amount),size=3,col='red',pch=2)
g=g+facet_wrap(~ExpensesType)
g=g+ggtitle('Ryde Town Hall - Energy Payments to Southern Electric (red=-Amount)')
g=g+xlab(NULL)+ylab('Amount (£)')
g=g+theme(axis.text.x = element_text(angle = 45, hjust = 1))
g

We can also facet out the payments by expense type.

ryde-energy facet

Graphical techniques can often help us spot patterns in the data that might be hard to spot just by looking at rows and columns worth of data. In many cases, it is worthwhile developing visual analysis skills to try out quick analyses “by eye” before trying to implement them as more comprehensive analysis scripts.

Extracting Images from PDFs

A quick recipe for extracting images embedded in PDFs (and in particular, extracting photos contained within PDFs…).

For example, Shell Nigeria has a site that lists oil spills along with associated links to PDF docs that contain photos corresponding to the oil spill:

shell ng oil spill

Running an import.io scraper over the site can give a list of all the oil spills along with links to the corresponding PDFs. We can trawl through these links, downloading the PDFs and extracting the images from them.

import os,re
import urllib2

#New OU course will start using pandas, so I need to start getting familiar with it.
#In this case it's overkill, because all I'm using it for is to load in a CSV file...
import pandas as pd

#url='http://s01.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/967426_BenisedeWell11_flowline_at_Amabulou_Photos.pdf'

#Load in the data scraped from Shell
df= pd.read_csv('shell_30_11_13_ng.csv')

errors=[]

#For each line item:
for url in df[df.columns[15]]:
	try:
		print 'trying',url
		u = urllib2.urlopen(url)
		fn=url.split('/')[-1]

		#Grab a local copy of the downloaded picture containing PDF
		localFile = open(fn, 'w')
		localFile.write(u.read())
		localFile.close()
	except:
		print 'error with',url
		errors.append(url)
		continue
	
	#If we look at the filenames/urls, the filenames tend to start with the JIV id
	#...so we can try to extract this and use it as a key
	id=re.split(r'[_-]',fn)[0]

	#I'm going to move the PDFs and the associated images stripped from them in separate folders
	fo='data/'+id
	os.system(' '.join(['mkdir',fo]))
	idp='/'.join([fo,id])
	
	#Try to cope with crappy filenames containing punctuation chars
	fn= re.sub(r'([()&])', r'\\\1', fn)

	#THIS IS THE LINE THAT PULLS OUT THE IMAGES
	#Available via poppler-utils
	#See: http://ubuntugenius.wordpress.com/2012/02/04/how-to-extract-images-from-pdf-documents-in-ubuntulinux/
	#Note: the '; mv' etc etc bit copies the PDF file into the new JIV report directory
	cmd=' '.join(['pdfimages -j',fn, idp, '; mv',fn,fo  ])
	os.system(cmd)
	#Still a couple of errors on filenames
	#just as quick to catch by hand/inspection of files that don't get moved properly
	
print 'Errors',errors

Images in the /data directory at: https://github.com/psychemedia/ScoDa_oil/tree/master/shell-ng

The important line of code in the above is:

pdfimages -j FILENAME OUTPUT_STUB

FILENAME is the PDF you want to extract the images from, OUTPUT_STUB sets the main part of the name of the image files. pdfimages is actually a command line tool, which is why we need to run it from the Python script using the os.system call. (I’m running on a Mac – I have no idea how this might work on a Windows machine!)

pdfimages can be downloaded as part of poppler (I think?!)

See also this Stack Exchange question/answer: Extracting images from a PDF

PS to put this data to work a little, I wondered about using the data to generate a WordPress blog with one post per spill.

http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html provides a Python API. First thoughts were:

– generate post containing images and body text made up from data in the associated line from the CSV file.

Example data:

Date Reported: 02-Jan-13
Incident Site: 10″ Diebu Creek – Nun River Pipeline at Onyoma
JIV Date: 05-Jan-13
Terrain: Swamp
Cause: Sabotage/Theft
Estimated Spill Volume (bbl): 65
Clean-up Status / Comments: Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013. Site Certification was completed on 28th June 2013.
Photo: http://s06.static-shell.com/content/dam/shell-new/local/country/nga/downloads/pdf/oil-spills/911964_10in_DiebuCreek-NunRiver_pipeline_at_Onyoma_Photos.pdf

So we can pull this out for the body post. We can also parse the image PDF to get the JIV ID. We don’t have lat/long (nor northing/easting) though, so no maps unless we try a crude geocoding of the incident site column (column 2).

A lot of the incidents appear to start with a pipe diameter, so we can maybe pull this out too (eg 10″ in the example above).

We can use things like the cause, terrain, est. spill volume (as a range?), and maybe also an identified pipe diameter, to create tags or categories for the post. This allows us to generate views over particular posts (eg all posts relating to theft/sabotage).
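Here’s a rough sketch of what that might look like using the python-wordpress-xmlrpc library linked above. The blog URL, credentials and CSV column names are all placeholders (the column names are guesses based on the table headings above), and I haven’t tested how well the library copes with posting a few hundred items in one go:

import pandas as pd
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.methods.posts import NewPost

#Placeholders - substitute real blog details
client = Client('http://example.wordpress.site/xmlrpc.php', 'USERNAME', 'PASSWORD')

df = pd.read_csv('shell_30_11_13_ng.csv')

for ix, row in df.iterrows():
    post = WordPressPost()
    #Illustrative column names - use whatever the scraped CSV actually calls them
    post.title = str(row['Incident Site'])
    post.content = 'Date reported: {0}. Cause: {1}. Terrain: {2}. Estimated spill volume (bbl): {3}.'.format(
        row['Date Reported'], row['Cause'], row['Terrain'], row['Estimated Spill Volume (bbl)'])
    #Use cause and terrain as tags so we can pull up e.g. all sabotage/theft incidents together
    post.terms_names = {'post_tag': [str(row['Cause']), str(row['Terrain'])],
                        'category': ['Oil spill']}
    post.post_status = 'publish'
    client.call(NewPost(post))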

There are several dates contained in the data and we may be able to do something with these – eg to date the post, or maybe as the basis for a timeline view over all the data. We might also be able to start collecting stats on eg the difference between the date reported (col 1) and the JIV date (col 3), or where we can scrape it, look for structure in the clean-up status field. For example:

Recovery of spilled volume commenced on 6th January 2013 and was completed on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.

If those phrases are common/templated refrains, we can parse the corresponding dates out?
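For example, here’s a minimal sketch of pulling the commencement and completion dates out of that sort of sentence with a regular expression (assuming the wording really is that consistent across the reports):

import re
from datetime import datetime

text = ('Recovery of spilled volume commenced on 6th January 2013 and was completed '
        'on 22nd January 2013. Cleanup of residual impacted area was completed on 9th May 2013.')

#Find phrases of the form "commenced/completed on 6th January 2013"
pattern = r'(commenced|completed) on (\d{1,2})(?:st|nd|rd|th)? (\w+) (\d{4})'

for event, day, month, year in re.findall(pattern, text):
    date = datetime.strptime('{0} {1} {2}'.format(day, month, year), '%d %B %Y')
    print(event, date.strftime('%Y-%m-%d'))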

I should probably also try to pull out the caption text from the image PDF [DONE in code on github] and associate it with a given image? This would be useful for any generated blog post too?

Creating Data from Text – Regular Expressions in OpenRefine

Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).

Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.

More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)

Viva F1 - penalties - messy data

The VivaF1 site also publishes a visualisation summarising penalty outcomes incurred by each driver:

VIVA F1 - DISPLAY OF PENALTIES

The recreational data puzzle I set myself was this: how can we get the data contained in the descriptive sentences about the penalties into a data table that could be used to ask questions about different rule infractions, and the penalty outcomes applied, and allow for the ready generation of visualisations around that data?

The tool I opted to use was OpenRefine; and the predominant technique for getting the data out of the sentences and into data columns? Regular expressions. (If regular expressions are new to you, think: search and replace on steroids. There’s a great tutorial that introduces the basics here: Everyday text patterns.)

What follows is a worked example that shows how to get the “data” from the VivaF1 site into a form that looks more like this:

data types

Not every row is as tidy as it could be, but there is often a trade off in tidying data between trying to automate every step, and automating steps that clean the majority of the data, leaving some rows to tidy by hand…

So where to start? The first step is getting the “data” into OpenRefine. To do this we can just select the text on the VivaF1 penalties-by-race page, copy it and paste it into the Clipboard import area of a new project in OpenRefine:

Paste in the data

We can then import the data as line items, ignoring blank lines:

import as line based files

The first step I’m going to take in tidying up the data is to generate a column that contains the race name:

Pull out race name

The expression if(value.contains('Prix'),value,'') finds the rows that have the title of the race (they all include “Grand Prix” in their name) and creates a new column containing matches. (The expression reads as follows: if the original cell value contains ‘Prix’, copy the cell value into the corresponding cell in the new column, else copy across an empty string/nothing, that is, ''.) We can then Fill Down on the race column to associate each row with a particular race.

We can also create another new column containing the description of each penalty notice with a quick tweak of the original expression: if(value.contains('Prix'),'',value). (If we match “Prix”, copy an empty string, else copy the penalty notice.)

Copy the penalty notice

One of the things that we notice is there are some notices that “Overflow” on to multiple lines:

Missed lines...

We can filter using a regular expression that finds Penalty Notice lines that start (^) with a character that does not match a – ([^-]):

Find rows that donl;t start with a -

Looking at the row numbers, we see several of the rows are consecutive – we can edit these cells to move all the text into a single cell:

OpenRefine edit

Cut and paste as required…

edit a cell

Looking down the row number column (left hand most column) we see that rows 19, 57 and 174 are now the overflow lines. Remove the filter and, in the whole listing, scroll to the appropriate part of the data table and cut the data out of the overflow cell and paste it into the line above.

Cut and append

By chance, I also notice that using “Prix” to grab just race names was overly optimistic!

overly agressive

Here’s how we could have checked – used the facet as text option on the race column…

Facet on race name

Phew – that was the only rogue! In line 56, cut the rogue text from the Race column and paste it into the penalty notice column. Then also paste in the additional content from the overflow lines.

Tidy up the big overflow.

Remove any filters and fill down again on the Race column to fix the earlier error…

The Penalty Notice column should now contain blank lines corresponding to rows that originally described the Grand Prix and overflow rows – facet the Penalty Notice column by text and highlight the blank rows so we can then delete them…

prune blank rows

So where are we now? We have a data file with one row per penalty and columns corresponding to the Grand Prix and the penalty notice. We can now start work on pulling data out of the penalty notice sentences.

If you inspect the sentences, you will see they start with a dash, then have the driver name and the team in brackets. Let’s use a regular expression to grab that data:

value.match(/- ([^\(]*)\s\(([^\)]*)\).*/).join('::')

Drivername grab

Here’s the actual regular expression: - ([^\(]*)\s\(([^\)]*)\).* It reads as follows: match a – followed by a space, then grab any characters that don’t contain an open bracket ([^\(]*) and that precede a space \s followed by an open bracket \(: all together ([^\(]*)\s\( That gives the driver name as the first matched pattern. Then grab the team – this is whatever appears before the first close bracket: ([^\)]*)\) Finally, match all remaining characters to the end of the string .*

The two matches are then joined using ::

drivenae captured
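(If GREL isn’t your thing, much the same pattern works in Python; here’s a quick sketch using a made-up penalty line just to show the shape of the match:)

import re

#A made-up line in the same style as the VivaF1 penalty notices
notice = '- A Driver (Some Team) fined €10,000 for speeding in the pit lane during FP2'

m = re.match(r'- ([^\(]*)\s\(([^\)]*)\).*', notice)
if m:
    driver, team = m.group(1), m.group(2)
    print(driver + '::' + team)   #A Driver::Some Team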

We can then split these values to give driver and team columns:

driveteam split

Learning from our previous error, we can use the text facet tool on the driver and team columns just to check the values are in order – it seems like there is one oops in the driver column, so we should probably edit that cell and remove the contents.

facet on name

We can also check the blank lines to see what’s happening there – in this case no driver is mentioned but a team is, but that hasn’t been grabbed into the team column, so we can edit it here:

tweak team

We can also check the text facet view of the team column to make sure there are no gotchas, and pick up/correct any that did slip through.

So now we have a driver column and a team column too (it’s probably worth changing the column names to match…)

team and driver

Let’s look at the data again – what else can we pull out? How about the value of any fine? We notice that fine amounts seem to appear at the end of the sentence and be preceded by the word fined, so we can grab data on that basis, then replace the euro symbol, strip out any commas, and cast the result to a number type: value.match(/.* fined (.*)/)[0].replace(/[€,]/,'').toNumber()

pull out fines

We can check for other fines by filtering the Penalty Notice column on the word fine (or the Euro symbol), applying a number facet to the Fine column and looking for blank rows in that column.

tidy up the fines

Add in fine information by hand as required:

edit number

So now we have a column that has the value of fines – which means if we export this data we could do plots that show fines per race, or fines per driver, or fines per team, or calculate the average size of fines, or the total number of fines, for example.
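For example, once the table has been exported as a CSV file, a quick sketch of those summary reports in pandas might look something like this (the file name is a placeholder, and the Race, Driver, Team and Fine column names are assumed to be the ones we set up above – check them against your own export):

import pandas as pd

penalties = pd.read_csv('f1_penalties.csv')  #the table exported from OpenRefine

#Number, total and average size of fines by driver
print(penalties.groupby('Driver')['Fine'].agg(['count', 'sum', 'mean']))

#Total fines by race and by team
print(penalties.groupby('Race')['Fine'].sum())
print(penalties.groupby('Team')['Fine'].sum())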

What other data columns might we pull out? How about the session? Let’s look for phrases that identify free practice sessions or qualifying:

Session extraction

Here’s the regular expression: value.match(/.*((FP[\d]+)|(Q[\d]+)|(qualifying)).*/i)[0].toUppercase() Note how we use the pipe symbol | to say ‘look for one pattern OR another’. We can cast everything to uppercase just to help normalise the values that appear. And once again, we can use the Text Facet to check that things worked as we expected:

facet session

So that’s a column with the session the infringement occurred in (I think! We’d need to read all the descriptions to make absolutely sure!)

What else? There’s another set of numbers that appears in some of the notices – speeds. Let’s grab those into a new column – look for a space, followed by numbers or decimal points, and then a space and km/h, grabbing the numbers of interest and casting them to a number type:

value.match(/.*\s([\d\.]+) km\/h.*/)[0].toNumber()

Speeding

So now we have a speed column. Which means we could start to look at speed vs fine scatterplots, perhaps, to see if there is a relationship. (Note, different pit lanes may have different speed limits.)
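Here’s a rough sketch of that kind of plot over the exported CSV file, again assuming Speed and Fine columns named as above:

import pandas as pd
import matplotlib.pyplot as plt

penalties = pd.read_csv('f1_penalties.csv')

#Only rows where both a speed and a fine were extracted are of interest
speeding = penalties.dropna(subset=['Speed', 'Fine'])

speeding.plot.scatter(x='Speed', y='Fine')
plt.xlabel('Speed (km/h)')
plt.ylabel('Fine (€)')
plt.show()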

What else? It may be worth trying to identify the outcome of each infringement investigation?

value.match(/.*((fine)|(no further action)|([^\d]\d+.place grid.*)|(reprimand)|(drive.thr.*)|(drop of.*)|\s([^\s]+.second stop and go.*)|(start .*from .*)).*/i)[0].toUppercase()

Outcome grab

Here’s where we’re at now:

useful data so far

If we do a text facet on the outcome column, we see there are several opportunities for clustering the data:

Facet on outcome and cluster

We can try other cluster types too:

other clusters

If we look at the metaphone (soundalike) clusters:

other opportunities for clustering

we notice a couple of other things – an opportunity to normalise 5 PLACE GRID DROP as DROP OF 5 GRID POSITIONS, for example:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID PLACES')

Or we might further standardise the outcome of that by fixing on GRID POSITIONS rather than GRID PLACES:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS')

And we might further normalise on numbers rather than number words:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS').replace('TWO','2').replace('THREE','3').replace('FIVE','5').replace('TEN','10')

Clustering again:

further cleaning

In this case, it might make sense to strip out the (IN THIS CASE… statements:

value.replace(/ \(IN THIS.*/,'')
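
In Python terms, the whole chain of tidying steps could be sketched as a set of chained string replacements over a hypothetical Outcome column (mirroring the GREL expressions above, and just as dependent on the particular wordings that happen to appear in the data):

import pandas as pd

df = pd.read_csv("penalties.csv")  # hypothetical export, with an Outcome column

# Harmonise the grid drop wording, swap number words for digits, and strip
# the trailing "(IN THIS CASE..." qualifiers
df["Outcome"] = (df["Outcome"]
    .str.replace("-", " ", regex=False)
    .str.replace(r"(\d+) PLACE GRID DROP", r"DROP OF \1 GRID POSITIONS", regex=True)
    .str.replace("GRID PLACES", "GRID POSITIONS", regex=False)
    .str.replace("TWO", "2", regex=False)
    .str.replace("THREE", "3", regex=False)
    .str.replace("FIVE", "5", regex=False)
    .str.replace("TEN", "10", regex=False)
    .str.replace(r" \(IN THIS.*", "", regex=True))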

Depending on the questions we want to ask, it may be worth splitting out whether penalties like GRID DROPS apply at this event or the next, as well as pulling out generic penalty types (drive through, stop and go, grid drop, etc.).

example of outcome penalty types

Finally, let’s consider what sort of infringement has occurred:

infraction type

If we create a new column from the Infraction column, we can then cluster items into the core infraction type:

cluster core infraction
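
OpenRefine’s clustering is interactive, but to get a rough feel for what the key collision method is doing, here’s a hedged Python sketch of a fingerprint-style grouping over a hypothetical Infraction column (a loose approximation of OpenRefine’s algorithm, not a faithful reimplementation):

import re
import pandas as pd

df = pd.read_csv("penalties.csv")  # hypothetical export, with an Infraction column

def fingerprint(s):
    # Lowercase, strip punctuation, then sort and deduplicate the tokens,
    # so near-duplicate descriptions collapse onto the same key
    tokens = re.sub(r"[^\w\s]", " ", str(s).lower()).split()
    return " ".join(sorted(set(tokens)))

df["infraction_key"] = df["Infraction"].apply(fingerprint)

# Show which raw descriptions share a fingerprint, i.e. candidate clusters
print(df.groupby("infraction_key")["Infraction"].unique())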

After a bit of tidying, we can start to narrow down on a key set of facet levels:

example core infringement

Looking further down the list, there may be additional core infringements we can pull out.

So here’s where we are now:

data types

And here’s where we came from:

Viva F1 - penalties - messy data

Having got the data into a workable form, we can now start to ask questions of it (whilst possibly using those conversations to return to the data and tidy it further as we work with it). But that will have to be the subject of another post…

Quoting Tukey on Visual Storytelling with Data

Time was when I used to be a reasonably competent scholar, digging into the literature chasing down what folk actually said, and chasing forward to see whether claims had been refuted. Then I fell out of love with the academic literature – too many papers that said nothing, too many papers that contained errors, too many papers…

…but as we start production on a new OU course on “data”, I’m having to get back in to the literature so I can defuse any claims that what I want to say is wrong by showing that someone else has said it before (which, if what is said has been peer reviewed, makes it right…).

One thing I’d forgotten about chasing thought lines through the literature was that at times it can be quite fun… and that it can quite often turn up memorable, or at least quotable, lines.

For example, last night I had a quick skim through some papers by John Tukey, a folk hero of the statistics world, and turned up the following:

The habit of building one technique on another — of assembling procedures like something made of erector-set parts — can be especially useful in dealing with data. So too is looking at the same thing in many ways or many things in the same way; an ability to generalize in profitable ways and a liking for a massive search for order. Mathematicians understand how subtle assumptions can make great differences and are used to trying to trace the paths by which this occurs. The mathematician’s great disadvantage in approaching data is his—or her—attitude toward the words “hypothesis” and “hypotheses”.

When you come to deal with real data, formalized models for its behavior are not hypotheses in the mathematician’s sense… . Instead these formalized models are reference situations—base points, if you like — things against which you compare the data you actually have to see how it differs. There are many challenges to all the skills of mathematicians — except implicit trust in hypotheses — in doing just this.

Since no model is to be believed in, no optimization for a single model can offer more than distant guidance. What is needed, and is never more than approximately at hand, is guidance about what to do in a sequence of ever more realistic situations. The analyst of data is lucky if he has some insight into a few terms of this sequence, particularly those not yet mathematized.

Picturing of Data

Picturing of data is the extreme case. Why do we use pictures? Most crucially to see behavior we had not explicitly anticipated as possible – for what pictures are best at is revealing the unanticipated; crucially, often as a way of making it easier to perceive and understand things that would otherwise be painfully complex. These are the important uses of pictures.

We can, and too often do, use picturing unimportantly, often wastefully, as a way of supporting the feeble in heart in their belief that something we have just found is really true. For this last purpose, when and if important, we usually need to look at a summary.

Sometimes we can summarize the data neatly with a few numbers, as when we report:
– a fitted line—two numbers,
– an estimated spread of the residuals around the “best” line—one more number,
– a confidence interval for the slope of the “best” line—two final numbers.

When we can summarize matters this simply in numbers, we hardly need a picture and often lose by going to it. When the simplest useful summary involves many more numbers, a picture can be very helpful. To meet our major commitment of asking what lies beyond, in the example asking “What is happening beyond what the line describes!”, a picture can be essential.

The main tasks of pictures are then:
– to reveal the unexpected,
– to make the complex easier to perceive.

Either may be effective for that which is important above all: suggesting the next step in analysis, or offering the next insight. In doing either of these there is much room for mathematics and novelty.

How do we decide what is a “picture” and what is not? The more we feel that we can “taste, touch, and handle” the more we are dealing with a picture. Whether it looks like a graph, or is a list of a few numbers is not important. Tangibility is important—what we strive for most.

[Tukey, John W. “Mathematics and the picturing of data.” In Proceedings of the international congress of mathematicians, vol. 2, pp. 523-531. 1975.]

Wonderful!

Or how about these quotes from Tukey, John W. “Data-based graphics: visual display in the decades to come.” Statistical Science 5, no. 3 (1990): 327-339?

Firstly, on exploratory versus explanatory graphics:

I intend to treat making visual displays as something done by many people who want to communicate – often, on the one hand, to communicate identified phenomena to others, and often, on the other, to communicate unidentified phenomena to themselves. This broad clientele needs a “consumer product,” not an art course. To focus on a broad array of users is not to deny the existence of artists of visual communication, only to recognize how few they are and how small a share in the total volume of communication they can contribute. For such artists, very many statements that follow deserve escape clauses or caveats.

More thoughts on exploration and explanation (transfer of recognition), as well as a distinction between exploration and prospecting:

We all need to be clear that visual display can be very effective in serving two quite different functions, but only if used in correspondingly different ways. On the one hand, it can be essential in helping – or, even, in permitting – us to search in some data for phenomena, just as a prospector searches for gold or uranium.

Our task differs from the usual prospector’s task, in that we are concerned both with phenomena that do occur and with those that might occur but do not. On the other hand, visual display can be very helpful in transferring (to reader, viewer or listener) a recognition of the appearances that indicate the phenomena that deserve report. Indeed, when sufficient precomputation drives appropriately specialized displays, visual display can even also convey the statistical significances or non-significances of these appearances.

There is no reason why a good strategy for prospecting will also be a good strategy for transfer. We can expect some aspects of prospecting strategy (and most techniques) to carry over, but other aspects may not. (We say prospecting because we are optimistic that we may in due course have good lists of possible phenomena. If we do not know what we seek, we are “exploring” not “prospecting.”)

One major difference is in prospecting’s freedom to use multiple pictures. If it takes 5 or 10 kinds of pictures to adequately explore one narrow aspect of the data, the only question is: Will 5 or 10 pictures be needed, or can we condense this to 3 or 4 pictures, without appreciable loss? If “yes” we condense; if “no” we stick to the 5 or 10.

If it takes 500 to 1000 (quite different) pictures, however, our choice can only be between finding a relatively general way to do relatively well with many fewer pictures and asking the computer to sort out some number, perhaps 10 or 20 pictures of “greatest interest.”

In doing transfer, once we have one set of pictures to do what is needed, economy of paper (or plastic) and time (to mention or to read) push even harder toward “no more pictures than needed.” But, even here, we must be very careful not to insist, as a necessity not a desideratum, that a single picture can do it all. If it takes two pictures to convey the message effectively, we must use two.

For prospecting, we will want a bundle of pictures, probably of quite different kinds, so chosen that someone will reveal the presence of any one of possible phenomena, of potentially interesting behaviors, which will often have to be inadequately diverse. Developing, and improving, bundles of pictures for selected combinations of kinds of aspects and kinds of situations will be a continuing task for a long time.

For transfer, we will need a few good styles to transfer each phenomenon of possible importance, so that, when more than one phenomenon deserves transfer, we can choose compatible styles and try to transfer two, or even all, of these phenomena in a single picture. (We can look for the opportunity to do this, but, when we cannot find it, we will use two, or more, pictures as necessary.)

For prospecting, we look at long lists of what might occur – and expect to use many pictures. For transfer, we select short lists of what must be made plain – and use as few pictures as will serve us well.

Then on the power of visual representations, whilst recognising they may not always be up to the job…:

The greatest possibilities of visual display lie in vividness and inescapability of the intended message. A visual display can stop your mental flow in its tracks and make you think. A visual display can force you to notice what you never expected to see. (“Why, that scatter diagram has a hole in the middle!”) On the other hand, if one has to work almost as hard to drag something out of a visual display as one would to drag it out of a table of numbers, the visual display is a poor second to the table, which can easily provide so much more precision. (Here, as elsewhere, artists may deserve an escape clause.)

On design:

Another important aspect of impact is immediacy. One should see the intended at once; one should not even have to wait for it to gradually appear. If a visual display lacks immediacy in thrusting before us one of the phenomena for whose presentation it had been assigned responsibility, we ought to ask why and use the answer to modify the display so its impact will be more immediate.

(For a great example of how to progressively refine a graphic to support the making of a particular point, see this Storytelling With Data post on multifaceted data and story.)

Tukey, who we must recall was writing at a time when powerful statistical graphics tools such as ggplot were yet to be implemented, also suggests that lessons are to be learned from graphic design for the production of effective statistical charts:

The art of statistical graphics was for a long time a pen-and-pencil cottage industry, with the top professionals skilled with the drafting or mapping pen. In the meantime, graphic designers, especially for books, have had access to different sorts of techniques (the techniques of graphic communication), such as grays of different screen weights, against which, for instance, both white and black lines (and curves) are effective. They also have a set of principles shared in part with the Fine Arts (some written down by Leonardo da Vinci). I do not understand all this well enough to try to tell you about “visual centers” and how attention usually moves when one looks at a picture, but I do know enough to find this area important – and to tell you that more of us need to learn a lot more about it.

Data – what is it good for?

Almost everything we do with data involves comparison – most often between two or more values derived from the data, sometimes between one value derived from the data and some mental reference or standard. The dedication of Richard Hamming’s book on numerical analysis reads “The purpose of computation is insight, not numbers.” We need a book on visual display that at least implies “The purpose of display is comparison (recognition of phenomena), not numbers.”

Tukey also encourages us to think about what data represents, and how it is represented:

Much of what we want to know about the world is naturally expressed as phenomena, as potentially interesting things that can be described in non numerical words. That an economic growth rate has been declining steadily throughout President X’s administration, for example, is a phenomenon, while the fact that the GNP has a given value is a number. With exceptions like “I owe him 27 dollars!” numbers are, when we look deeply enough, mainly of interest because they can be assembled, often only through analysis, to describe phenomena. To me phenomena are the main actors, numbers are the supporting cast. Clearly we most need help with the main actors.

If you really want numbers, presumably for later assembly into a phenomenon, a table is likely to serve you best. The graphic map of Napoleon’s incursion into Russia that so stirs Tufte’s imagination and admiration does quite well in showing the relevant phenomena, in giving the answers to “About where?”, “About when?” and “With roughly what fraction of the original army left?” It serves certain phenomena well. But if we want numbers, we can do better either by reading the digits that may be attached to the graphic – a simple but often effective form of table – or by going to a conventional table.

The questions that visual display (in some graphic mode) answers best are phenomenological (in the sense of the first sentence of this section). For instance:

* Is the value small, medium or large?
* Is the difference, or change, up, down or neutral?
* Is the difference, or change, small, medium or large?
* Do the successive changes grow, shrink or stay roughly constant?
* What about change in ratio terms, perhaps thought of as percent of previous?
* Does the vertical scatter change, as we move from left to right?
* Is the scatter pattern doughnut-shaped?

One way that we will enhance the usefulness of visual display is to find new phenomena of potential interest and then learn how to make displays that will be likely to reveal them, when they are present.

The absence of a positive phenomenon is itself a phenomenon! Such absences as:

* the values are all about the same
* there does not seem to be any definite curvature
* the vertical scatter does not seem to change, as we go from left to right!

are certainly potentially interesting. (We can all find instances where they are interesting.) Thus they are, honestly, phenomena in themselves. We need to be able to view apparent absence of specific phenomena effectively as well as noticing them when they are present! This is one of the reasons why fitting scatter plots with summarizing devices like middle traces (Tukey, 1977a, page 279 ff.) can be important.

Phenomena are also picked up later in the paper:

A graph or chart should not be just another form of table, in which we can look up the facts. If it is to do its part effectively, its focus – or so I believe – will have to be one or more phenomena.

Indeed, the requirement that we can directly read values from a chart seems to be something Tukey takes issue with:

As one who denies “reading off numbers” as the prime purpose of visual display, I can only denounce evaluating displays in terms of how well (given careful study) people read numbers off. If such an approach were to guide us well, it would have to be a very unusual accident.

He even goes so far as to suggest that we might consider being flexible in the way we geometrically map from measurement scales to points on a canvas, moving away from proportionality if that helps us see a phenomenon better:

The purpose of display is to make messages about phenomena clear. There is no place for a doctrinaire approach to “truth in geometry.” We must be honest and say what we did, but this need not mean plotting raw data.

The point is that we have a choice, not only in (x, y)-plots, but more generally. Planned disproportionality needs to be a widely-available option, one that requires the partnership of computation and display.

Tukey is also willing to rethink how we use familiar charts:

Take simple bar charts as an example, which I would define much more generally than many classical authors. Why have they survived? Not because they are geometrically true, and not because they lead to good numerical estimates by the viewer! In my thoughts, their virtue lies in the fact that we can all compare two bars, perhaps only roughly, in two quite different ways, both “About how much difference?” and “About what ratio?” The latter, of course, is often translated into “About how much percent change?” (Going on to three or more successive bars, we can see globally whether the changes in amount are nearly the same, but asking the same question about ratios – rather than differences – requires either tedious assessment of ratios between adjacent bars for one adjacent pair after another…

So – what other Tukey papers should I read?

Recreational Data

Part of my weekend ritual is to buy the weekend papers and have a go at the recreational maths problems that are Sudoku and Killer. I also look for news stories with a data angle that might prompt a bit of recreational data activity…

In a paper that may or may not have been presented at the First European Congress of Mathematics in Paris, July, 1992, Prof. David Singmaster reflected on “The Unreasonable Utility of Recreational Mathematics”.

unreasonableUtility

To begin with, it is worth considering what is meant by recreational mathematics.

First, recreational mathematics is mathematics that is fun and popular – that is, the problems should be understandable to the interested layman, though the solutions may be harder. (However, if the solution is too hard, this may shift the topic from recreational toward the serious – e.g. Fermat’s Last Theorem, the Four Colour Theorem or the Mandelbrot Set.)

Secondly, recreational mathematics is mathematics that is fun and used either as a diversion from serious mathematics or as a way of making serious mathematics understandable or palatable. These are the pedagogic uses of recreational mathematics. They are already present in the oldest known mathematics and continue to the present day.

These two aspects of recreational mathematics – the popular and the pedagogic – overlap considerably and there is no clear boundary between them and “serious” mathematics.

How is recreational mathematics useful?

Firstly, recreational problems are often the basis of serious mathematics. The most obvious fields are probability and graph theory where popular problems have been a major (or the dominant) stimulus to the creation and evolution of the subject. …

Secondly, recreational mathematics has frequently turned up ideas of genuine but non-obvious utility. …

Anyone who has tried to do anything with “real world” data knows how much of a puzzle it can represent: from finding the data, to getting hold of it, to getting it into a state and a shape where you can actually work with it, to analysing it, charting it, looking for pattern and structure within it, having a conversation with it, getting it to tell you one of the many stories it may represent, there are tricks to be learned and problems to be solved. And they’re fun.

An obvious definition [of recreational mathematics] is that it is mathematics that is fun, but almost any mathematician will say that he enjoys his work, even if he is studying eigenvalues of elliptic differential operators, so this definition would encompass almost all mathematics and hence is too general. There are two, somewhat overlapping, definitions that cover most of what is meant by recreational mathematics.

…the two definitions described above.

So how might we define “recreational data”? For me, recreational data activities are, in whole or in part, data investigations, involving one or more steps of the data lifecycle (discovery, acquisition, cleaning, analysis, visualisation, storytelling). They are the activities I engage in when I look for, or behind, the numbers that appear in a news story. They’re the stories I read on FullFact, or listen to on the OU/BBC co-pro More or Less; they’re at the heart of the beautiful little book that is The Tiger That Isn’t; recreational data is what I do in the “Diary of a Data Sleuth” posts on OpenLearn.

Recreational data is about the joy of trying to find stories in data.

Recreational data is, or can be, the data journalism you do for yourself or the sense you make of the stats in the sports pages.

Recreational data is a safe place to practice – I tinker with Twitter and formulate charts around Formula One. But remember this: “recreational problems are often the basis of serious [practice]“. The “work” I did around Visualising Twitter User Timeline Activity in R? I can (and do) reuse that code as the basis of other timeline analyses. The puzzle of plotting connected concepts on Wikipedia I described in Visualising Related Entries in Wikipedia Using Gephi? It’s a pattern I can keep on playing with.

If you think you might like to do some data doodling of your own, why not check out the School Of Data? Or watch out on OpenLearn for some follow-up stories from the OU/BBC co-pro of Hans Rosling’s award winning Don’t Panic.