Chit Chat with New Datasets – Facets in OpenRefine (was Google Refine)

One of the many ways of using OpenRefine (formerly Google Refine) is as a toolkit for getting a feel for the range of variation contained within a dataset using the various faceting options. In the sense of analysis being a conversation with data, this is a bit like the idle chit-chat/getting-to-know-you phase, a precursor to a full-blown conversation.

Faceted search, or faceted browsing/navigation, typically applies a set of filters to a set of search results, restricting the displayed results to those that fulfil certain conditions. In a library catalogue, the facets might refer to metadata fields such as publication date, thus allowing a user to search within a given date range, or publisher:

Where the facet relates to a categorical variable – that is, where there is a set of unique values that the facet can take (such as the names of different publishers) – a view of the facet values will show the names of the different publishers extracted from the original search results. Selecting a particular publisher, for example, will then limit the displayed results to just those results associated with that publisher. For numerical facets, where the quantities associated with the facet relate to a number or date (that is, a set of things that have a numerical range), the facet view will show the full range of values contained within that particular facet. The user can then select a subset of results that fall within a specified part of that range.

In the case of OpenRefine, facets can be defined on a per-column basis. For categorical facets, Refine will identify the set of unique values associated with a particular faceted view that are contained within a column, along with a count of how many times each facet value occurs throughout the column. The user can then choose to view only those rows with a particular (facet selected) value in the faceted column. For columns that contain numbers, Refine will generate a numerical facet that spans the range of values contained within the column, along with a histogram that provides a count of occurrences of numbers within small ranges across the full range.

So what faceting options does OpenRefine provide?

Here’s how they work (data used for the examples comes from Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi and JSON import from the Twitter search API…):

– exploring the set of categories described within a column using the text facet:

Faceted views also allow you to view the facet values by occurrence count, so it’s easy to see what the most popular facet values are:

You can also get a tab separated list of facet values:

Sometimes it can be useful to view rows associated with facet values that occur a particular number of times, particularly at the limits (for example, very popular facet values, or uniquely occurring facet values):

– looking at the range of numerical values contained in a column using the numeric facet:

– looking at the distribution over time of column contents using the timeline facet:

Faceting by time requires time-related strings to be parsed as such; sometimes, Refine needs a little bit of help in interpreting an imported string as a time string. So for example, given a “time” string such as Mon, 29 Oct 2012 10:56:52 +0000 from the Twitter search API, we can use the GREL function toDate(value,"EEE, dd MMM y H:m:s") to create a new column with time-cast elements.
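For example, using Edit column – Add column based on this column on the column containing the raw time string, a minimal sketch of the expression to enter would be:

toDate(value, "EEE, dd MMM y H:m:s")

The resulting column can then be used with the timeline facet described above.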

(See GRELDateFunctions and the Java SimpleDateFormat class documentation for more details.)

– getting a feel for the correlation of values across numerical columns, and exploring those correlations further, using the scatterplot facet.

This generates a view containing a set of scatterplots relating to pairwise combinations of all the numerical columns in the dataset:

Clicking on one of these panels allows you to filter points within a particular area of the corresponding scatter chart (click and drag a rectangular area over the points you want to view), effectively allowing you to filter the data across related ranges of two numerical columns at the same time:

A range of customisable faceting options are also provided that allow you to define your own faceting functions:

  • the Custom text… facet;
  • the Custom Numeric… facet

More conveniently, a range of predefined Customized facets are provided that offer shortcuts to “bespoke” faceting functions:

So for example:

  • the word facet splits strings contained in cells into single words, counts their occurrences throughout the column, and then lists unique words and their occurrence count in the facet panel. This faceting option thus provides a way of selecting rows where the contents of a particular column contain one or more specified words. (The user defined GREL custom text facet ngram(value,1) provides a similar (though not identical) result – duplicated words in a cell are identified as unique by the single word ngram function; see also split(value," "), which does seem to replicate the behaviour of the word facet function.)
  • the duplicates facet returns boolean values of true and false; filtering on true values returns all the rows that have duplicated values within a particular column; filtering on false displays all unique rows.
  • the text length facet produces a facet based on the character count(?) of strings in cells within the faceted column; the custom numeric facet length(value) achieves something similar; the related measure, word count, can be achieved using the custom numeric facet length(split(value," "))

Note that facet views can be combined. Selecting multiple rows within a particular facet panel provides a Boolean OR over the selected values (that is, if any of the selected values appear in the column, the corresponding rows will be displayed). To AND conditions, even within the same facet, create a separate facet panel for each ANDed condition.

PS On the OpenRefine (was Google Refine) name change, see From Freebase Gridworks to Google Refine and now OpenRefine. The code repository is now on github: OpenRefine Repository. I also notice that openrefine.org/ has been minted and is running a placeholder instance of WordPress. I wonder if it would be worth setting up an aggregator for community posts, a bit like R-Bloggers (for example, I have an RStats category feed from this blog that I syndicate to the R-Bloggers aggregator, and have just created an OpenRefine category that could feed an OpenRefinery aggregator channel).

PPS for an example of using OpenRefine to find differences between two recordsets, see Owen Stephens’ Using Open Refine for e-journal data.

Finding (Nearly) Duplicate Items in a Data Column

[WARNING – THIS IS A *BAD ADVICE* POST – it describes a trick that sort of works, but the example is contrived and has a better solution – text facet and then cluster on facet (h/t to @mhawksey’s Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv] for making me squirm so much through that oversight I felt the need to post this warning…]

Suppose you have a dataset containing a list of Twitter updates, and you are looking for tweets that are retweets or modified retweets of the same original tweet. The OpenRefine Duplicate custom facet will identify different row items in that column that are exact duplicates of each other, but what about when they just don’t quite match: for example, an original tweet and its appearance in an RT (where the retweet string contains RT and the name of the original sender), or an MT, where the tweet may have been shortened, or an RT of an RT, or an RT with an additional hashtag. Here’s one strategy for finding similar-ish tweets, such as popular retweets, in the data set using custom text facets in OpenRefine.

The ngram GREL function generates a list of word ngrams of a specified length from a string. If you imagine a sliding window N words long, the first ngram will be the first N words in the string, the second ngram will run from the second word to the (N+1)th word, and so on:
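For example, a custom text facet along the lines of the following (where value is the tweet text in the column being faceted) generates the word 3-grams for each cell:

ngram(value, 3)

Applied to a string such as “the quick brown fox jumps”, it should return an array along the lines of [“the quick brown”, “quick brown fox”, “brown fox jumps”].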

If we can reasonably expect word sequences of length N to appear in our “duplicate-ish” strings, we can generate a facet on ngrams of that length.

It may also be worth experimenting with combining the ngram function with the fingerprint GREL function. The fingerprint function identifies unique words in a string, reduces them to lower case, and then sorts them in alphabetical order:

If we generate the fingerprint of a string, and then run the ngram function, we generate ngrams around the alphabetically ordered fingerprint terms:
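As a sketch, the combined custom text facet expression would then be something like:

ngram(fingerprint(value), 2)

so that two tweets containing much the same words in a different order should still share some common ngram facet values.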

For a sizeable dataset, and/or long strings, it’s likely that we’ll get a lot of ngram facet terms:

We could list them all by setting an appropriate choice count limit, or we can limit the facet items displayed by showing the choice counts, setting the range slider to show only those facet values that appear in a large number of rows, for example, and then ordering the facet items by count:

Even if tweets aren’t identical, if they contain common ngrams we can pull them out.

Note that we might then order the tweets as displayed in the table using time/date order (a note on string-to-time format conversions can be found in the Facets in OpenRefine post above):

Or alternatively, we might choose to view them using a time facet:

Note that when you set a range in the time facet, you can then click on it and slide it as a range, essentially providing a sliding time window control for viewing records that appear over a given time range/duration.

Geocoding Using the Google Maps Geocoder via OpenRefine

On my to do list in the data education game is an open-ended activity based on collecting examples of how to do common tasks using a variety of different tools and techniques. One such task is geocoding: getting latitude and longitude co-ordinates for a location based on its address or postcode. A couple of recent posts on the School of Data blog provide an Introduction to Geocoding and a howto on geocoding in Google spreadsheets. Here’s another approach, geocoding using OpenRefine [download and install OpenRefine] (see also Paul Bradshaw’s related post from a couple of years ago: Getting Full Addresses for Data from an FOI Response (Using APIs)).

First, we need to grab some data. I’m going to use a data file from the Food Standards Agency that contains food standards ratings for food outlets within a particular local council area. This data file is a relatively simple XML format that OpenRefine should have no trouble handling… Let’s get the data directly from the data file’s web address:

GetSomeDataIntoRefine

OpenRefine will now download the file and give us a preview. What we need to do now is highlight an example of a record element – OpenRefine will create one row for each such record in the original file.

Highlight the record element

(For the purposes of this example, it might have been easier to just select the postcode element within each establishment record.)

Click “Create Project” in the top right corner, and the data will be imported into a tabular view:

we have data - including postcodes

If we look along the columns, we notice a postcode element – we can use that as the basis for geocoding. (Yes, I know, I know, this dataset already has latitude and longitude data, so we don’t really need to get it for ourselves;-) But I thought this datafile was a handy source of postcodes;-) What we’re going to do now is ask OpenRefine to create a new column that will be populated with data from the Google Maps geocoding API, giving the latitude and longitude for each postcode:

weshalladdacolumn

We can pass a postcode to the Google maps geocoder using a URL (web address) constructed as follows:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE"

(We could force the locale – that the postcode is in the UK – by using "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE,UK".)

*IF YOU WANT OPEN GEO-CODED DATA, GOOGLE MAPS WON’T CUT IT…*

Note also that we need to “escape” the postcode so that it works in the URL (this essentially means handling spaces in the postcode). To do this, we will construct a URL using the following GREL programming language statement:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")

or alternatively:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")+",UK"

What this does is build up the URL from the value of the cells in the column from which our new column is derived. The escape(value,"url") command makes sure that the postcode value is correctly encoded for use in a url.
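So, for example, for a postcode such as NG4 4DL (one of the postcodes looked up in this dataset), the expression should generate a URL along the lines of:

http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=NG4+4DL

with the space in the postcode encoded (as + or %20) so that the URL remains valid.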

generate the geocoding URL

We can now let OpenRefine construct a URL containing the postcode for each row in the data file and load the requested data in from the URL (the throttle delay sets the rate at which OpenRefine will make repeated calls to the API – many APIs have a limit both on the rate at which you can call them (once a second, for example) and on the total number of calls per hour, or day). The returned data will be added to the appropriate row in the new column:

example geo

This looks a bit confusing, doesn’t it? However, it does have some structure to it. Here’s how the data unpacks if we lay it out a bit better:

{ "results" : [
   { "address_components" : [
     { "long_name" : "NG4 4DL",
       "short_name" : "NG4 4DL",
       "types" : [ "postal_code" ] 
     },
     { "long_name" : "Gedling",
       "short_name" : "Gedling",
       "types" : [ "locality", "political" ]
     },
     { "long_name" : "Nottinghamshire",
       "short_name" : "Notts",
       "types" : [ "administrative_area_level_2", "political" ]
     },
     { "long_name" : "United Kingdom",
       "short_name" : "GB", "types" : [ "country", "political" ] 
     },
     { "long_name" : "Nottingham",
       "short_name" : "Nottingham",
       "types" : [ "postal_town" ] 
     }
     ],
     "formatted_address" : "Gedling, Nottinghamshire NG4 4DL, UK",
     "geometry" : {
         "bounds" : {
             "northeast" : { "lat" : 52.97819320, "lng" : -1.09096230 },
             "southwest" : { "lat" : 52.9770690, "lng" : -1.0924360 } 
         },
         "location" : {
           "lat" : 52.97764640,
           "lng" : -1.09182150 
         },
         "location_type" : "APPROXIMATE",
         "viewport" : {
           "northeast" : { "lat" : 52.97898008029150, "lng" : -1.090350169708498 },
           "southwest" : { "lat" : 52.97628211970850, "lng" : -1.093048130291502 }
         }
       },
       "types" : [ "postal_code" ] 
    }
 ], "status" : "OK" }

(For a little more insight into the structure of data like this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure.)

So OpenRefine has pulled back some data – we can now generate another column or two to extract the latitude and longitude the Google Maps geocoding API sent back:

colfromcol

We need to find a way of extracting the data element we want from the data we got back from the geocoding API – we can use another GREL programming language statement to do this:

openrefine jsonparse

Here’s how to construct that statement (try to read the following with reference to the laid out data structure shown above…):

  • value.parseJson() parses the data text (the value of the cell from the original column) into a data element we can work with programmatically;
  • value.parseJson().results[0] gets the data in the first “result” (computer programs generally start counting from 0 rather than 1…);
  • value.parseJson().results[0]["geometry"] says get the “geometry” element from the first result;
  • value.parseJson().results[0]["geometry"]["location"] says get the location element from within that geometry element;
  • value.parseJson().results[0]["geometry"]["location"]["lat"] says get the latitude element from within the corresponding location element (the complete expression is collected below).
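Putting those steps together, the complete expression to use when adding the latitude column is:

value.parseJson().results[0]["geometry"]["location"]["lat"]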

Here’s what we get – a new column containing the latitude, as provided by the geocoder API from the postcode:

newcolgeo

Hopefully you can see how we would be able to create a new column containing the longitude: go back to the Googlegeo column, select Edit column – Add column based on this column for a second time, and construct the appropriate parsing statement (value.parseJson().results[0]["geometry"]["location"]["lng"]).

If OpenRefine is new to you, it’s worth reviewing what steps are involved in this recipe:

– loading in an XML data file from a web address/URL;
– generating a set of URLs that call a remote API/data source based on the contents of a particular column, and loading data from those URLs into a new column;
– parsing the data returned from the API to extract a particular data element and adding it to a new column.

Many of these steps are reusable in other contexts, of course…;-) (Can you think of any?!)

PS Tom Morris pointed out to me (on the School of Data mailing list) that the Google Maps geocoding API terms of use only allow the use of the API for geocoding points in order to locate them on a Google Map… Which is probably not what you’d want this data for…;-) However, the recipe is a general one – if you can find another geocoding service that accepts an escaped address term as part of a URL and returns geo-data in a JSON format, you should be able to work out how to use that API instead using this self-same recipe… (Just build a different URL and parse the returned JSON appropriately…)

I’ll try to keep a list of example API URLs for more open geocoders here:

http://open.mapquestapi.com/nominatim/v1/?format=json&q=London (no postcodes?)
http://mapit.mysociety.org/postcode/MK7%206AA (UK postcode geocoder)

First Dabblings with the Gateway to Research API Using OpenRefine

Quick notes from a hackday exploring the Research Councils UK Gateway to Research, a search tool and API over project data relating to UK Research Council project funding awards made since 2006. As an opener, here’s a way of grabbing data related to a particular institution (the OU, for example) using OpenRefine, having a quick peek at it, and then pulling down and processing data related to each of the itemised grant awards.

So let’s do a search for the Open University, and limit by organisation:

Gtr Open university search

Here’s the OU’s organisation “homepage” on GtR:

gtr ou homepage

A unique ID associated with the OU, combined with the organisation path element defines the URL for the organisation page:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24

Just add a .json or .xml suffix to get a link to a machine readable version of the data:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json

(Why don’t the HTML pages also include a link to the corresponding JSON or XML versions of the page? As Chris Gutteridge would probably argue for, why don’t the HTML pages also support the autodiscovery of machine friendly versions of the page?)

Noting that there are of the order of 400 results for projects associated with the OU, and further that the API returns paged results containing at most 100 results per page, we can copy a URL from the browser address/location bar to pull back 100 results, and tweak it to grab the JSON:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json?page=1&selectedSortableField=startDate&selectedSortOrder=DESC&fetchSize=100

(You need to “de-end-slash” the URL – i.e. remove the / immediately before the “?”.)

To grab additional pages, simply increase the page count.
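With of the order of 400 results and a fetch size of 100, that means four URLs in all, varying only the page parameter; for example, the second page would be:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json?page=2&selectedSortableField=startDate&selectedSortOrder=DESC&fetchSize=100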

In OpenRefine, we can load the JSON in from four URLs:

OpenRefine import data

Yep – looks like the data has been grabbed okay…

open refine - the data has been grabbed

We can now select which blocks of data we want to import as unique rows…

open refine  - identify the import records

OpenRefine now shows a preview – looks okay to me…

Open Refine - check the preview

Having got the data, we can take the opportunity to explore it. If we cast the start and end date columns to date types:

Open refine Cast to date

We can then use a time facet to explore the distribution of bids with a particular start date, for example:

Open Refine - timeline facet

Hmmm – no successful bids through much of 2006, and none in 2007? Maybe the data isn’t all there yet? (Or maybe there is another identifier minted for the OU and the projects from the missing period are associated with that?)

What else…? How about looking at the distribution of funding amounts – we can use a numeric facet for that:

open refine funding facet

One of the nice features of OpenRefine is that we can use the facet sliders to limit the displayed rows to show only projects with a start date in a particular range, for example, and/or project funding values within a particular range.

How about the funders – and maybe the number of projects per funder? A text facet can help us there:

open refine text facet

Again, we can use the automatically generated facet control to then filter rows to show only projects funded by the EPSRC, for example. Alternatively, in conjunction with other facet controls, we might limit the rows displayed to only those projects funded for less than £20,000 by the ESRC during 2010-11, and so on.

So that’s one way of using Open Refine – to grab and then familiarise ourselves with a dataset. We might also edit the column names to something more meaningful, and then export the data (as a CSV or Excel file, for example) so that we don’t have to grab the data again.

If we return to the Gateway to Research, we notice that each project has its own page too (we might also have noticed this from the URL column in the data we downloaded):

gtr - project data

The data set isn’t a huge one, so we should be okay using OpenRefine as an automated downloader, or scraper, pulling down the project data for each OU project.

Open Refine - add col by URL

We need to remember to tweak the contents of the URL column, which currently points to the project HTML page, rather than the JSON data corresponding to that page:

open refine scraper downloader

Data down:-) Now we need to parse it and create columns based on the project data download:

open refine - data down - need to parse it

One thing it might be useful to have is the project’s Grant Reference number. This requires a bit of incantation… Here’s what the structure of the JSON looks like (in part):

gtr project json

Here’s how we can pull out the Grant Reference:

open refine parse json

Let’s briefly unpick that:

value.parseJson()['projectComposition']['project']['grantReference']

The first part – value.parseJson() – gets the text string represented as a JSON object tree. Then we can start to climb down the tree from its root (the root is at the top of the tree…;-)

Here are a few more incantations we can cast to grab particular data elements out:

  • the abstract text: value.parseJson()['projectComposition']['project']['abstractText']
  • the name of the lead research organisation: value.parseJson()['projectComposition']['leadResearchOrganisation']['name']

open refine new cols

We can also do slightly more elaborate forms of parsing… For example, we can pull out the names of people identified as PRINCIPAL INVESTIGATOR:

forEach(filter(value.parseJson()['projectComposition']['projectPerson'],v,v['projectRole'][0]=='PRINCIPAL_INVESTIGATOR'),x,x['name']).join('|')

We can also pull out a list of the Organisations associated with the project:

forEach(value.parseJson()['projectComposition']['organisation'],v,v['name']).join('|')

open refine - more data extracts

Noting that some projects have more than one organisation associated with them, we might want to get a form of the data in which each row only lists a single organisation, the other column values being duplicated for the same project.

The Open Refine “Split Multi-Valued Cells” function will generate new rows from rows where the organisations column contains more than one organisation.

open refine split multi valued cells

We need to split on the | delimiter I specified in the organisations related expression above.

open refine split separated

The result of the split is a set of new rows – but lots of empty columns…

open refine the result of split

We can fill in the values by using “Fill Down” on columns we want to be populated:

open refine fill down

Magic:

open refine filled

I can use the same Fill Down trick to populate blank rows in the Grant Reference, PI and abstract columns, for example.

The final step is to export our data, using a custom export format:

open refine custom export

This exporter allows us to select – and order – the columns we want to export:

open refine exporter col select

We can also select the output format:

open refine export format select

Once downloaded, we can import the data into other tools. Here’s a peek at loading it into RStudio:

Importing into R

and then viewing it:

data in R

NOTE: I haven’t checked any of this, just blog walked through it..!;-)

A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page

Here’s a quick walkthrough of how to use OpenRefine to prepare a simple data file. The original data can be found on a web page that looks like this (h/t The Gravesend Reporter):

polling station list

Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.

Here’s how I want the data to look after I’ve cleaned it:

data in a fusion table

Can you see what needs to be done? Somehow, we need to:

– remove the blank lines;
– generate a second column containing the name of the ward each address applies to;
– remove the colon from the ward name;
– remove the rows that contained the original ward names.

If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:

polling stations

We can also paste the data into a new OpenRefine Project:

paste data into OpenRefine

We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:

OpenRefine parse line data

But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?

OpenRefine data in - more cleaning required

Let’s start by filtering the data to only show rows containing the headers, which we noted we could identify because those rows are the only ones to contain a colon character. Then we can create a second column that duplicates these values.

cleaning data part 1

Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.

open refine leave the data the same
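The expression to enter in the new column dialog is simply the default:

value

which copies the contents of each cell in the original column, unchanged, into the new Wards column.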

If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.

delete the filter

Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.

fill down

If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.

reset filter, star rows, then remove them...

Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…

we're going to tidy

…by replacing the colon with nothing (an empty string).

tidy the column
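A minimal sketch of the transform expression (entered via the cell transform dialog on the Wards column) is:

value.replace(":", "")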

Here’s the data – neat and tidy:-)

Neat and tidy...

To finish, let’s export the data.

prepare to export

How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).

upload to fusion table

And here it is:-)

data in a fusion table

So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)

A Wrangling Example With OpenRefine: Making “Oven Ready Data”

As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:

onthewight results

There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:

Slightly tidier

Or how about something like this, that shows the size of the electorate for each ward:

turnout

So how can we get from the OnTheWight results into a ready data format?

Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-c); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:

Paste data into OpenRefine

* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:

Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714

Votes won by figures – filled in
Lots of the ‘Votes won by figures’ had the wrong number in them. It’s one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn’t possible.

‘Postal votes (inc)’ line inserted between ‘Total votes cast’ and ‘Papers spoilt’

Deleted an empty row from Ventnor West

The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.

import data as TSV no blanks

You also need to ensure that the Parse cell text into number/dates… option is selected.

OpenRefine

Here’s what we end up with:

ELection data raw import

The data format I want has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:

Copy a column

Let’s define the new column as having exactly the same value as the original column:

Create new col as copy of old

Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)

The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:

Facet the numbers

Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.

IW_results_txt_-_OpenRefine

Then remove the numeric facet filter:

filter update

Here’s the result, much tidier:

Much tidier

Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the Text filter to highlight the rows:

Select OpenRefine filter

Simply state the value you want to filter on and blitz the matching rows…

Choose rows then blitz them

…then remove the filter:

then remove the filter

We can now fill down the blanks in the Electoral Division column:

Fill down on Electoral Division

Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.

Filled down - now flag unwanted row

Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:

facet on flagged row

Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.

Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve already used before – facet by numeric:

Remove header rows

And then this time just retain the non-numeric rows.

Electoral ward properties

Hmmm..before we remove it, this data could be worth keeping too in its own right? Let’s rename the columns:

Rename column

Like so:

columns renamed

Now let’s just make those comma mangled numbers into numbers, by transforming them:

transform the cells by removeing commas

The transform we’re going to use is to replace the comma by nothing:

replace comma

Then convert the values to a number type.

then convert to number
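As a sketch, the two steps amount to applying transforms along the lines of the following to the cells in the column:

value.replace(",", "")
value.toNumber()

(or the equivalent one-liner, value.replace(",", "").toNumber()).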

We can then do the same thing for the Number on Roll column:

reuse is good

We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.

Final stages of electoral division data

There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage bearing line, so again we can flag these, do an All facet on them, remove all rows and then remove the flag facet.

a little more tidying to do

Having done that, we can take the opportunity to export the data.

openrefine exporter

Using the custom tabular exporter, we can select the columns we wish to export.

Export column select

Then we can export the data to the desktop as a file in a variety of formats:

OpenRefine export download

Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:

OpenRefine upload to google

Here’s the data I uploaded.

If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)

change filter settings

When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:

reorder columns

Here’s what I uploaded to a Google spreadsheet:

Spreadsheet

[OpenRefine project file]

So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…

Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes oven ready, data ready, and is relatively easy to work with.

Questioning Election Data to See if It Has a Story to Tell

I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…

So here’s one example of a data-related question we might ask (where in this case by “data” I mean information that is a) available in electronic form, and b) can be represented in a structured way): are the candidates standing in different seats local to that ward/electoral division? By “local”, I mean – can they vote in that ward by virtue of having a home address that lies within that ward?

Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):

IW council - notice of poll

Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:

scraperwiki candidates

So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, we might also be able to look for patterns suggestive of a beneficial or detrimental effect of living within, or outside of, the electoral division a candidate is standing in, and so on.

In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some of the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)

Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:

scraperwiki CSV export

In OpenRefine, we can Create a New Project and then import the data directly:

openrefine import from URL

The data is in comma separated CSV format, so let’s specify that:

import as csv comma separated

We can then name and create the project and we’re ready to start…

…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.

Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.

mapit postcode lookup

And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:

mapit json

Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT

Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT

The %20 is a character encoding for a space. In this case, we can also use a +.

So – to get information about the electoral division an address lays in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)

Hmmm… I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.)

To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.

What we want to do then is to create a new, derived column based on the address:

Add derived column

And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:

grab a postcode
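The expression behind that step is a sketch along the lines of:

value.split(",")[-1].strip()

split() breaks the address into a list of chunks at each comma, [-1] picks the last chunk by counting back from the end of the list, and strip() removes any whitespace around it.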

Here’s the result…

postcodes...

Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:

col from URL

When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:

get data from URL
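The expression used to build the web address is along the lines of (a sketch):

"http://mapit.mysociety.org/postcode/" + escape(value, "url")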

The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.

When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…

lots of data

We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:

that's handy...

We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:

filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )[0]['name']

to create a new column containing the electoral division:

parse out the electoral division

Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:

inward or out
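A sketch of the sort of expression used for the new column (assuming the original scraped column is called “ward”, as above) might be:

if(value.trim().toLowercase() == cells["ward"].value.trim().toLowercase(), "in ward", "out of ward")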

If we collapse down the spare columns, we get a clearer picture:

collapse...

Like this:

summary data

If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):

facet on inout

We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…

inout facet filter

Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)

Hmm.. labour out

There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a staggered bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson’s Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?

One more thing we could try to do is to geocode the postcode of the address of each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into forms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:

'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')

os postcode lookup

The data, as imported from the Ordnance Survey, looks something like this:

OS data

As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.

Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart Harrison/@pezholio, ex of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.

The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:

easting northing lat -long

Here’s what the returned lat/long data looks like:

lat-long json

We can then create a new column derived from this JSON data by parsing it as follows:

parse latlong to lat

A similar trick can be used to generate a column containing just the longitude data.
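I haven’t reproduced the exact expressions here, but assuming the returned JSON includes top-level lat and lng elements, the parsing expressions would be sketches along the lines of:

value.parseJson()["lat"]
value.parseJson()["lng"]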

We can then export a view over the data to a CSV file, or direct to Google Fusion tables.

postcode lat long export

With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:

Fusion table edit column

Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:

fusion table config cols to location type

We can inspect the location data using a more convenient “natural” view over it…

fusion table add map

By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:

havenstreet candidates

(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Table lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)

We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:

refine filter by division

Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “non-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:

postcode missing...

Ah – oops… It seems we have a “data quality” issue, albeit a minor one…

So, what do we learn from all this? One take away for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possibly! caveat emptor applies!) that might confirm, help develop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.

Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine

As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?

Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.

libdems million jobs

If we view source on the page, we can see where the map – and maybe the data being used to power it – comes from…

libdems jobs view source

Ah ha – it’s an embedded map from a Google Fusion Table…

https://www.google.com/fusiontables/embedviz?q=select+col0+from+1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ&viz=MAP&h=false&lat=52.5656923458786&lng=-1.0353351498047232&t=1&z=7&l=col0&y=2&tmplt=3

We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and popping it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:

https://www.google.com/fusiontables/DataSource?docid=1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ

Lib dems jobs Fusion tables

The description data is curtailed, but we can see the full description on the card view:

Lib dems fusion tables card

Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.

lib dems jobs data in OpenRefine

This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.

Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:

OpenRefine transpose cell rows to cols2

And as if by magic, we have recreated a workable table:-)

Openrefine rows transposed to cols

If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?):

openrefinelibdem jobs text facet

If we peer a bit more closely, we see that some of the numbers relating to job site locations as referred to in the description don’t seem to tally with the number of markers? So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?

counting job sites

What this “example” of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as well as questions that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?

Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:

libdems - have given text

Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:

import re
tmp=value
# capture the amount following "have given a £", dropping the rest of the description
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
# if the pattern didn't match, the value is unchanged - blank the cell in that case
if value==tmp: tmp=''
# strip the thousands separators so the result can later be cast to a number
tmp = tmp.replace(',','')
return tmp

libdems have given amount

Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of this column creation.

If we cast these values to a number:

openrefine convert given to number

we can then use a numeric facet to help us explore the amounts.

libdems value numeric facet

In this case, we notice that there weren’t that many descriptions containing the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:

  • Summary: Lib Dems in Government have secured a £73,000 grant for …
  • Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
  • Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
  • Summary: Lib Dems in Government have given £982,000 to …

Here’s an improved regular expression for parsing out some more of these amounts:

import re
tmp=value
# "...have given a £N..." or "...have given £N..."
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
# "...have secured a £N grant..."
tmp=re.sub(r'Summary: Lib Dems in Government have secured a £([0-9,\.]*).*',r'\1',tmp)
# "...have given <recipient> a £N grant..."
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)

# if none of the patterns matched, blank the cell; otherwise strip the thousands separators
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

So now we can start to identify some of the bigger grants…

libdems jobs big amounts

More to add? E.g. around:

  • ...have secured a £150,000 grant...
  • Summary: Lib Dems have given a £1,571,000 grant...
  • Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
  • Here’s another for a ‘possible spend’ column? Summary: Lib Dems in Government have allocated £300,000 to...

Note: once you start poking around at these descriptions, you find a wealth of things like: “Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone, helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes.” Leading us to ask the question: how many of the “one million jobs” arise from improvements to road junctions…?

how many jobs from road junction improvements?

In order to address this question, we might start to have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:

import re
tmp=value
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

Lib dems jobs created

If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:

  • £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
  • £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant

But that’s for another day…

If you want to play with the data yourself, you can find it here.

Do Road Improvements *Really* Create Jobs?

Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone, helping to reduce journey times and create 10,400 new jobs. Really? 10,400 new jobs?

In Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine I had a little poke around some of the data that was used to power a map on a Lib Dems’ website, A Million Jobs:

Liberal Democrats have helped businesses create over 1 million new private sector jobs. Click on the map below to find out what we’ve done where you live.

And then there was the map…

libdems million jobs

One thing we might take away from this as an assumption is that the markers correspond to locations or environs where jobs were created, and that by adding up the number of jobs created at those locations, we would get to a number over a million.

Whilst I was poking through the data that powers the map, I started to think this might be an unwarranted assumption. I also started to wonder how the “a million jobs” figure was actually calculated.

Using a recipe described in the Critiquing Data Stories post, I pulled out marker descriptions containing the phrase “helping to reduce journey” along with the number of jobs created (?!) associated with those claims, where a number was specified.

Lib Dems Road jobs

Claims were along the lines of:

Summary: Lib Dems in Government have allocated £2,600,000 to fund the A38 Markeaton improvements , helping to reduce journey times and create 12,300 new jobs. The project will also help build 3,300 new homes.

Note that as well as claims about jobs, we can also pull out claims about homes.

Extract homes

If we use OpenRefine’s Custom Tabular Exporter to upload the data to a Google spreadsheet (here), we can use the Google Spreadsheet-as-a-database query tool (as described in Asking Questions of Data – Garment Factories Data Expedition) to sum the total number of jobs “created” by road improvements. (From the OpenRefine treatment, I had observed that the rows were all distinct – the count of each text facet was 1.)

Simple job count query
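The query itself is just a simple aggregation using the Google Visualization API query language; assuming the jobs count had ended up in, say, column C of the spreadsheet (the column letter here is illustrative), it amounts to something along the lines of:

select sum(C)

with a corresponding sum over the homes column giving the second figure.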

The sum of jobs “created”? 468,184. A corresponding sum for the number of homes gives 203,976.

Looking at the refrain through the descriptions, we also notice that the claim is along the lines of: “Lib Dems in Government have allocated £X to fund [road improvement], helping to reduce journey times and create Y new jobs. The project will also help build Z new homes.” Has allocated. So it’s not been spent yet? [T]o create Y new jobs. So they haven’t been created yet? And if those jobs are the result of other schemes made possible by road improvements, will the numbers be double counted? [W]ill also help build. So the homes haven’t been built yet, but may well be being claimed as achievements elsewhere?

Note that the numbers I calculated are lower bounds, based on scheme descriptions that contained the specified search phrase (“helping to reduce journey”) and a jobs number specified according to the pattern detected by the following Jython regular expression:

import re
tmp=value
#Pull out the number in phrases of the form "... create N new jobs" or "... creating N new jobs"
tmp=re.sub(r'.* creat(e|ing) ([0-9,\.]*) new jobs.*',r'\2',tmp)
#If the pattern didn't match, leave the cell empty; otherwise strip the thousands separators
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

In addition, the housing numbers were extracted only from rows where a number of jobs was identified by that regular expression, and where they were described in a way that could be extracted using the following Jython regular expression: re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*',r'\1',tmp)
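Written out as a full column generator, in the same style as the jobs expression, that homes extraction would look something like this:

import re
tmp=value
#Pull out the number in phrases of the form "The project will also help build N new homes"
tmp=re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*',r'\1',tmp)
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp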

PS I’m reading The Smartest Guys in the Room at the moment, learning about the double counting and accounting creativity employed by Enron, and how confusing publicly reported figures often went unchallenged…

It also makes me wonder whether phrases like “up to” provide numbers that are then used when calculating totals?

City Deal jobs - up to

So there’s another phrase to look for, maybe: “have agreed a new ‘City Deal’ with”…

Generating Sankey Diagrams from rCharts

A couple of weeks or so ago, I picked up an inlink from an OCLC blog post about Visualizing Network Flows: Library Inter-lending. The post made use of Sankey diagrams to represent borrowing flows, and by implication suggested that the creation of such diagrams is not as easy as it could be…

Around the same time, @timelyportfolio posted a recipe showing how to wrap custom charts so that they can be called from rCharts, the amazing R library that wraps a range of Javascript graphics libraries (more about this in a forthcoming post somewhere…). rCharts Extra – d3 Horizon Conversion provides a walkthrough demonstrating how to wrap a d3.js implemented horizon chart so that it can be generated from R with what amounts to little more than a single line of code. So I idly tweeted a thought wondering how easy it would be to run through the walkthrough and try wrapping a Sankey diagram in the same way (I didn’t have time to try it myself at the time.)

Within a few days, @timelyportfolio had come up with the goods – Exploring Networks with Sankey and then a further follow on post: All My Roads Lead Back to Finance–PIMCO Sankey. The code itself can be found at https://github.com/timelyportfolio/rCharts_d3_sankey

Somehow, playtime has escaped me for the last couple of weeks, but I finally got round to trying the recipe out. The data I opted for is energy consumption data for the UK, published by DECC, detailing energy use in 2010.

As ever, we can’t just dive straight into the visualisation – we need to do some work first to get it into shape… The data came as a spreadsheet with the following table layout:

Excel - copy data

The Sankey diagram generator requires data in three columns – source, target and value – describing what to connect to what and with what thickness line. Looking at the data, I thought it might be interesting to try to represent as flows the amount of each type of energy used by each sector relative to end use, or something along those lines (I just need something authentic to see if I can get @timelyportfolio’s recipe to work;-) So it looks as if some shaping is in order…
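By way of illustration, the sort of table the plotter wants to end up with looks something like this (the labels and numbers here are made up, purely to show the shape):

source    target    value
nodeA     nodeX     100
nodeA     nodeY     250
nodeB     nodeX     50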

To tidy and reshape the data, I opted to use OpenRefine, copying and pasting the data into a new OpenRefine project:

Refine - paste DECC energy data

The data is tab separated and we can ignore empty lines:

Refine - paste import settings (DECC)

Here’s the data as loaded. You can see several problems with it: numbers that have commas in them; empty cells marked as blank or with a -; empty/unlabelled cells.

DECC data as imported

Let’s make a start by filling in the blank cells in the Sector column – Fill down:

DECC data fill down

We don’t need the overall totals because we want to look at piecewise relations (and if we do need the totals, we can recalculate them anyway):

DECC filter out overall total

To tidy up the numbers so they actually are numbers, we’re going to need to do some transformations:

DECC need to clean numeric cols

There are several things to do: remove commas, remove the - signs, and cast things as numbers:

DECC clean numeric column

value.replace(',','') says replace commas with an empty string (ie nothing – delete the comma).

We can then pass the result of this transformation into a following step – replace the - signs: value.replace(',','').replace('-','')

Then turn the result into a number: value.replace(',','').replace('-','').toNumber()

If there’s an error, note that we select to set the cell to a blank.

Having run this transformation on one column, we can select Transform on another column and just reuse the transformation (remembering to set the cell to blank if there is an error):

DECC number cleaner reuse

To simplify the dataset further, let’s get rid of the other totals data:

DECC remove data column

Now we need to reshape the data – ideally, rather than having columns for each energy type, we want to relate the energy type to each sector/end use pair. We’re going to have to transpose the data…

DECC start to reshape

So let’s do just that – wrap columns down into new rows:

DECC data transposition

We’re going to need to fill down again…

DECC need to fill down again

So now we have our dataset, which can be trivially exported as a CSV file:

DECC export as CSV
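As an aside, if you’d rather do this reshaping step in R once the numeric columns have been cleaned, reshape2’s melt function gets you to much the same long format; decc.wide here is just an assumed name for the cleaned, wide-format table:

#An alternative reshaping route in R, assuming a cleaned wide data frame decc.wide
#with columns assumed to be named Sector and Enduse, followed by one column per energy type
library(reshape2)
decc.long = melt(decc.wide,
                 id.vars = c('Sector', 'Enduse'),
                 variable.name = 'EnergyType',
                 value.name = 'value')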

Data cleaning and shaping phase over, we’re now ready to generate the Sankey diagram…

As ever, I’m using RStudio as my R environment. Load in the data:

R import DECC data
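For the record, the import itself needs nothing more exotic than a read.csv call (the filename here is illustrative – use whatever you saved the OpenRefine export as):

#Load the CSV file exported from OpenRefine into a data frame
#(the filename is illustrative)
DECC.overall.energy = read.csv('DECC_overall_energy.csv')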

To start, let’s do a little housekeeping:

#Here's the baseline column naming for the dataset
colnames(DECC.overall.energy)=c('Sector','Enduse','EnergyType','value')

#Inspired by @timelyportfolio - All My Roads Lead Back to Finance–PIMCO Sankey
#http://timelyportfolio.blogspot.co.uk/2013/07/all-my-roads-lead-back-to-financepimco.html

#Now let's create a Sankey diagram - we need to install RCharts
##http://ramnathv.github.io/rCharts/
require(rCharts)

#Download and unzip @timelyportfolio's Sankey/rCharts package
#Take note of where you put it!
#https://github.com/timelyportfolio/rCharts_d3_sankey

sankeyPlot <- rCharts$new()

#We need to tell R where the Sankey library is.
#I put it as a subdirectory to my current working directory (.)
sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')

#We also need to point to an HTML template page
sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")

Having got everything set up, we can cast the data into the form the Sankey template expects – with source, target and value columns identified:

#The plotting routines require column names to be specified as:
##source, target, value
#to show what connects to what and by what thickness line

#If we want to plot from enduse to energytype we need this relabelling
workingdata=DECC.overall.energy
colnames(workingdata)=c('Sector','source','target','value')

Following @timelyportfolio, we configure the chart and then open it to a browser window:

sankeyPlot$set(
  data = workingdata,
  nodeWidth = 15,
  nodePadding = 10,
  layout = 32,
  width = 750,
  height = 500,
  labelFormat = ".1%"
)

sankeyPlot

Here’s the result:

Basic sankey DECC

Let’s make plotting a little easier by wrapping that routine into a function:

#To make things easier, let's abstract a little more...
sankeyPlot=function(df){
  sankeyPlot <- rCharts$new()
  
  #--------
  #See note in PPS to this post about a simplification of this part....
  #We need to tell R where the Sankey library is.
  #I put it as a subdirectory to my current working directory (.)
  sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
  
  #We also need to point to an HTML template page
  sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
  #---------
 
  sankeyPlot$set(
    data = df,
    nodeWidth = 15,
    nodePadding = 10,
    layout = 32,
    width = 750,
    height = 500,
    labelFormat = ".1%"
  )
  
  sankeyPlot
}

Now let’s try plotting something a little more adventurous:

#If we want to add in a further layer, showing how each Sector contributes
#to the End-use energy usage, we need to additionally treat the Sector as
#a source and the sum of that sector's energy use by End Use
#Aggregate over the original data frame, which still has the original column names
sectorEnergy=aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(sectorEnergy)=c('source','target','value')

#We can now generate a single data file combining all source and target data
energyfull=subset(workingdata,select=c('source','target','value'))
energyfull=rbind(energyfull,sectorEnergy)

sankeyPlot(energyfull)

And the result?

Full Sankey DECC

Notice that the bindings are a little bit fractured – for example, the Heat block has several contributions from the Gas block. This also suggests that a Sankey diagram, at least as configured above, may not be the most appropriate way of representing the data in this case. Sankey diagrams are intended to represent flows, which means that there is a notion of some quantity flowing between elements, and further that that quantity is conserved as it passes through each element (sum of inputs equals sum of outputs).

A more natural story might be to show Energy type flowing to end use and then out to Sector, at least if we want to see how energy is tending to be used for what purpose, and then how end use is split by Sector. However, such a diagram would not tell us, for example, that Sector X was dominated in its use of energy source A for end use P, compared to Sector Y mainly using energy source B for the same end use P.
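Just to show what that alternative layering might look like in code, here’s a quick sketch using the same aggregation approach as above (a sketch, not a definitive view of the data):

#Sketch of the energy type -> end use -> sector layering
#First layer: how much of each energy type goes to each end use
typeToEnduse = aggregate(value ~ EnergyType + Enduse, DECC.overall.energy, sum)
colnames(typeToEnduse) = c('source', 'target', 'value')

#Second layer: how much of each end use is accounted for by each sector
enduseToSector = aggregate(value ~ Enduse + Sector, DECC.overall.energy, sum)
colnames(enduseToSector) = c('source', 'target', 'value')

sankeyPlot(rbind(typeToEnduse, enduseToSector))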

One approach we might take to tidying up the chart to make it more readable (for some definition of readable!), though at the risk of making it even more misleading, is to do a little bit more aggregation of the data, and then bind appropriate blocks together.

With a few simple aggregations, we can also explore other relationships and trivially generate corresponding Sankey diagram views over them:

#How much energy does each sector use for each end use
enduseBySector=aggregate(value ~ Sector + Enduse, DECC.overall.energy, sum)
colnames(enduseBySector)=c('source','target','value')
sankeyPlot(enduseBySector)

#Swap source and target to plot the same relationship the other way round
colnames(enduseBySector)=c('target','source','value')
sankeyPlot(enduseBySector)

#How much of each energy type is associated with each enduse
energyByEnduse=aggregate(value ~ EnergyType + Enduse, DECC.overall.energy, sum)
colnames(energyByEnduse)=c('source','target','value')

sankeyPlot(energyByEnduse)

So there we have it – quick and easy Sankey diagrams from R using rCharts and a magic recipe from @timelyportfolio :-)

PS The following routine makes it easier to get the data into the appropriately named format:

#This routine makes it easier to get the data for plotting as a Sankey diagram
#Select the source, target and value column names explicitly to generate a dataframe containing
#just those columns, appropriately named.
sankeyData=function(df,colsource='source',coltarget='target',colvalue='value'){
  sankey.df=subset(df,select=c(colsource,coltarget,colvalue))
  colnames(sankey.df)=c('source','target','value')
  sankey.df
}

#For example:
data.sdf=sankeyData(DECC.overall.energy,'Sector','EnergyType','value')
data.sdf

The code automatically selects the appropriate columns and renames them as required.
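The two helpers compose naturally, so, for example, something like the following should regenerate essentially the same end use to energy type plot as the first example:

#Regenerate the enduse -> energy type view using the two helper functions
sankeyPlot(sankeyData(DECC.overall.energy, 'Enduse', 'EnergyType', 'value'))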

PPS it seems that a recent update(?) to the rCharts library by @ramnath_vaidya now makes things even easier and removes the need to download and locally host @timelyportfolio’s code:

#We can remove the local dependency and replace the following...
#sankeyPlot$setLib('./rCharts_d3_sankey-gh-pages/')
#sankeyPlot$setTemplate(script = "./rCharts_d3_sankey-gh-pages/layouts/chart.html")
##with this simplification
sankeyPlot$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey')
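Folding that into the plotting function, a slimmed down version would presumably look something like the following (I haven’t checked whether the setTemplate call is still needed, so treat this as a sketch):

#Simplified plotting function using the remotely hosted Sankey library
#(a sketch based on the PPS above - the local download step should no longer be needed)
sankeyPlot=function(df){
  s <- rCharts$new()
  s$setLib('http://timelyportfolio.github.io/rCharts_d3_sankey')
  s$set(
    data = df,
    nodeWidth = 15,
    nodePadding = 10,
    layout = 32,
    width = 750,
    height = 500,
    labelFormat = ".1%"
  )
  s
}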