Archive for the ‘OpenRefine’ Category
I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…
So here’s one example of a data-related question we might ask (where in this case by data I mean “information available in: a) electronic form, that b) can be represented in a structured way): are the candidates standing in different seats local to that ward/electoral division?. By “local”, I mean – can they vote in that ward by virtue of having a home address that lays within that ward?
Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):
Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:
So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect living within, or outside of, the electoral division a candidate is standing in, and so on.
In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)
Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:
In OpenRefine, we can Create a New Project and then import the data directly:
The data is in comma separated CSV format, so let’s specify that:
We can then name and create the project and we’re ready to start…
…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.
Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.
And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:
Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT
Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT
The %20 is a character encoding for a space. In this case, we can also use a +.
So – to get information about the electoral division an address lays in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)
Hmmm…I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.
To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.
What we want to do then is to create a new, derived column based on the address:
And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:
Here’s the result…
Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:
When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:
The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.
When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…
We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:
We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:
filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )['name']
to create a new column containing the electoral division:
Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:
If we collapse down the spare columns, we get a clearer picture:
If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):
We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…
Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)
There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a staggered bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson's Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?
One more thing we could try to do is to geocode the postcode of the address of the each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into froms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:
'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')
The data, as imported from the Ordnance Survey, looks something like this:
As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.
Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart harrison/@pezholio, ex- of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.
The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:
Here’s what the returned lat/long data looks like:
A similar trick can be used to generate a column containing just the longitude data.
We can then export a view over the data to a CSV file, or direct to Google Fusion tables.
With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:2222
Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:
We can inspect the location data using a more convenient “natural” view over it…
By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:
(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Table lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)
We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:
Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “none-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:
Ah – oops… It seems we have a “data quality” issue, although albeit a minor one…
So, what do we learn from all this? One take away for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possbily! caveat emptor applies!) that might confirm, helpdevelop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.
As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:
There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:
Or how about something like this, that shows the size of the electorate for each ward:
So how can we get from the OnTheWight results into a ready data format?
Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-v); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:
* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:
_Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714
_Votes won by figures - filled in
Lots of the 'Votes won by figures' had the wrong number in them. It's one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn't possible.
'Postal votes (inc)' line inserted between 'Total votes cast' and 'Papers spoilt'
Deleted an empty row from Ventnor West
The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.
Here’s what we end up with:
The data format I want is has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:
Let’s define the new column as having exactly the same value as the original column:
Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)
The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:
Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.
Then remove the numeric facet filter:
Here’s the result, much tidier:
Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the TExt filter to highlight the rows:
Simply state the value you want to filter on and blitz the matching rows…
…then remove the filter:
We can now fill down a the blanks in the Electoral Division column:
Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.
Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:
Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.
Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve already used before – facet by numeric:
And then this time just retain the non-numeric rows.
Hmmm..before we remove it, this data could be worth keeping too in its own right? Let’s rename the columns:
Now let’s just make those comma mangled numbers into numbers, by transforming them:
The transform we’re going to use is to replace the comma by nothing:
Then convert the values to a number type.
We can the do the same thing for the Number on Roll column:
We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.
There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage bearing line, so again we can flag these, do an All facet on them, remove all rows and then remove the flag facet.
Having done that, we can take the opportunity to export the data.
Using the custom tabular exporter, we can select the columns we wish to export.
Then we can export the data to the desktop as a file in a variety of formats:
Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:
Here’s the data I uploaded.
If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)
When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:
Here’s what I uploaded to a Google spreadsheet:
So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…
Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes oven ready, data ready, and is relatively easy to work with.
Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.
Here’s how I want the data to look after I’ve cleaned it:
Can you see what needs to be done? Somehow, we need to:
- remove the blank lines;
- generate a second column containing the name of the ward each address applies to;
- remove the colon from the ward name;
- remove the rows that contained the original ward names.
If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:
We can also paste the data into a new OpenRefine Project:
We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:
But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?
Let’s start by filtering to data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.
Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.
If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.
Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.
If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.
Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…
…by replacing the colon with nothing (an empty string).
Here’s the data – neat and tidy:-)
To finish, let’s export the data.
How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).
And here it is:-)
So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)
One of the many ways of using
Faceted search or faceted browsing/navigation typically provides a set of limiting search filters to a set of search results that limits or restricts the displayed results to ones that fulfil certain conditions. In a library catalogue, the facets might refer to metadata fields such as publication date, thus allowing a user to search within a given date range, or publisher:
Where the facet relates to a categorical variable – that is, where there is a set of unique values that the facet can take (such as the names of different publishers) – a view of the facet values will show the names of the different publishers extracted from the original search results. Selecting a particular publisher, for example, will then limit the displayed results to just those results associated with that publisher. For numerical facets, where the quantities associated with the facet related to a number or date (that is, a set of things that have a numerical range), the facet view will show the full range of values contained within that particular facet. The user can then select a subset of results that fall within a specified part of that range.
In the case of Open Refine, facets can be defined on a per column basis. For categorical facets, Refine will identify the set of unique values associated with a particular faceted view that are contained within a column, along with a count of how many times each facet value occurs throughout the column. The user can then choose to view only those rows with a particular (facet selected) value in the faceted column. For columns that contain numbers, Refine will generate a numerical facet that spans the range of values contained within the column, along with a histogram that provides a count of occurrences of numbers within small ranges across the full range.
So what faceting options does Google Refine provide?
Here’s how they work (data used for the examples comes from Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi and JSON import from the Twitter search API…):
Sometimes it can be useful to view rows associated with particular facet values that occur a particular number of times, particulalry at the limits (for example, very popular facet values, or uniquely occurring facet values):
- looking at the distribution over time of column contents using the timeline facet:
Faceting by time requires time-related strings to be parsed as such; sometimes, Refine needs a little bit of help in interpreting an imported string as a time string. So for example, given a “time” string such as Mon, 29 Oct 2012 10:56:52 +0000 from the Twitter search API, we can use the GREL function toDate(value,"EEE, dd MMM y H:m:s") to create a new column with time-cast elements.
(See GRELDateFunctions and the Java SimpleDateFormat class documentation for more details.)
- getting a feel for the correlation of values across numerical columns, and exploring those correlations further, using the scatterplot facet.
This generates a view that generates a set of scatterplots relating to pairwise combinations of all the numerical columns in the dataset:
Clicking on one of these panels allows you to filter points within a particular area of the corresponding scatter chart (click and drag a rectangular area over the points you want to view), effectively allowing you to filter the data across related ranges of two numerical columns at the same time:
A range of customisable faceting options are also provided that allow you to define your own faceting functions:
- the Custom text… facet;
- the Custom Numeric… facet
More conveniently, a range of predefined Customized facets are provided that provide shortcuts to “bespoke” faceting functions:
So for example:
- the word facet splits strings contained in cells into single words, counts their occurrences throughout the column, and then lists unique words and their occurrence count in the facet panel. This faceting option thus provides a way of selecting rows where the contents of a particular column contain one or more specified words. (The user defined GREL custom text facet ngram(value,1) provides a similar (though not identical) result – duplicated words in a cell are identified as unique by the single word ngram function; see also split(value," "), which does seem to replicate the behaviour of the word facet function.)
- the duplicates facet returns boolean values of true and false; filtering on true values returns all the rows that have duplicated values within a particular column; filtering on false displays all unique rows.
- the text length facet produces a facet based on the character count(?) of strings in cells within the faceted column; the custom numeric facet length(value) achieves something similar; the related measure, word count, can be achieved using the custom numeric facet length(split(value," "))
Note that facet views can be combined. Selecting multiple rows within a particular facet panel provides a Boolean OR over the selected values (that is, if any of the selected values appear in the column, the corresponding rows will be displayed). To AND conditions, even within the same facet, create a separate facet panel for each ANDed condition.
PS On the OpenRefine (was Google Refine) name change, see From Freebase Gridworks to Google Refine and now OpenRefine. The code repository is now on github: OpenRefine Repository. I also notice that openrefine.org/ has been minted and is running a placeholder instance of WordPress. I wonder if it would be worth setting up an aggregator for community posts, a bit like R-Blogger (for example, I have an RStats category feed from this blog that I syndicate to the RBloggers aggregator, and have just created an OpenRefine category that could feed a OpenRefinery aggregator channel).
PPS for an example of using OpenRefine to find differences between two recordsets, see Owen Stephens’ Using Open Refine for e-journal data.