A Wrangling Example With OpenRefine: Making “Oven Ready Data”
As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:
There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:
Or how about something like this, that shows the size of the electorate for each ward:
So how can we get from the OnTheWight results into a ready data format?
Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-v); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:
* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:
_Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714
_Votes won by figures - filled in
Lots of the 'Votes won by figures' had the wrong number in them. It's one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn't possible.
'Postal votes (inc)' line inserted between 'Total votes cast' and 'Papers spoilt'
Deleted an empty row from Ventnor West
The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.
Here’s what we end up with:
The data format I want is has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:
Let’s define the new column as having exactly the same value as the original column:
Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)
The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:
Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.
Then remove the numeric facet filter:
Here’s the result, much tidier:
Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the TExt filter to highlight the rows:
Simply state the value you want to filter on and blitz the matching rows…
…then remove the filter:
We can now fill down a the blanks in the Electoral Division column:
Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.
Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:
Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.
Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve already used before – facet by numeric:
And then this time just retain the non-numeric rows.
Hmmm..before we remove it, this data could be worth keeping too in its own right? Let’s rename the columns:
Now let’s just make those comma mangled numbers into numbers, by transforming them:
The transform we’re going to use is to replace the comma by nothing:
Then convert the values to a number type.
We can the do the same thing for the Number on Roll column:
We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.
There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage bearing line, so again we can flag these, do an All facet on them, remove all rows and then remove the flag facet.
Having done that, we can take the opportunity to export the data.
Using the custom tabular exporter, we can select the columns we wish to export.
Then we can export the data to the desktop as a file in a variety of formats:
Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:
Here’s the data I uploaded.
If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)
When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:
Here’s what I uploaded to a Google spreadsheet:
So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…
Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes oven ready, data ready, and is relatively easy to work with.