A Wrangling Example With OpenRefine: Making “Oven Ready Data”

As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:

onthewight results

There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:

Slightly tidier

Or how about something like this, that shows the size of the electorate for each ward:

turnout

So how can we get from the OnTheWight results into a ready data format?

Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-c); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:

Paste data into OpenRefine

* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:

_Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714

_Votes won by figures – filled in
Lots of the ‘Votes won by figures’ had the wrong number in them. It’s one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn’t possible.

‘Postal votes (inc)’ line inserted between ‘Total votes cast’ and ‘Papers spoilt’

Deleted an empty row from Ventnor West

The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.

import data as TSV no blanks

You also need to ensure that the Parse cell text into number/dates… option is selected.

OpenRefine

Here’s what we end up with:

ELection data raw import

The data format I want is has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:

COpy a column

Let’s define the new column as having exactly the same value as the original column:

Create new col as copy of old

Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)

The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:

Facet the numbers

Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.

IW_results_txt_-_OpenRefine

Then remove the numeric facet filter:

filter update

Here’s the result, much tidier:

Much tidier

Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the TExt filter to highlight the rows:

Selectt OpenRefine filter

Simply state the value you want to filter on and blitz the matching rows…

CHoose rows then blitz them

…then remove the filter:

then remove the filter

We can now fill down a the blanks in the Electoral Division column:

Fill down on Electoral Division

Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.

Filled down - now flag unwanted row

Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:

facet on flagged row

Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.

Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve already used before – facet by numeric:

Remove header rows

And then this time just retain the non-numeric rows.

Electoral ward properties

Hmmm..before we remove it, this data could be worth keeping too in its own right? Let’s rename the columns:

Rename column

Like so:

columns renamed

Now let’s just make those comma mangled numbers into numbers, by transforming them:

transform the cells by removeing commas

The transform we’re going to use is to replace the comma by nothing:

replace comma

Then convert the values to a number type.

then convert to number

We can the do the same thing for the Number on Roll column:

reuse is good

We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.

Final stages of electroal division data

There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage bearing line, so again we can flag these, do an All facet on them, remove all rows and then remove the flag facet.

a little more tidying to do

Having done that, we can take the opportunity to export the data.

openrefine exporter

Using the custom tabular exporter, we can select the columns we wish to export.

Export column select

Then we can export the data to the desktop as a file in a variety of formats:

OPenrefine export download

Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:

OPenRefine upload to goole

Here’s the data I uploaded.

If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)

change filter settings

When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:

reorder columns

Here’s what I uploaded to a Google spreadsheet:

Spreadsheet

[OpenRefine project file]

So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…

Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes oven ready, data ready, and is relatively easy to work with.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: