A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page

Here’s a quick walkthrough of how to use OpenRefine to prepare a simple data file. The original data can be found on a web page that looks like this (h/t/ The Gravesend Reporter):

polling station list

Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.

Here’s how I want the data to look after I’ve cleaned it:

data in a fusion table

Can you see what needs to be done? Somehow, we need to:

– remove the blank lines;
– generate a second column containing the name of the ward each address applies to;
– remove the colon from the ward name;
– remove the rows that contained the original ward names.

If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:

polling stations

We can also paste the data into a new OpenRefine Project:

paste data into OpenRefine

We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:

OpenRefine parse line data

But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?

OpenRefine data in - more cleaning required

Let’s start by filtering to data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.

cleaning data part 1

Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.

open refine leave the data the same

If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.

delete the filter

Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.

fill down

If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.

reset filter, star rows, then remove them...

Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…

we're going to tidy

…by replacing the colon with nothing (an empty string).

tidy the column

Here’s the data – neat and tidy:-)

Neat and tidy...

To finish, let’s export the data.

prepare to export

How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).

upload to fusion table

And here it is:-)

data in a fusion table

So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

One thought on “A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page”

Comments are closed.