Keeping Up With OpenRefine – Database Connections

It’s been a few months since I last checked out updates to OpenRefine, but reading a (completed) phase 1 project plan associated with some funding the OpenRefine Foundation received from Google News Labs it looks like database support is on the cards.

Database Table import/export – COMPLETED

Historically, OpenRefine has been limited compared to other data tools in that it does not have a way to connect to a database table. This is especially useful at export time, when there is a need to save a cleaned CSV for example into a database table. Importing from a database is useful also. It can help to join clean data in a database table against messy data in OpenRefine, in order to clean and prepare it for use. Database Drivers exist for many databases such as Oracle, MySQL, Postgres, and even many schema-less databases such as MongoDB. Most database drivers use JDBC which makes it easier for us to develop against, and others typically use a custom Java driver that sometimes is non-trivial to integrate with. Since OpenRefine is built with Java this should be relatively straightforward to utilize existing JDBC drivers for our import/export operations and for support of MongoDB there is a Java driver available.

Looking through the repo, it looks like there are a couple of related PRs:

I’m not sure about the export to a db?
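While export support is unclear, one workaround is to export a cleaned CSV from OpenRefine and load it into a database table outside OpenRefine. The following is a minimal sketch using Python's built-in sqlite3 module; it is not the OpenRefine database extension, and the function name, table name and sample values are all hypothetical.

```python
import csv
import io
import sqlite3


def load_csv_into_table(conn, csv_text, table):
    """Create `table` from the CSV header row and insert every data row."""
    rows = list(csv.reader(io.StringIO(csv_text)))
    header, data = rows[0], rows[1:]
    # Quote identifiers so column names containing spaces survive
    cols = ", ".join('"{}"'.format(c.replace('"', '""')) for c in header)
    placeholders = ", ".join("?" for _ in header)
    conn.execute('CREATE TABLE "{}" ({})'.format(table, cols))
    conn.executemany('INSERT INTO "{}" VALUES ({})'.format(table, placeholders), data)
    conn.commit()
    return len(data)


# Hypothetical cleaned export from OpenRefine
cleaned = "Practice_Code,Year\nJ84001,2014\nJ84002,2014\n"
conn = sqlite3.connect(":memory:")
inserted = load_csv_into_table(conn, cleaned, "complaints")
```

The same pattern should carry over to PostgreSQL or MySQL with the appropriate Python driver, though placeholder syntax differs between drivers.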

The tests suggest drivers are in place for PostgreSQL, MySQL and MariaDB:

public class DatabaseTestConfig extends DBExtensionTests {

private DatabaseConfiguration mysqlDbConfig;
private DatabaseConfiguration pgsqlDbConfig;
private DatabaseConfiguration mariadbDbConfig;

It also looks like an upgrade to the internal data representation may be being considered: Research Apache Arrow to improve in-memory data model. FWIW, I think Apache Arrow really is one to watch.

Via the OpenRefine Google Group, I also noticed a couple of references to future planned activity / roadmap items:

Phase 2

Front / Backend separation

Scope: completely separating the backend so that a full API can be exposed for all OpenRefine operations and commands. Once the decoupling is done, we can move to a modern front end framework and
Deliverable: Functional and documented API covering all the commands available in OpenRefine 3 front end.

Phase 3
R Lang support
Work with community to bring support for R lang via an extension.
https://github.com/OpenRefine/OpenRefine/issues/1226
There is significant use of statistics within News Organizations where the goal of minimizing the back and forth between R tooling and OpenRefine would be explored and assessed by the community.

rrefine is around and needs investigation – https://github.com/vpnagraj/rrefine

Hmmm… rrefine?

rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby. rrefine extends this functionality to R users.

Okay – that makes me think of the OpenRefine Python Client Library?

But how about that Edit cells > Transform > Language support for R #1226 issue? “This is a feature-request to add R support in Edit cells > Transform > Language.”

That fits in with an earlier thought I had along the lines of “what if OpenRefine was a Jupyter client?” In an imagining frame of mind, this seems to me to offer a couple of potential benefits:

  • if the Transform > Language utility supports hooks into a Jupyter kernel and exposes an executable code cell onto that (state persisting) kernel, and the data can be transferred efficiently using serialisations like feather or deeper hooks into Apache Arrow representations that might be supported in R or Python pandas, then any language with a Jupyter kernel could be used for transformations?
  • if OpenRefine was exposed as a panel in Jupyterlab, which it presumably could be simply by embedding the HTML UI in an IFrame, then it could have a role as part of the look and feel of a single working environment, even if it was only loading and saving CSV files into the environment workspace.

But then let’s imagine something a bit more extreme (I’m not sure if / how this might fit into the Jupyterlab architecture, indeed whether it’s possible or just imagine magic, I’m just riffing…): if the data being manipulated within OpenRefine could be synched with a representation of the data being manipulated elsewhere in the Jupyterlab environment, then we could be viewing a dataset in one panel (Jupyterlab has crazy efficient support for viewing large datafiles), manipulating it in an OpenRefine panel, and running analysis scripts over it in a third. The reticulate package suddenly comes to mind here as an example of accessing data objects from one environment in another.

It also strikes me that use cases of the data represented in OpenRefine reflecting updates to the data from the analysis environment are less likely. The analysis should be operating on data after it has been cleaned, rather than passing it to OpenRefine?

PS by the by, if you want to run OpenRefine using the Jupyter ecosystem Binderhub machinery, here’s a proof of concept from @betatim: openrefineder.

Converting Spreadsheet Rows to Text Based Summary Reports Using OpenRefine

In Writing Each Row of a Spreadsheet as a Press Release? I demonstrated how we could generate a simple textual report template that could “textualise” separate rows of a spreadsheet. This template could be applied to each row from a subset of rows to produce a simple human readable view of the data contained in each of those rows. I picked up on the elements of this post in Robot Journalists or Robot Press Secretaries?, where I reinforced the idea that such an approach was of a similar kind to the approach used in mail merge strategies supported by many office suites.

It also struck me that we could use OpenRefine’s custom template export option to generate a similar sort of report. So in this post I’ll describe a simple recipe for recreating the NHS Complaints review reports from a couple of source spreadsheets using OpenRefine.

This is just a recasting of the approach demonstrated in the Writing Each Row… post, and more fully described in this IPython notebook, so even if you don’t understand Python, it’s probably worth reviewing those just to get a feeling of the steps involved.

To start with, let’s see how we might generate a basic template from the complaints CSV file, loaded in with the setting to parse numerical columns as such.

OpenRefine

The default template looks something like this:

default template

We can see how the template provides a header slot for the start of the output, a template applied to each row, a separator to split the rows, and a footer.

The jsonize function makes sure the output is suitable for output as a JSON file. We just want to generate text so we can forget that.

Here’s the start of a simple report…

Report for {{cells["Practice_Code"].value}} ({{cells["Year"].value}}):

  Total number of written complaints received:
  - by area: {{cells["Total number of written complaints received"].value}} (of which, {{cells["Total number of written complaints upheld"].value}} upheld)
  - by subject: {{cells["Total number of written complaints received 2"].value}} (of which, {{cells["Total number of written complaints upheld 2"].value}} upheld)

custom_export _start

The double braces ({{ }}) allow you to embed GREL expressions. Outside the braces, the content is treated as text.
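To make the mechanics of the templating exporter concrete, here is a minimal Python sketch of the same idea: a row template is filled in once per row, and the results are joined with a separator between an optional header and footer. It only handles the {{cells["..."].value}} form used above, not GREL in general, and the function names and sample values are hypothetical.

```python
import re

# Matches the {{cells["Column name"].value}} form used in the template above
CELL_REF = re.compile(r'\{\{cells\["([^"]+)"\]\.value\}\}')


def render_row(template, row):
    """Replace each cell reference with the value from `row` (a dict)."""
    return CELL_REF.sub(lambda m: str(row[m.group(1)]), template)


def export(rows, row_template, prefix="", separator="\n", suffix=""):
    """Mimic the exporter's four slots: prefix, row template, separator, suffix."""
    return prefix + separator.join(render_row(row_template, r) for r in rows) + suffix


template = 'Report for {{cells["Practice_Code"].value}} ({{cells["Year"].value}}):'
rows = [
    {"Practice_Code": "J84001", "Year": 2014},  # hypothetical values
    {"Practice_Code": "J84002", "Year": 2014},
]
report = export(rows, template, separator="\n\n")
```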

Note that the custom template doesn’t get saved… I tend to write the custom templates in a text editor, then copy and paste them into OpenRefine.

We can also customise the template with some additional logic using the if(CONDITION, TRUE_ACTION, FALSE_ACTION) construction. For example, we might flag a warning that a lot of complaints were upheld:

openrefine template warning
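The same kind of conditional can be sketched in Python to show the logic. The 50% threshold and the warning text are assumptions for illustration, not values taken from the original template.

```python
def upheld_warning(received, upheld, threshold=0.5):
    """Return a warning string when the upheld share exceeds `threshold`.

    Mirrors the shape of if(CONDITION, TRUE_ACTION, FALSE_ACTION): a single
    expression that yields either the warning text or an empty string.
    The default threshold is an illustrative assumption.
    """
    if received == 0:
        return ""  # avoid dividing by zero when no complaints were received
    share = upheld / received
    return "WARNING: high proportion of complaints upheld" if share > threshold else ""
```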

The original demonstration pulled in additional administrative information (practice name and address, for example) from another source spreadsheet. Merging Datasets with Common Columns in Google Refine describes a recipe for merging in data from another dataset. In this case, if our source is the epraccur spreadsheet, we can create an OpenRefine project from the epraccur spreadsheet (use no lines as the header – it doesn’t have a header row) and then merge in data from the epraccur project into the complaints project using the practice code (Column 1 in the epraccur project) as the key column used to add an additional practice name column based on the Practice_Code column in the complaints project – cell.cross("epraccur xls", "Column 1").cells["Column 2"].value[0]

Note that columns can only be merged in one column at a time.
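For anyone unsure what the cross lookup is doing, the following Python sketch shows the equivalent key-based merge on lists of dictionaries. It adds one column at a time, uses only the first match for each key, and gives rows with no match a value of None (an assumption of this sketch). The function name and sample values are hypothetical.

```python
def cross_lookup(rows, key_column, other_rows, other_key, other_value, new_column):
    """Add `new_column` to each row by looking up its key in `other_rows`."""
    index = {}
    for other in other_rows:
        # setdefault keeps the first match, like taking element [0]
        index.setdefault(other[other_key], other[other_value])
    return [{**row, new_column: index.get(row[key_column])} for row in rows]


# Hypothetical sample data
complaints = [{"Practice_Code": "J84001"}, {"Practice_Code": "J84999"}]
epraccur = [{"Column 1": "J84001", "Column 2": "Example Surgery"}]
merged = cross_lookup(complaints, "Practice_Code", epraccur, "Column 1", "Column 2", "Practice_Name")
```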

In order to filter the rows so we can generate reports for just the Isle of Wight, we also need to merge in the Parent Organisation Code (Column 15) from the epraccur project. To get Isle of Wight practices, we could then filter on code 10L. If we then used our custom exporter template, we could get just textual reports for the rows corresponding to Isle of Wight GP practices.

nhs openrefine filter

Teasing things apart a bit, we also start to get a feel for a more general process. Firstly, we can create a custom export template to generate a textual representation of each row in a dataset. Secondly, we can use OpenRefine’s filtering tools to select which rows we want to generate reports from, and order them appropriately. Thirdly, we could also generate new columns containing “red flags” or news signals associated with particular rows, and produce a weighted sum column on which to rank items in terms of newsworthiness. We might also want to merge in additional data columns from other sources, and add elements from those in to the template. Finally, we might start to refine the export template further to include additional logic and customisation of the news release output.
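The weighted sum idea can be sketched in a few lines of Python. The flag names and weights below are entirely hypothetical; the point is only to show how a set of boolean red-flag columns might be combined into a single score on which to rank rows.

```python
# Hypothetical red-flag columns and weights
WEIGHTS = {"high_upheld_rate": 3, "complaints_rising": 2, "above_area_average": 1}


def newsworthiness(row, weights=WEIGHTS):
    """Sum the weights of every flag that is set (truthy) on the row."""
    return sum(weight for flag, weight in weights.items() if row.get(flag))


def rank(rows, weights=WEIGHTS):
    """Order rows from most to least newsworthy."""
    return sorted(rows, key=lambda row: newsworthiness(row, weights), reverse=True)


rows = [
    {"Practice_Code": "A", "high_upheld_rate": False, "complaints_rising": True},
    {"Practice_Code": "B", "high_upheld_rate": True, "complaints_rising": True},
    {"Practice_Code": "C"},
]
ranked = [row["Practice_Code"] for row in rank(rows)]
```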

See also Putting Points on Maps Using GeoJSON Created by Open Refine for a demo of how to generate a geojson file using the OpenRefine custom template exporter as part of a route to getting points onto a map.

OpenRefine Docker Containers

I had a go at building a couple of docker containers for OpenRefine, one from the latest release and one from the latest code on github:

In order to create the virtual machine, you should:

  • install boot2docker
  • run boot2docker
  • Either: to run with a project directory solely within the container, in the boot2docker terminal, enter the command docker run --name openrefine -d -p 3334:3333 psychemedia/openrefine
  • Or: to run with a project directory mounted from a shared folder on the host, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /path/to/yourSharedDirectory:/mnt/refine --name openrefine psychemedia/openrefine
  • Or: to run with a project directory in a linked data volume, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /mnt/refine --name openrefine psychemedia/openrefine

(To use the latest release rather than a recent build use psychemedia/docker-openrefine rather than psychemedia/openrefine.)

The port number you will be able to find OpenRefine on is given by the first number set in the flag -p NNNN:3333. To access OpenRefine via port 3334, use -p 3334:3333 etc.

OpenRefine will then be available via your browser at the URL http://IPADDRESS:NNNN. The required value of IPADDRESS can be found using the command boot2docker ip

The returned IP address (eg 192.168.59.103) is the IP address you can find OpenRefine on, for example: http://192.168.59.103:3334.

PS for a Dockerfile in the current directory, to build an image from the Dockerfile, use something like:

docker build -t myname/mycontainer .

ScreenScraping HTML Web Pages With OpenRefine – Norwegian Oil Company Data

[An old post, rescued from the list of previously unpublished posts…]

Although I use OpenRefine from time to time, one thing I don’t tend to use it for is screenscraping HTML web pages – I tend to write Python scripts in Scraperwiki to do this. Writing code is not for everyone, however, so I’ve brushed off my searches of the OpenRefine help pages to come up with this recipe for hacking around with various flavours of company data.

The setting actually comes from OpenOil’s Johnny West:

1) given the companies in a particular spreadsheet… for example “Bayerngas Norge AS” (row 6)
2) plug them into the Norwegian govt’s company registry — http://www.brreg.no/ (second search box down nav bar on the left) – this gives us corporate identifier… so for example… 989490168
3) plug that into purehelp.no — so http://www.purehelp.no/company/details/bayerngasnorgeas/989490168
4) the Aksjonærer at the bottom (the shareholders that hold that company) – their percentages
5) searching OpenCorporates.com with those names to get their corporate identifiers and home jurisdictions
6) mapping that back to the spreadsheet in some way… so for each of the companies with their EITI entry we get their parent companies and home jurisdictions

Let’s see how far we can get…

To start with, I had a look at the two corporate search sites Johnny mentioned. Hacking around with the URLs, there seemed to be a couple of possible simplifications:

– a company ID lookup can be constructed around http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS – the link structure has changed since I originally wrote this post; the correct form is now http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS&orgform=0&fylke=0&kommune=0&barebedr=false [h/t Larssen in the comments.]

http://www.purehelp.no/company/details/989490168 (without company name in URL) appears to work ok, so can get there from company number.

Loading the original spreadsheet data into OpenRefine gives us a spreadsheet that looks like this:

openRefine xls import

So that’s step 1…

We can run step 2 as follows* – create a new column from the company column:

* see the end of the post for an alternative way of obtaining company identifiers using the OpenCorporates reconciliation API…

openRefine add new col

Here’s how we construct the URL:

OpenRefine - get new col by URL
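For comparison, the same search URL can be built in Python using the standard library. This is a sketch of the URL pattern quoted earlier in the post rather than the GREL expression used in OpenRefine; the function name is hypothetical.

```python
from urllib.parse import urlencode

BASE = "http://w2.brreg.no/enhet/sok/treffliste.jsp"


def brreg_search_url(company_name):
    """Build the company search URL, encoding spaces in the name as '+'."""
    params = {
        "navn": company_name,
        "orgform": 0,
        "fylke": 0,
        "kommune": 0,
        "barebedr": "false",
    }
    return BASE + "?" + urlencode(params)


url = brreg_search_url("Bayerngas Norge AS")
```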

The HTML is a bit of a mess, but by Viewing Source on an example page, we can find a crib that leads us close to the data we require, specifically the fragment detalj.jsp?orgnr= in the URL of the first of the href attributes of the result links.

table to scrape - crib

Using that crib, we can pull out the company ID and the company name for the first result, constructing a name/id pair as follows:

[value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() , value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() ].join('::')

The first part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() – pulls out the company ID from the first search result, extracting it from the URL fragment.

The second part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() – pulls out the company name from the first search result.

We place these two parts into an array and then join them with two colons: [].join('::')

This keeps things tidy and allows us to check by eye that sensible company names have been found from the original search strings.
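The same extraction can be sketched in Python with the standard library HTML parser, which may help if you want to check the logic outside OpenRefine. The HTML fragment below is a made-up stand-in for the search results page, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser

PREFIX = "detalj.jsp?orgnr="


class FirstResultParser(HTMLParser):
    """Capture the href and text of the first <a> whose href starts with PREFIX."""

    def __init__(self):
        super().__init__()
        self.org_id = None
        self.name = None
        self._capturing = False

    def handle_starttag(self, tag, attrs):
        href = dict(attrs).get("href") or ""
        if tag == "a" and self.org_id is None and href.startswith(PREFIX):
            self.org_id = href.replace(PREFIX, "")
            self.name = ""
            self._capturing = True

    def handle_data(self, data):
        if self._capturing:
            self.name += data

    def handle_endtag(self, tag):
        if tag == "a":
            self._capturing = False


def name_id_pair(html):
    """Return 'ID::name' for the first matching result, or '' if there is none."""
    parser = FirstResultParser()
    parser.feed(html)
    if parser.org_id is None:
        return ""
    return "::".join([parser.org_id, parser.name.strip()])


# Made-up fragment standing in for the search results page
sample = (
    '<table><tr><td><a href="detalj.jsp?orgnr=989490168">BAYERNGAS NORGE AS</a></td></tr>'
    '<tr><td><a href="detalj.jsp?orgnr=123456789">ANOTHER COMPANY AS</a></td></tr></table>'
)
pair = name_id_pair(sample)
```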

open refine - compare names

We can now split the name/ID pair column into two separate columns:

openRefine spilt column into cols

And the result:

openrefne  cols now split

The next step, step 3, requires looking up the company IDs on purehelp. We’ve already seen how a new column can be created from a source column by URL, so we just repeat that approach with a new URL pattern:

openrefine add another col by URL

(We could probably reduce the throttle time by an order of magnitude!)

The next step, step 4, is to pull out shareholders and their percentages.

The first step is to grab the shareholder table and each of the rows, which in the original looked like this:

shareholders table

The following hack seems to get us the rows we require:

[REMOVED]

BAH – crappy page sometimes has TWO companyOwnership IDs, when the company has shareholdings in other companies as well as when it has shareholders:-(

fckwt

So much for unique IDs… ****** ******* *** ***** (i.e. not happy:-(

Need to search into table where “Shareholders” is specified in top bar of the table, and I don’t know offhand how to do that using the GREL recipe I was taking because the HTML of the page is really horrible. Bah…. #ffs:-(

Question, in GREL, how do I get the rows in this not-a-table? I need to specify the companyOwnership id in the parent div, and check for the Shareholders text() value in the first child, then ideally miss the title row, then get all the shareholder companies (in this case, there’s just one; better example):

<div id="companyOwnership" class="box">
	<div class="boxHeader">Shareholders:</div>
	<div class="boxContent">
		<div class="row rowHeading">
			<label class="fl" style="width: 70%;">Company name:</label>
			<label class="fl" style="width: 30%;">Percentage share (%)</label>
			<div class="cb"></div>
		</div>
		<div class="row odd">
			<label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label>
			<div class="fr" style="width: 30%;">100.00%</div>
			<div class="cb"></div>
		</div>
	</div>

For now I’m going to take a risky shortcut and assume that the Shareholders (are there always shareholders?) are the last companyOwnership ID on the page:

forEach(value.parseHtml().select('div[id=companyOwnership]')[-1].select('div.row'),e,e).join('::')

openrefine last company ownership
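Outside OpenRefine, the less risky selection (pick the companyOwnership box whose header says Shareholders, skip the heading row, then read each remaining row) can be sketched in Python with the standard library HTML parser. This is an illustration rather than a GREL solution; the first box in the sample, including its header text, is invented to represent the second kind of box, and the class and function names are hypothetical.

```python
from html.parser import HTMLParser


class OwnershipParser(HTMLParser):
    """Collect the header and data rows of every div with id="companyOwnership"."""

    def __init__(self):
        super().__init__()
        self.boxes = []     # finished boxes: {"header": str, "rows": [[cell, ...]]}
        self.stack = []     # class lists of the currently open <div> elements
        self.box = None     # box currently being read, if any
        self.box_depth = 0  # stack depth at which the current box was opened
        self.row = None     # cells of the data row currently being read
        self.capture = None # "header" or "cell" while collecting text

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div":
            self.stack.append(classes)
            if self.box is None and attrs.get("id") == "companyOwnership":
                self.box = {"header": "", "rows": []}
                self.box_depth = len(self.stack)
            elif self.box is not None:
                if "boxHeader" in classes:
                    self.capture = "header"
                elif "row" in classes and "rowHeading" not in classes:
                    self.row = []  # start a data row, skipping the heading row
                elif "fr" in classes and self.row is not None:
                    self.row.append("")
                    self.capture = "cell"
        elif tag == "label" and self.row is not None:
            self.row.append("")
            self.capture = "cell"

    def handle_data(self, data):
        if self.capture == "header":
            self.box["header"] += data
        elif self.capture == "cell":
            self.row[-1] += data

    def handle_endtag(self, tag):
        if tag == "label":
            self.capture = None
        elif tag == "div" and self.stack:
            classes = self.stack.pop()
            if self.box is None:
                return
            if len(self.stack) < self.box_depth:  # the box itself has closed
                self.boxes.append(self.box)
                self.box = None
            elif "row" in classes and self.row is not None:
                self.box["rows"].append([cell.strip() for cell in self.row])
                self.row = None
            else:
                self.capture = None


def shareholder_rows(html):
    """Return the data rows of the box whose header starts with 'Shareholders'."""
    parser = OwnershipParser()
    parser.feed(html)
    for box in parser.boxes:
        if box["header"].strip().startswith("Shareholders"):
            return box["rows"]
    return []


# The first box is invented; the second follows the fragment shown above
sample = """
<div id="companyOwnership" class="box">
  <div class="boxHeader">Owns shares in:</div>
  <div class="boxContent">
    <div class="row odd"><label class="fl">Some Subsidiary AS</label><div class="fr">50.00%</div><div class="cb"></div></div>
  </div>
</div>
<div id="companyOwnership" class="box">
  <div class="boxHeader">Shareholders:</div>
  <div class="boxContent">
    <div class="row rowHeading"><label class="fl">Company name:</label><label class="fl">Percentage share (%)</label><div class="cb"></div></div>
    <div class="row odd"><label class="fl">Shell Exploration And Production Holdings</label><div class="fr">100.00%</div><div class="cb"></div></div>
  </div>
</div>
"""
rows = shareholder_rows(sample)
```

Selecting by header text rather than by position avoids relying on the Shareholders box always being the last one on the page.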

We can then generate one row for each shareholder in OpenRefine:

open refine - spilt

(We’ll need to do some filling in later to cope with the gaps, but no need right now. We also picked up the table header, which has been given its own row, which we’ll have to cope with at some point. But again, no need right now.)

For some reason, I couldn’t parse the string for each row (it was late, I was confused!) so I hacked this piecemeal approach to try to take them by surprise…

value.replace(/\s/,' ').replace('<div class="row odd">','').replace('<div class="row even">','').replace('<form>','').replace('<label class="fl" style="width: 70%;">','').replace('<div class="cb"></div>','').replace('</form> </div>','').split('</label>').join('::')

horrible hack openrefine

Using the trick we previously applied to the combined name/ID column, we can split these into two separate columns, one for the shareholder and the other for their percentage holding (I used possibly misleading column names below – should say “Shareholder name”, for example, rather than shareholding 1?):

openrefine column split

We then need to tidy the two columns:

value.replace("<\/a>",'').replace(/.*>/,'')

Note that some of the shareholder companies have identifiers in the website we scraped the data from, and some don’t. We’re going to be wasteful and throw the info away that links the company if it’s there…

value.replace('<div class="fr" style="width: 30%;">','').replace('</div>','').strip()

We now need to do a bit more tidying – fill down on the empty columns in the shareholder company column and also in the original company name and ID [actually – this is not right, as we can see below for the line Altinex Oil Norway AS…? Maybe we can get away with it though?], and filter out the rows that were generated as headers (text facet then select out blank and Fimanavn).

This is what we get:

COmpany ownership
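The fill down operation used in the tidying step copies the last non-blank value in a column into the blank cells beneath it. A minimal Python sketch of that behaviour, with a hypothetical function name and made-up rows, looks like this:

```python
def fill_down(rows, column):
    """Copy the last non-blank value of `column` into the blank cells below it."""
    last = None
    filled = []
    for row in rows:
        value = row.get(column)
        if value in (None, ""):
            row = {**row, column: last}  # copy the row so the input is not mutated
        else:
            last = value
        filled.append(row)
    return filled


# Made-up rows: one company with two shareholders, then a second company
rows = [
    {"company": "Company A AS", "shareholder": "Holder 1"},
    {"company": "", "shareholder": "Holder 2"},
    {"company": "Company B AS", "shareholder": "Holder 3"},
]
filled = fill_down(rows, "company")
```

As the caveat above suggests, filling down is only safe if every blank cell really does belong to the record above it.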

We can then export this file, before considering steps 5 and 6, using the custom exporter:

open refine exporter

Select the columns – including the check column of the name of the company we discovered by searching on the names given in the original spreadsheet… these are the names that the shareholders actually refer to…

column export

And then select the export format:

column export format

Here’s the file: shareholder data (one of the names at least appears not to have worked – Altinex Oil Norway AS). Looking at the data, I think we also need to take the precaution of using .strip() on the shareholder names.

Here’s the OpenRefine project file to this point [note the broken link pattern for brreg noted at the top of the post and in the comments… The original link will be the one used in the OpenRefine project…]

Maybe export on a filtered version where Shareholding 1 is not empty. Also remove the percentage sign (%) in the shareholding 2 column? Also note that Andre is “Other”… maybe replace this too?

In order to get the OpenCorporates identifiers, we should be able to just run company names through the OpenCorporates reconciliation service.

Hmmm.. I wonder – do we even have to go that far? From the Norwegian company number, is the OpenCorporates identifier just that number in the Norwegian namespace? So for BAYERNGAS NORGE AS, which has Norwegian company number 989490168, can we look it up directly on OpenCorporates as http://opencorporates.com/companies/no/989490168? It seems like we can…

This means we possibly have an alternative to step 2 – rather than picking up company numbers by searching into and scraping the Norwegian company register, we can reconcile names against the OpenCorporates reconciliation API and then pick up the company numbers from there?

Diff or Chop? Github, CSV data files and OpenRefine

A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:

csv diff

With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such a view may soon feature on that site? An example of how it might work is described in James Smith’s ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.

* Though not column sortable? I guess that would detract from Github’s view of showing files as is…? For more discussion on the rationale for a “Github for data”, see for example Rufus Pollock’s posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.

So far, so esoteric, perhaps. Because you may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare “original” data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.

One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:

openrefine history

Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in this case might be the diff view – that is, a view that highlights the differences between the version of the data file as it is at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff’ed data file at the current step compared to the original datafile) using a simple toggle selector.

A more elaborate approach may allow us to view diffs between the data file at the current step and the previous step, or the current data file and an arbitrary previous step.
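As a rough illustration of what such a diff view would need to compute, here is a minimal Python sketch of a cell-level diff between two versions of a table. It assumes both versions have the same rows in the same order and the same columns, which a real tabular diff tool would not require. The function name and sample data are hypothetical.

```python
def cell_diff(original, current):
    """Return (row_index, column, old, new) for every cell that differs."""
    changes = []
    for i, (old_row, new_row) in enumerate(zip(original, current)):
        for column in old_row:
            if old_row[column] != new_row.get(column):
                changes.append((i, column, old_row[column], new_row.get(column)))
    return changes


# Made-up data: one cell has had stray whitespace trimmed
original = [{"name": " Acme Ltd ", "town": "Newport"}, {"name": "Bolt plc", "town": "Ryde"}]
current = [{"name": "Acme Ltd", "town": "Newport"}, {"name": "Bolt plc", "town": "Ryde"}]
changes = cell_diff(original, current)
```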

Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:

open refine extract

This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?
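To show what recreating a file from an original plus a change history might look like, here is a minimal Python sketch that replays a list of operations over a table. The operation names and keys are modelled on my recollection of OpenRefine's extracted JSON (core/column-rename, core/column-removal) and should be treated as assumptions; only these two simple operations are handled, and the sample data is made up.

```python
import json


def apply_operations(rows, operations):
    """Replay a list of change operations over a table (a list of dicts)."""
    for op in operations:
        if op["op"] == "core/column-rename":
            old, new = op["oldColumnName"], op["newColumnName"]
            rows = [{(new if k == old else k): v for k, v in row.items()} for row in rows]
        elif op["op"] == "core/column-removal":
            name = op["columnName"]
            rows = [{k: v for k, v in row.items() if k != name} for row in rows]
        else:
            raise ValueError("unsupported operation: " + op["op"])
    return rows


# Hypothetical change history (the chops file) and original data
history = json.loads("""[
  {"op": "core/column-rename", "oldColumnName": "Column 1", "newColumnName": "Practice_Code"},
  {"op": "core/column-removal", "columnName": "Column 3"}
]""")
original = [{"Column 1": "J84001", "Column 2": "Example Surgery", "Column 3": "x"}]
result = apply_operations(original, history)
```

Storing the original file plus the history is enough to regenerate the changed file, which is why the two might usefully be checked in together.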

For example, OpenRefine already integrates with Google Docs to allow users to import and export files from that service.

OPen Refine export to google

So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file)? Note that we might also have to extend the JSON representation, or add another file fragment to the check-in, that associates a particular chops file with a particular checkout version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files and their associated chops files?)

For OpenRefine to identify which file or files are the actual data files to be pulled from a particular Github repository may at first sight appear problematic, but again the ecosystem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don’t change, the datapackage file might be annotated with a reference to the appropriate chops file?)
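As a sketch of how an application might discover the data files, the following Python reads a data package descriptor and lists the resources it declares. The descriptor follows the general resources/name/path shape of a data package file; the file names are made up, and the chops key is a purely hypothetical annotation of the kind speculated about above, not part of any specification.

```python
import json


def list_data_files(descriptor_text):
    """Return (name, path, chops) for each resource that declares a path."""
    descriptor = json.loads(descriptor_text)
    return [
        (res.get("name"), res["path"], res.get("chops"))
        for res in descriptor.get("resources", [])
        if "path" in res
    ]


# Hypothetical datapackage.json contents
descriptor_text = """{
  "name": "example-package",
  "resources": [
    {"name": "complaints", "path": "data/complaints.csv", "chops": "chops/complaints.json"},
    {"name": "epraccur", "path": "data/epraccur.csv"}
  ]
}"""
files = list_data_files(descriptor_text)
```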

And as far as ecosystems go, there are other pieces of the jigsaw already in place, such as James Smith’s Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.

Do Road Improvements *Really* Create Jobs?

Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone, helping to reduce journey times and create 10,400 new jobs. Really? 10,400 new jobs?

In Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine I had a little poke around some of the data that was used to power a map on a Lib Dems’ website, A Million Jobs:

Liberal Democrats have helped businesses create over 1 million new private sector jobs. Click on the map below to find out what we’ve done where you live.

And then there was the map…

libdems million jobs

One thing we might take away from this as an assumption is that the markers correspond to locations or environs where jobs were created, and that by adding up the number of jobs created at those locations, we would get to a number over a million.

Whilst I was poking through the data that powers the map, I started to think this might be an unwarranted assumption. I also started to wonder about how the “a million jobs” figure was actually calculated?

Using a recipe described in the Critiquing Data Stories post, I pulled out marker descriptions containing the phrase “helping to reduce journey” along with the number of jobs created (?!) associated with those claims, where a number was specified.

Lib Dems Road jobs

Claims were along the lines of:

Summary: Lib Dems in Government have allocated £2,600,000 to fund the A38 Markeaton improvements , helping to reduce journey times and create 12,300 new jobs. The project will also help build 3,300 new homes.

Note that as well as claims about jobs, we can also pull out claims about homes.

Extract homes

If we use OpenRefine’s Custom Tabular Exporter to upload the data to a Google spreadsheet (here) we can use the Google Spreadsheet-as-a-database query tool (as described in Asking Questions of Data – Garment Factories Data Expedition) to sum the total number of jobs “created” by road improvements (from the OpenRefine treatment, I had observed the rows were all distinct – the count of each text facet was 1).

SImple job count query

The sum of jobs “created”? 468,184. A corresponding sum for the number of homes gives 203,976.

Looking at the refrain through the descriptions, we also notice that the claim is along the lines of: “Lib Dems in Government have allocated £X to fund [road improvement] helping to reduce journey times and create Y new jobs. The project will also help build Z new homes.” Has allocated. So it’s not been spent yet? [T]o create Y new jobs. So they haven’t been created yet? And if those jobs are the result of other schemes made possible by road improvements, numbers will be double counted? [W]ill also help build. So the homes haven’t been built yet, but may well be being claimed as achievements elsewhere?

Note that the numbers I calculated are lower bounds, based on scheme descriptions that contained the specified search phrase (“helping to reduce journey”) and job numbers specified according to the pattern detected by the following Jython regular expression:

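import re
# Pull the jobs figure out of claims such as "... create 12,300 new jobs ..."
tmp = value
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) new jobs.*', r'\2', tmp)
# If nothing was substituted, the pattern did not match: return an empty string
if value == tmp: tmp = ''
# Remove thousands separators so the result can be treated as a number
tmp = tmp.replace(',', '')
return tmp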

In addition, the housing numbers were extracted only from rows where a number of jobs was identified by that regular expression, and where they were described in a way that could be extracted using the following Jython regular expression: re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*',r'\1',tmp)
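For anyone wanting to try the extraction outside OpenRefine, here is a standalone Python sketch that applies the same two patterns to claims quoted in this post and totals the jobs figures. It uses re.search rather than re.sub, so it picks up the first match in a description rather than the last; the function name is hypothetical.

```python
import re

JOBS = re.compile(r' creat(?:e|ing) ([0-9,\.]*) new jobs')
HOMES = re.compile(r' The project will also help build ([0-9,\.]*) new homes')


def extract_number(pattern, text):
    """Return the captured figure as an int, or None when it cannot be found."""
    match = pattern.search(text)
    if not match:
        return None
    digits = match.group(1).replace(",", "")
    return int(digits) if digits.isdigit() else None


claims = [
    "Summary: Lib Dems in Government have allocated £2,600,000 to fund the A38 "
    "Markeaton improvements , helping to reduce journey times and create 12,300 "
    "new jobs. The project will also help build 3,300 new homes.",
    "Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 "
    "to 7 improvement, Maidstone, helping to reduce journey times and create "
    "10,400 new jobs.",
]
jobs = [extract_number(JOBS, claim) for claim in claims]
homes = [extract_number(HOMES, claim) for claim in claims]
total_jobs = sum(n for n in jobs if n is not None)
```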

PS I’m reading The Smartest Guys in the Room at the moment, learning about the double counting and accounting creativity employed by Enron, and how confusing publicly reported figures often went unchallenged…

It also makes me wonder about phrases like “up to” providing numbers that are then used when calculating totals?

City Deal jobs - up to

So there’s another phrase to look for, maybe? have agreed a new ‘City Deal’ with

Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine

As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?

Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.

libdems million jobs

If we view source on the page, we can see where the map – and maybe the data being used to power it, comes from…

libdems jobs view source

Ah ha – it’s an embedded map from a Google Fusion Table…

https://www.google.com/fusiontables/embedviz?q=select+col0+from+1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ&viz=MAP&h=false&lat=52.5656923458786&lng=-1.0353351498047232&t=1&z=7&l=col0&y=2&tmplt=3

We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and popping it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:

https://www.google.com/fusiontables/DataSource?docid=1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ
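The key-grabbing step can also be scripted. This Python sketch pulls the table key out of the q parameter of the embed URL and drops it into the DataSource URL pattern shown above; the function names are hypothetical.

```python
from urllib.parse import parse_qs, urlparse


def table_key(embed_url):
    """Extract the table key from a query of the form 'select col0 from <key>'."""
    query = parse_qs(urlparse(embed_url).query)
    return query["q"][0].split()[-1]


def datasource_url(key):
    """Build the table view URL for a given key."""
    return "https://www.google.com/fusiontables/DataSource?docid=" + key


embed = (
    "https://www.google.com/fusiontables/embedviz?"
    "q=select+col0+from+1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ"
    "&viz=MAP&h=false&lat=52.5656923458786&lng=-1.0353351498047232"
    "&t=1&z=7&l=col0&y=2&tmplt=3"
)
key = table_key(embed)
url = datasource_url(key)
```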

Lib dems jobs Fusion tables

The description data is curtailed, but we can see the full description on the card view:

Lib dems fusion tables card

Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.

lib dems jobs data in OpenRefine

This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.

Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:

OpenRefine transpose cell rows to cols2

And as if by magic, we have recreated a workable table:-)

Openrefine rows transposed to cols
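
For anyone who wants to see the reshaping logic spelled out, here's a minimal Python sketch of the same idea: chunking a single column of cells into fixed-width records. The function name and the placeholder cell values are made up for illustration; this is not how OpenRefine implements its transpose tool, just the same reshaping done by hand.

```python
def rows_to_records(cells, width=3):
    """Group a flat list of cells into rows of `width` columns."""
    if len(cells) % width != 0:
        raise ValueError("cell count is not a multiple of the record width")
    return [cells[i:i + width] for i in range(0, len(cells), width)]


# Placeholder values standing in for the repeating groups of three rows.
cells = ["a1", "b1", "c1", "a2", "b2", "c2"]
records = rows_to_records(cells)
```

The length check is there because a single missing or extra row in the pasted data would otherwise shift every later record out of alignment without any warning.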

If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?):

openrefinelibdem jobs text facet

If we peer a bit more closely, we see that some of the numbers relating to job site locations as referred to in the description don’t seem to tally with the number of markers? So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?

counting job sites

What this “example” of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as well as questions that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?

Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:

libdems - have given text

Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:

import re
tmp=value
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
if value==tmp: tmp=''
tmp = tmp.replace(',','')
return tmp

libdems have given amount

Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of this column creation.

If we cast these values to a number:

openrefine convert given to number

we can then use a numeric facet to help us explore the amounts.

libdems value numeric facet

In this case, we notice that there weren’t that many distinct descriptions containing the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:

  • Summary: Lib Dems in Government have secured a £73,000 grant for …
  • Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
  • Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
  • Summary: Lib Dems in Government have given £982,000 to …

Here’s an improved regular expression for parsing out some more of these amounts:

import re
tmp=value
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have secured a £([0-9,\.]*).*',r'\1',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)

if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp
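
If you want to try these patterns outside OpenRefine, here's a rough standalone Python sketch. Rather than chaining re.sub calls, it tries a list of patterns in turn and returns the first captured amount, which makes it a bit easier to add new phrasings later. The helper name and the exact patterns are my own assumptions based on the example descriptions above (including the assumption that a £ sign always precedes the amount), not a definitive parser.

```python
import re

# Hypothetical patterns modelled on the example descriptions in the post.
AMOUNT_PATTERNS = [
    r'Summary: Lib Dems in Government have given (?:a )?£([0-9,\.]+)',
    r'Summary: Lib Dems in Government have secured a £([0-9,\.]+)',
    r'Summary: Lib Dems in Government have given [^£]* a £([0-9,\.]+) grant',
]


def extract_amount(description):
    """Return the first matching amount as a digit string, or '' if none match."""
    for pattern in AMOUNT_PATTERNS:
        match = re.match(pattern, description)
        if match:
            # Drop thousands separators and any trailing full stop.
            return match.group(1).replace(',', '').rstrip('.')
    return ''
```

Note that a description of the "secured a share of a £23,000,000 grant" kind deliberately falls through and returns an empty string, in keeping with the caution above about not treating a share of a grant as the full value.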

So now we can start to identify some of the bigger grants…

libdems jobs big amounts

More to add? eg around:
...have secured a £150,000 grant...
Summary: Lib Dems have given a £1,571,000 grant...
Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
– Here’s another for a ‘possible spend’ column? Summary: Lib Dems in Government have allocated £300,000 to...

Note: once you start poking around at these descriptions, you find a wealth of things like: “Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone , helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes.” Leading us to ask the question: how many of the “one million jobs” arise from improvements to road junctions…?

how many jobs from road junction improvements?

In order to address this question, we might start to have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:

import re
tmp=value
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

Lib dems jobs created
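
One thing to watch: the M20 description quoted above says “create 10,400 new jobs”, so a pattern that expects the number to be immediately followed by “jobs” will miss it. Here's a hedged Python sketch of a slightly more forgiving variant that allows an optional “new”, along with a matching pattern for the homes figure. The function and pattern names are illustrative assumptions, and there will be other phrasings (“up to”, for instance) that it doesn't cover.

```python
import re

# Illustrative patterns; other phrasings in the data will need their own.
JOBS_RE = re.compile(r'creat(?:e|ing) ([0-9,]+) (?:new )?jobs')
HOMES_RE = re.compile(r'build ([0-9,]+) new homes')


def extract_count(pattern, description):
    """Return the captured figure as an int, or None if the pattern is absent."""
    match = pattern.search(description)
    if match is None:
        return None
    return int(match.group(1).replace(',', ''))
```

Using search rather than anchoring with a leading .* also means we don't have to compare the result with the original value to find out whether anything matched.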

If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:

  • £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
  • £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant

But that’s for another day…

If you want to play with the data yourself, you can find it here.

Questioning Election Data to See if It Has a Story to Tell

I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…

So here’s one example of a data-related question we might ask (where in this case by data I mean “information available in: a) electronic form, that b) can be represented in a structured way”): are the candidates standing in different seats local to that ward/electoral division? By “local”, I mean – can they vote in that ward by virtue of having a home address that lies within that ward?

Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):

IW council - notice of poll

Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:

scraperwiki candidates

So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect of living within, or outside of, the electoral division a candidate is standing in? And so on.

In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some of the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)

Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:

scraperwiki CSV export

In OpenRefine, we can Create a New Project and then import the data directly:

openrefine import from URL

The data is in comma separated CSV format, so let’s specify that:

import as csv comma separated

We can then name and create the project and we’re ready to start…

…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.

Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.

mapit postcode lookup

And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:

mapit json

Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT

Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT

The %20 is a character encoding for a space. In this case, we can also use a +.

So – to get information about the electoral division an address lies in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)

Hmmm…I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.)

To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.

What we want to do then is to create a new, derived column based on the address:

Add derived column

And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:

grab a postcode
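
For comparison, here's the same split, pick the last chunk, strip recipe as a small Python sketch. The function name and the address are made up for illustration, and it rests on the same assumption as the OpenRefine step: that the postcode is always whatever follows the final comma.

```python
def extract_postcode(address):
    """Return the text after the last comma, with surrounding whitespace removed."""
    return address.split(",")[-1].strip()


# Made-up address for illustration.
example = extract_postcode("1 Example Road, Sandown, Isle of Wight, PO36 0JT ")
```

If an address has no postcode at the end, this will quietly return the last address chunk instead, so it's worth running a facet over the result to spot any values that don't look like postcodes.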

Here’s the result…

postcodes...

Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:

col from URL

When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:

get data from URL

The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.
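
As a sketch of what's going on under the hood, here's how we might build the escaped MapIt URL and pace a batch of requests in plain Python. The helper names, and the idea of passing in a `fetch` function, are my own assumptions for illustration; the half-second default simply mirrors the 500 millisecond throttle mentioned above.

```python
import time
from urllib.parse import quote

MAPIT_BASE = "http://mapit.mysociety.org/postcode/"


def mapit_url(postcode):
    """Build a MapIt lookup URL, percent-encoding the space in the postcode."""
    return MAPIT_BASE + quote(postcode.strip())


def fetch_all(urls, fetch, delay_seconds=0.5):
    """Call fetch(url) for each URL in turn, pausing between requests."""
    results = []
    for index, url in enumerate(urls):
        if index > 0:
            time.sleep(delay_seconds)  # be polite to the remote service
        results.append(fetch(url))
    return results
```

Keeping `fetch` as a parameter means the pacing logic can be tried out with a dummy function before pointing it at a real service.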

When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…

lots of data

We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:

that's handy...

We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:

filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )[0]['name']

to create a new column containing the electoral division:

parse out the electroal division
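
If the GREL looks a bit opaque, the underlying logic is just: parse the JSON, look through the areas, and pick out the name of the one whose type is UTE. Here's a minimal Python sketch of that, using a cut-down, made-up sample shaped like the MapIt response (an "areas" object keyed by area id); the ids and names in the sample are placeholders, not real MapIt data.

```python
import json


def electoral_division(mapit_json, area_type="UTE"):
    """Return the name of the first area of the given type, or None."""
    areas = json.loads(mapit_json).get("areas", {})
    for area in areas.values():
        if area.get("type") == area_type:
            return area.get("name")
    return None


# Made-up sample with the same general shape as a MapIt postcode response.
SAMPLE = json.dumps({
    "postcode": "PO36 0JT",
    "areas": {
        "1": {"id": 1, "name": "Example Unitary Authority", "type": "UTA"},
        "2": {"id": 2, "name": "Example Division", "type": "UTE"},
    },
})
```

Returning None when no matching area is found makes it easy to filter for the rows where the lookup failed, for example because the postcode was missing or malformed.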

Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:

inward or out

If we collapse down the spare columns, we get a clearer picture:

collapse...

Like this:

summary data

If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):

facet on inout

We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…

inout facet filter

Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)

Hmm.. labour out

There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a staggered bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson’s Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?
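
If pen and paper doesn't appeal, the in/out tally by party is only a few lines of Python once the data has been exported. This is a sketch under the assumption that each row has been reduced to a (party, in/out) pair; the party names and rows below are placeholders, not the actual results.

```python
from collections import Counter


def tally_in_out(rows):
    """Count 'In' and 'Out' candidates per party from (party, in_out) pairs."""
    counts = {}
    for party, in_out in rows:
        counts.setdefault(party, Counter())[in_out] += 1
    return counts


# Placeholder rows for illustration only.
rows = [("Party A", "In"), ("Party A", "Out"), ("Party B", "Out"), ("Party A", "In")]
tally = tally_in_out(rows)
```

Because each party's tally is a Counter, asking for a label a party never had (say, "In" for a party with only out-of-ward candidates) gives 0 rather than an error, which is handy when laying the counts out for a bar chart.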

One more thing we could try to do is to geocode the postcode of the address of each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into forms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:

'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')

os postcode lookup

The data, as imported from the Ordnance Survey, looks something like this:

o:sdata

As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.

Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart Harrison/@pezholio, ex- of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.

The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:

easting northing lat -long
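
Constructing that lookup URL is the same kind of string building as before. Here's a small Python sketch, assuming we've already pulled the easting and northing values out of the Ordnance Survey response; the function name is illustrative and the co-ordinates used to exercise it are arbitrary placeholders.

```python
from urllib.parse import urlencode

CONVERTER_BASE = "http://www.uk-postcodes.com/eastingnorthing.php"


def latlong_lookup_url(easting, northing):
    """Build the converter URL for one easting/northing pair."""
    query = urlencode({"easting": easting, "northing": northing})
    return CONVERTER_BASE + "?" + query
```

Using urlencode rather than pasting the numbers straight into the string means any stray whitespace or odd characters in the values get escaped rather than breaking the URL.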

Here’s what the returned lat/long data looks like:

lat-long json

We can then create a new column derived from this JSON data by parsing it as follows:

parse latlong to lat

A similar trick can be used to generate a column containing just the longitude data.

We can then export a view over the data to a CSV file, or direct to Google Fusion tables.

postcode lat long export

With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:

Fusion table edit column

Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:

fusion table config cols to location type

We can inspect the location data using a more convenient “natural” view over it…

fusion table add map

By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:

havenstreet candidates

(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Tables lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)

We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:

refine filter by division

Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “non-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:

postcode missing...

Ah – oops… It seems we have a “data quality” issue, albeit a minor one…

So, what do we learn from all this? One take away for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possibly! caveat emptor applies!) that might confirm, help develop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.

A Wrangling Example With OpenRefine: Making “Oven Ready Data”

As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:

onthewight results

There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:

Slightly tidier

Or how about something like this, that shows the size of the electorate for each ward:

turnout

So how can we get from the OnTheWight results into a ready data format?

Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-c); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:

Paste data into OpenRefine

* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:

_Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714

_Votes won by figures – filled in
Lots of the ‘Votes won by figures’ had the wrong number in them. It’s one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn’t possible.

‘Postal votes (inc)’ line inserted between ‘Total votes cast’ and ‘Papers spoilt’

Deleted an empty row from Ventnor West

The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.

import data as TSV no blanks

You also need to ensure that the Parse cell text into number/dates… option is selected.

OpenRefine

Here’s what we end up with:

ELection data raw import

The data format I want has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:

COpy a column

Let’s define the new column as having exactly the same value as the original column:

Create new col as copy of old

Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)

The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:

Facet the numbers

Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.

IW_results_txt_-_OpenRefine

Then remove the numeric facet filter:

filter update

Here’s the result, much tidier:

Much tidier

Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the Text filter to highlight the rows:

Selectt OpenRefine filter

Simply state the value you want to filter on and blitz the matching rows…

CHoose rows then blitz them

…then remove the filter:

then remove the filter

We can now fill down the blanks in the Electoral Division column:

Fill down on Electoral Division

Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.

Filled down - now flag unwanted row
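
The fill down behaviour described above is simple enough to sketch in a few lines of Python. This is my own illustration of the idea rather than OpenRefine's implementation; the function name and the placeholder division names are assumptions.

```python
def fill_down(column):
    """Replace blank cells with the most recent non-blank value above them."""
    filled = []
    last_seen = None
    for cell in column:
        if cell is None or cell == "":
            filled.append(last_seen)  # stays None if nothing has been seen yet
        else:
            last_seen = cell
            filled.append(cell)
    return filled


# Placeholder division names for illustration.
column = ["Division A", "", "", "Division B", None]
result = fill_down(column)
```

This also shows why the earlier tidying matters: any stray non-blank value left in the column (such as one of those numbers) would be copied down over the rows below it.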

Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:

facet on flagged row

Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.

Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve already used before – facet by numeric:

Remove header rows

And then this time just retain the non-numeric rows.

Electoral ward properties

Hmmm..before we remove it, this data could be worth keeping too in its own right? Let’s rename the columns:

Rename column

Like so:

columns renamed

Now let’s just make those comma mangled numbers into numbers, by transforming them:

transform the cells by removeing commas

The transform we’re going to use is to replace the comma by nothing:

replace comma

Then convert the values to a number type.

then convert to number
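
The two steps (strip the commas, then cast to a number) look like this as a Python sketch. The function name is made up, and the fallback to a float is my own addition for values that carry a decimal point.

```python
def to_number(text):
    """Convert a string like '1,234' to a number, ignoring thousands separators."""
    cleaned = text.replace(",", "").strip()
    try:
        return int(cleaned)
    except ValueError:
        return float(cleaned)
```

Anything that still isn't numeric after the commas are removed will raise a ValueError, which is a useful prompt to go back and look at that cell rather than silently carrying on.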

We can then do the same thing for the Number on Roll column:

reuse is good

We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.

Final stages of electroal division data

There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage bearing line, so again we can flag these, do an All facet on them, remove all rows and then remove the flag facet.

a little more tidying to do

Having done that, we can take the opportunity to export the data.

openrefine exporter

Using the custom tabular exporter, we can select the columns we wish to export.

Export column select

Then we can export the data to the desktop as a file in a variety of formats:

OPenrefine export download

Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:

OPenRefine upload to goole

Here’s the data I uploaded.

If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)

change filter settings

When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:

reorder columns

Here’s what I uploaded to a Google spreadsheet:

Spreadsheet

[OpenRefine project file]

So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…

Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes oven ready, data ready, and is relatively easy to work with.

A Simple OpenRefine Example – Tidying Cut’n’Paste Data from a Web Page

Here’s a quick walkthrough of how to use OpenRefine to prepare a simple data file. The original data can be found on a web page that looks like this (h/t The Gravesend Reporter):

polling station list

Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.

Here’s how I want the data to look after I’ve cleaned it:

data in a fusion table

Can you see what needs to be done? Somehow, we need to:

– remove the blank lines;
– generate a second column containing the name of the ward each address applies to;
– remove the colon from the ward name;
– remove the rows that contained the original ward names.

If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:

polling stations

We can also paste the data into a new OpenRefine Project:

paste data into OpenRefine

We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:

OpenRefine parse line data

But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?

OpenRefine data in - more cleaning required

Let’s start by filtering the data to only show rows containing the headers, which we note that we could identify because those rows were the only rows to contain a colon character. Then we can create a second column that duplicates these values.

cleaning data part 1

Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.

open refine leave the data the same

If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.

delete the filter

Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.

fill down

If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.

reset filter, star rows, then remove them...

Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…

we're going to tidy

…by replacing the colon with nothing (an empty string).

tidy the column

Here’s the data – neat and tidy:-)

Neat and tidy...
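
Taken together, the steps above amount to a small algorithm: skip blank lines, treat any line ending in a colon as a ward heading, and tag every other line with the most recent heading. Here's a rough Python sketch of that, assuming (as observed in this data) that only heading lines end with a colon; the function name and the sample lines are made up for illustration.

```python
def tidy_polling_stations(lines):
    """Turn heading-grouped lines into (address, ward) rows."""
    rows = []
    ward = None
    for line in lines:
        text = line.strip()
        if not text:
            continue  # remove the blank lines
        if text.endswith(":"):
            ward = text[:-1].strip()  # heading row: remember it, minus the colon
            continue
        rows.append((text, ward))
    return rows


# Made-up sample in the same general layout as the pasted web page text.
sample = ["Ward One:", "", "Village Hall, High Street", "Library, Main Road", "", "Ward Two:", "School, Mill Lane"]
tidied = tidy_polling_stations(sample)
```

Doing it in one pass like this is the scripted equivalent of the copy column, fill down, delete heading rows and strip colon sequence we just clicked through in OpenRefine.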

To finish, let’s export the data.

prepare to export

How about sending it to a Google Fusion table? (You may be asked to authenticate or verify the request.)

upload to fusion table

And here it is:-)

data in a fusion table

So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)