Keeping Up With OpenRefine – Database Connections

It’s been a few months since I last checked out updates to OpenRefine, but reading a (completed) phase 1 project plan associated with some funding the OpenRefine Foundation received from Google News Labs it looks like database support is on the cards.

Database Table import/export – COMPLETED

Historically, OpenRefine has been limited compared to other data tools in that it does not have a way to connect to a database table. This is especially useful at export time, when there is a need to save a cleaned CSV for example into a database table. Importing from a database is useful also. It can help to join clean data in a database table against messy data in OpenRefine, in order to clean and prepare it for use. Database Drivers exist for many databases such as Oracle, MySQL, Postgres, and even many schema-less databases such as MongoDB. Most database drivers use JDBC which makes it easier for us to develop against, and others typically use a custom Java driver that sometimes is non-trivial to integrate with. Since OpenRefine is built with Java this should be relatively straightforward to utilize existing JDBC drivers for our import/export operations and for support of MongoDB there is a Java driver available.

Looking through the repo, it looks like there are a couples of related PRs:

I’m not sure about the export to a db?

The tests suggest drivers are in place for PostgreSQL, MySQL and MariaDB:

public class DatabaseTestConfig extends DBExtensionTests {

private DatabaseConfiguration mysqlDbConfig;
private DatabaseConfiguration pgsqlDbConfig;
private DatabaseConfiguration mariadbDbConfig;

It also looks like an upgrade to the internal data representation may be being considered: Research Apache Arrow to improve in-memory data model. FWIW, I think Apache Arrow really is one to watch.

Via the OpenRefine Google Group, I also noticed a couple of references to future planned activity / roadmap items:

Phase 2

Front / Backend separation

Scope: completely separating the backend so that an full API can be exposed for all OpenRefine operations and commands. Once the decoupling done, we can move to a modern front end framework and
Deliverable: Functional and documented API covering all the commands available in OpenRefine 3 front end.

Phase 3
R Lang support
Work with community to bring support for R lang via an extension.
https://github.com/OpenRefine/OpenRefine/issues/1226
There is significant use of statistics within News Organizations where the goal of minimizing the back and forth between R tooling and OpenRefine would be explored and assessed by the community.

rrefine is around and needs investigation – https://github.com/vpnagraj/rrefine

Hmmm… rrefine?

rrefine enables users to programmatically trigger data transfer between R and OpenRefine. Using the functions available in this package, you can import, export or delete a project in OpenRefine directly from R. There are several client libraries for automating OpenRefine tasks via Python, nodeJS and Ruby. rrefine extends this functionality to R users.

Okay – that makes me think of the OpenRefine Python Client Library?

But how about that Edit cells > Transform > Language support for R #1226` issue? “This is a feature-request to add R support in Edit cells > Transform > Language.”

That fits in with an earlier thought I had along the lines of “what if OpenRefine was a Jupyter client?” In an imagining frame of mind, this seems to me to offer a couple of potential benefits:

  • if the Transform > Language utility supports hooks into a Jupyter kernel and exposes an executable code cell onto that (state persisting) kernel, and the data can be transferred efficiently using serialisations like feather or deeper hooks into Apache Arrow representations that might be supported in R or Python pandas, then any language with a Jupyter kernel could be used for transformations?
  • if OpenRefine was exposed as a panel in Jupyterlab, which it presumably could be simply by embedding the HTML UI in an IFrame, then it have a role as part of the look and feel of a single working environment, even if it was only loading and saving CSV files into the environment workspace.

But then let’s imagine something a bit more extreme (I’m not sure if / how this might fit into the Jupyterlab architecture, indeed whether it’s possible or just imagine magic, I’m just riffing…): if the data being manipulated within OpenRefine could be synched with a representation of the data being manipulated elsewhere in the Jupyterlab environment, then we could be viewing a dataset in one panel (Jupyterlab has crazy efficient support for viewing large datafiles), manipulating it in an OpenRefine panel, and running analysis scripts over it in a third. The reticulate package suddenly comes to mind here as an example of accessing data objects from one environment in another.

It also strikes me that use cases of the data represented in OpenRefine reflecting updates to the data from the analysis environment are less likely. The analysis should be operating on data after it has been cleaned, rather than passing it to OpenRefine?

PS by the by, if you want to run OpenRefine using the Jupyter ecosystem Binderhub machinery, here’s a proof of concept from @betatim: openrefineder.

Converting Spreadsheet Rows to Text Based Summary Reports Using OpenRefine

In Writing Each Row of a Spreadsheet as a Press Release? I demonstrated how we could generate a simple textual report template that could “textualise” separate rows of a spreadsheet. This template could be applied to each row from a subset of rows to to produce a simple human readable view of the data contained in each of those rows. I picked up on the elements of this post in Robot Journalists or Robot Press Secretaries?, where I reinforced the idea that such an approach was of a similar kind to the approach used in mail merge strategies supported by many office suites.

It also struck me that we could use OpenRefine’s custom template export option to generate a similar sort of report. So in this post I’ll describe a simple recipe for recreating the NHS Complaints review reports from a couple of source spreadsheets using OpenRefine.

This is just a recasting of the approach demonstrated in the Writing Each Row… post, and more fully described in this IPython notebook, so even if you don’t understand Python, it’s probably worth reviewing those just to get a feeling of the steps involved.

To start with, let’s see how we might generate a basic template from the complaints CSV file, loaded in with the setting to parse numerical columns as such.

OpenRefine

The default template looks something like this:

default template

We can see how each the template provides a header slot, for the start of the output, a template applied to each row, a separator to spilt the rows, and a footer.

The jsonize function makes sure the output is suitable for output as a JSON file. We just want to generate text so we can forget that.

Here’s the start of a simple report…

Report for {{cells["Practice_Code"].value}} ({{cells["Year"].value}}):

  Total number of written complaints received:
  - by area: {{cells["Total number of written complaints received"].value}} (of which, {{cells["Total number of written 
complaints upheld"].value}} upheld)
  - by subject: {{cells["Total number of written complaints received 2"].value}} (of which, {{cells["Total number of written 
complaints upheld 2"].value}} upheld)

custom_export _start

The double braces ({{ }} allow you to access GREL statements. Outside the braces, the content is treated as text.

Note that the custom template doesn’t get saved… I tend to write the custom templates in a text editor, then copy and paste them into OpenRefine.

We can also customise the template with some additional logic using the if(CONDITION, TRUE_ACTION, FALSE_ACTION) construction. For example, we might flag a warning that a lot of complaints were upheld:

openrefine template warning

The original demonstration pulled in additional administrative information (practice name and address, for example) from another source spreadsheet. Merging Datasets with Common Columns in Google Refine describes a recipe for merging in data from another dataset. In this case, if our source is the epraccur spreadsheet, we can create an OpenRefine project from the epraccur spreadsheet (use no lines as the header – it doesn’t have a header row) and then merge in data from the epraccur project into the complaints project using the practice code (Column 1 in the epraccur project) as the key column used to add an additional practice name column based on the Practice_Code column in the complaints project – cell.cross("epraccur xls", "Column 1").cells["Column 2"].value[0]

Note that columns can only be merged in one column at a time.

In order to filter the rows so we can generate reports for just the Isle of Wight, we also need to merge in the Parent Organisation Code (Column 15) from the epraccur project. To get Isle of Wight practices, we could then filter on code 10L. If we then used out custom exporter template, we could get just textual reports for the rows corresponding to Isle of Wight GP practices.

nhs openrefine filter

Teasing things apart a bit, we also start to get a feel for a more general process. Firstly, we can create a custom export template to generate a textual representation of each row in a dataset. Secondly, we can use OpenRefine’s filtering tools to select which rows we want to generate reports from, and order them appropriately. Thirdly, we could also generate new columns containing “red flags” or news signals associated with particular rows, and produce a weighted sum column on which to rank items in terms of newsworthiness. We might also want to merge in additional data columns from other sources, and add elements from those in to the template. Finally, we might start to refine the export template further to include additional logic and customisation of the news release output.

See also Putting Points on Maps Using GeoJSON Created by Open Refine for a demo of how to generate a geojson file using the OpenRefine custom template exporter as part of a route to getting points onto a map.

OpenRefine Docker Containers

I had a go at building a couple of docker containers for OpenRefine, one from the latest release and one from the latest code on github:

In order to create the virtual machine, you should:

  • install boot2docker
  • run boot2docker
  • Either: to run with a project directory solely within the container, in the boot2docker terminal, enter the command docker run --name openrefine -d -p 3334:3333 psychemedia/openrefine
  • Or: to run with a project directory mounted from a shared folder on the host, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /path/to/yourSharedDirectory:/mnt/refine --name openrefine psychemedia/openrefine
  • Or: to run with a project directory in a linked data volume, in the boot2docker terminal, enter the command docker run -d -p 3334:3333 -v /mnt/refine --name openrefine psychemedia/openrefine

(To use the latest release rather than a recent build use psychemedia/docker-openrefine rather than psychemedia/openrefine.)

The port number you will be able to find OpenRefine on is given by the first number set in the flag -p NNNN:3333. To access OpenRefine via port 3334, use -p 3334:3333 etc.

OpenRefine will then be available via your browser at the URL http://IPADDRESS:NNNN. To find the required value of IPADDRESS can be found using the command boot2docker ip

The returned IP address (eg 192.168.59.103) is the IP address you can find OpenRefine on, for example: http://192.168.59.103:3334.

PS for a Dockerfile in the current directory, to build an image from the Dockerfile, use something like:

docker build -t myname/mycontainer .

ScreenScraping HTML Web Pages With OpenRefine – Norwegian Oil Company Data

[An old post, rescued from the list of previously unpublished posts…]

Although I use OpenRefine from time time, one thing I don’t tend to use it for is screenscraping HTML web pages – I tend to write Python scripts in Scraperwiki to do this. Writing code is not for everyone, however, so I’ve brushed off my searches of the OpenRefine help pages to come up with this recipe for hacking around with various flavours of company data.

The setting actually comes from OpenOil’s Johnny West:

1) given the companies in a particular spreadsheet… for example “Bayerngas Norge AS” (row 6)
2) plug them into the Norwegian govt’s company registry — http://www.brreg.no/ (second search box down nav bar on the left) – this gives us corporate identifier… so for example… 989490168
3) plug that into purehelp.no — so http://www.purehelp.no/company/details/bayerngasnorgeas/989490168
4) the Aksjonærer at the bottom (the shareholders that hold that company) – their percentages
5) searching OpenCorporates.com with those names to get their corporate identifiers and home jurisdictions
6) mapping that back to the spreadsheet in some way… so for each of the companies with their EITI entry we get their parent companies and home jurisdictions

Let’s see how far we can get…

To start with, I had a look at the two corporate search sites Johnny mentioned. Hacking around with the URLs, there seemed to be a couple of possible simplifications:

– looking up company ID can be constructed around http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS – the link structure has changed since I originally wrote this post, correct form is now http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS&orgform=0&fylke=0&kommune=0&barebedr=false [h/t/ Larssen in the comments.]

http://www.purehelp.no/company/details/989490168 (without company name in URL) appears to work ok, so can get there from company number.

Loading the original spreadsheet data into OpenRefine gives us a spreadsheet that looks like this:

openRefine xls import

So that’s step 1…

We can run step 2 as follows* – create a new column from the company column:

* see the end of the post for an alternative way of obtaining company identifiers using the OpenCorporates reconciliation API…

openRefine add new col

Here’s how we construct the URL:

OpenRefine - get new col by URL

The HTML is a bit of a mess, but by Viewing Source on an example page, we can find a crib that leads us close to the data we require, specifically the fragment detalj.jsp?orgnr= in the URL of the first of the href attributes of the result links.

table to scrape - crib

Using that crib, we can pull out the company ID and the company name for the first result, constructing a name/id pair as follows:

[value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() , value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() ].join('::')

The first part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() – pulls out the company ID from the first search result, extracting it from the URL fragment.

The second part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() – pulls out the company name from the first search result.

We place these two parts into an array and then join them with two colons: [].join('::')

This keeps thing tidy and allows us to check by eye that sensible company names have been found from the original search strings.

open refine - compare names

We can now split the name/ID pair column into two separate columns:

openRefine spilt column into cols

And the result:

openrefne  cols now split

The next step, step 3, requires looking up the company IDs on purehelp. We’ve already see how a new column can be created from a source column by URL, so we just repeat that approach with a new URL pattern:

openrefine add another col by URL

(We could probably reduce the throttle time by an order of magnitude!)

The next step, step 4, is to pull out shareholders and their percentages.

The first step is to grab the shareholder table and each of the rows, which in the original looked like this:

shareholders table

The following hack seems to get us the rows we require:

[REMOVED]

BAH – crappy page sometimes has TWO companyOwnership IDs, when the company has shareholdings in other companies as well as when it has shareholders:-(

fckwt

So much for unique IDs… ****** ******* *** ***** (i.e. not happy:-(

Need to search into table where “Shareholders” is specified in top bar of the table, and I don’t know offhand how to do that using the GREL recipe I was taking because the HTML of the page is really horrible. Bah…. #ffs:-(

Question, in GREL, how do I get the rows in this not-a-table? I need to specify the companyOwnership id in the parent div, and check for the Shareholders text() value in the first child, then ideally miss the title row, then get all the shareholder companies (in this case, there’s just one; better example):

<div id="companyOwnership" class="box">
	<div class="boxHeader">Shareholders:</div>
	<div class="boxContent">
		<div class="row rowHeading">
			<label class="fl" style="width: 70%;">Company name:</label>
			<label class="fl" style="width: 30%;">Percentage share (%)</label>
			<div class="cb"></div>
		</div>
		<div class="row odd">
			<label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label>
			<div class="fr" style="width: 30%;">100.00%</div>
			<div class="cb"></div>
		</div>
	</div>

For now I’m going to take a risky shortcut and assume that the Shareholders (are there always shareholders?) are the last companyOwnership ID on the page:

forEach(value.parseHtml().select('div[id=companyOwnership]')[-1].select('div.row'),e,e).join('::')

openrefine last company ownership

We can then generate one row for each shareholder in OpenRefine:

open refine - spilt

(We’ll need to do some filling in later to cope with the gaps, but no need right now. We also picked up the table header, which has been given it’s own row, which we’ll have to cope with at some point. But again, no need right now.)

For some reason, I couldn’t parse the string for each row (it was late, I was confused!) so I hacked this piecemeal approach to try to take them by surprise…

value.replace(/\s/,' ').replace('<div class="row odd">','').replace('<div class="row even">','').replace('<form>','').replace('<label class="fl" style="width: 70%;">','').replace('<div class="cb"></div>','').replace('</form> </div>','').split('</label>').join('::')

horrible hack openrefine

Using the trick we previously applied to the combined name/ID column, we can split these into two separate columns, one for the shareholder and the other for their percentage holding (I used possibly misleading column names below – should say “Shareholder name”, for example, rather than shareholding 1?):

openrefine column split

We then need to tidy the two columns:

value.replace("<\/a>",'').replace(/.*>/,'')

Note that some of the shareholder companies have identifiers in the website we scraped the data from, and some don’t. We’re going to be wasteful and throw the info away that links the company if it’s there…

value.replace('<div class="fr" style="width: 30%;">','').replace('</div>','').strip()

We now need to do a bit more tidying – fill down on the empty columns in the shareholder company column and also in the original company name and ID [actually – this is not right, as we can see below for the line Altinex Oil Norway AS…? Maybe we can get away with it though?], and filter out the rows that were generated as headers (text facet then select out blank and Fimanavn).

This is what we get:

COmpany ownership

We can then export this file, before considering steps 5 and 6, using the custom exporter:

open refine exporter

Select the columns – including the check column of the name of the company we discovered by searching on the names given in the original spreadsheet… these are the names that the shareholders actually refer to…

column export

And then select the export format:

column export format

Here’s the file: shareholder data (one of the names at least appears not to have worked – Altinex Oil Norway AS). LOoking at the data, I think we also need to take the precaution of using .strip() on the shareholder names.

Here’s the OpenRefine project file to this point [note the broken link pattern for brreg noted at the top of the post and in the comments… The original link will be the one used in the OpenRefine project…]

Maybe export on a filtered version where Shareholding 1 is not empty. Also remove the percentage sign (%) in the shareholding 2 column? ALso note that Andre is “Other”… maybe replace this too?

In order to get the OpenCorporates identifiers, we should be able to just run company names through the OpenCorporates reconciliation service.

Hmmm.. I wonder – do we even have to go that far? From the Norwegian company number, is the OpenCorporates identifier just that number in the Norwegian namespace? So for BAYERNGAS NORGE AS, which has Norwegian company number 989490168, can we look it up directly on OpenCorporates as http://opencorporates.com/companies/no/989490168? It seems like we can…

This means we possibily have an alternative to step 2 – rather than picking up company numbers by searching into and scraping the Norwegian company register, we can reconcile names against the OpenCorporates reconciliation API and then pick up the company numbers from there?

Diff or Chop? Github, CSV data files and OpenRefine

A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:

csv diff

With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such view may soon feature on that site? An example of how it might work is described in James Smith’s ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.

* Though not column sortable? I guess that would detract from Github’s view of showing files as is…? For more discussion on on the rationale for a “Github for data”, see for example Rufus Pollock’s posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.

So far, so esoteric, perhaps. Because you may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare “original” data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.

One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:

openrefine history

Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in the case might be the diff view – that is, a view that highlights the differences between the version of the data file as it is at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff’ed data file at the current step compared to the original datafile) using a simple toggle selector.

A more elaborate approach may allow use to view diffs between the data file at the current step and the previous step, or the current data file and an arbitrary previous step.

Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:

open refine extract

This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?

For example, OpenRefine already integrates with Google Docs to allow users to import and export file from that service.

OPen Refine export to google

So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file?). Note that we might also have to extend the JSON representation, or add another file fragment to the checking, that associates a particular chops file with a particular checkout version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files their associated chops files?)

For OpenRefine to identify which file or files are the actual data files to be pulled from a particular Github repository may at first sight appear problematic, but again the ecosytem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don’t change, the datapackage file might be annotated with a reference to the appropriate chops file?)

And as far as ecosystems go, there are already other pieces of the jigsaw already in place, such as James Smith’s Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.

Do Road Improvements *Really* Create Jobs?

Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone, helping to reduce journey times and create 10,400 new jobs. Really? 10,400 new jobs?

In Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine I had a little poke around some of the data that was used to power a map on a Lib Dems’ website, A Million Jobs:

Liberal Democrats have helped businesses create over 1 million new private sector jobs. Click on the map below to find out what we’ve done where you live.

And then there was the map…

libdems million jobs

One thing we might take away from this as an assumption is that the markers correspond to locations or environs where jobs were created, and that by adding up the number of jobs created at those locations, we would get to a number over a million.

Whilst I was poking through the data that powers the map, I started to think this might be an unwarranted assumption. I also started to wonder about how the “a million jobs” figure was actually calculated?

Using a recipe described in the Critiquing Data Stories post, I pulled out marker descriptions containing the phrase “helping to reduce journey” along with the number of jobs created (?!) associated with those claims, where a number was specified.

Lib Dems Road jobs

Claims were along the lines of:

Summary: Lib Dems in Government have allocated £2,600,000 to fund the A38 Markeaton improvements , helping to reduce journey times and create 12,300 new jobs. The project will also help build 3,300 new homes.

Note that as well as claims about jobs, we can also pull out claims about homes.

Extract homes

If we use OpenRefine’s Custom Tabular Exporter to upload the data to a Google spreadsheet (here) we can use the Google Spreadsheet-as-a-database query tool (as described in Asking Questions of Data – Garment Factories Data Expedition) to sum the total number of jobs “created” by road improvements (from the OpenRefine treatment, I had observed the rows were all distinct – the count of each text facet was 1).

SImple job count query

The sum of jobs “created”? 468, 184. A corresponding sum for the number of homes gives 203,976.

Looking at the refrain through the descriptions, we also notice that the claim is along the lines of: “Lib Dems in Government have allocated £X to fund [road improvement] helping to reduce journey times and create Y new jobs. The project will also help build Z new homes.” Has allocated. So it’s not been spent yet? [T]o create X new jobs. So they haven’t been created yet? And if those jobs are the result of other schemes made possible by road improvements, numbers will be double counted? [W]ill also help build So the home haven’t been built yet, but may well be being claimed as achievements elsewhere?

Note that the numbers I calculated are lower bounds, based on scheme descriptions that contained the specified search phrase and (“helping to reduce journey”) and a job numbers specified according to the pattern detected by the following Jython regular expression:

import re
tmp=value
tmp=re.sub(r'.* creat(e|ing) ([0-9,\.]*) new jobs.*',r'\2',tmp)
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

In addition, the housing numbers were extracted only from rows where a number of jobs was identified by that regular expression, and where they were described in a way that could be extracted using the following the Jython regular expression re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*',r'\1',tmp)

PS I’m reading The Smartest Guys in the Room at the moment, learning about the double counting and accounting creativity employed by Enron, and how confusing publicly reported figures often went unchallenged…

It also makes me wonder about phrases like “up to” providing numbers that are then used when calculating totals?

City Deal jobs - up to

So there’s another phrase to look for, maybe? have agreed a new ‘City Deal’ with

Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine

As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?

Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.

libdems million jobs

If we view source on the page, we can see where the map – and maybe the data being used to power it, comes from…

libdems jobs view source

Ah ha – it’s an embedded map from a Google Fusion Table…

https://www.google.com/fusiontables/embedviz?q=select+col0+from+1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ&viz=MAP&h=false&lat=52.5656923458786&lng=-1.0353351498047232&t=1&z=7&l=col0&y=2&tmplt=3

We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and poppiing it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:

https://www.google.com/fusiontables/DataSource?docid=1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ

Lib dems jobs Fusion tables

The description data is curtailed, but we can see the full description on the card view:

Lib dems fusion tables card

Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.

lib dems jobs data in OpenRefine

This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.

Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:

OpenRefine transpose cell rows to cols2

And as if by magic, we have recreated a workable table:-)

Openrefine rows transposed to cols

If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?

openrefinelibdem jobs text facet

If we peer a bit more closely, we see that some of the numbers relating to job site locations as referred to in the description don’t seem to tally with the number of markers? So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?

counting job sites

What this “example” of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself, (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as we all question that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?

Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:

libdems - have given text

Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:

import re
tmp=value
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
if value==tmp: tmp=''
tmp = tmp.replace(',','')
return tmp

libdems have given amount

Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of thes column creation.

If we cast these values to a number:

openrefine convert given to number

we can then use a numeric facet to help us explore the amounts.

libdems value numeric facet

In this case, we notice that there weren’t that many distinct factors containing the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:

  • Summary: Lib Dems in Government have secured a £73,000 grant for …
  • Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
  • Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
  • Summary: Lib Dems in Government have given £982,000 to …

Here’s an improved regular expression for parsing out some more of these amounts:

import re
tmp=value
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have secured a ([0-9,\.]*).*',r'\1',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)

if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

So now we can start to identify some of the bigger grants…

libdems jobs big amounts

More to add? eg around:
...have secured a £150,000 grant...
Summary: Lib Dems have given a £1,571,000 grant...
Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
– Here’s another for a ‘possible spend’ column? Summary: Lib Dems in Government have allocated £300,000 to...

Note: once you start poking around at these descriptions, you find a wealth of things like: “Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone , helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes.” Leading to ask the question: how many of the “one million jobs” arise from improvements to road junctions…?

how many jobs from road junction improvements?

In order to address this question, we might to start have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:

import re
tmp=value
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
if value==tmp:tmp=''
tmp=tmp.replace(',','')
return tmp

Lib dems jobs created

If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:

  • £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
  • £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant

But that’s for another day…

If you want to play with the data yourself, you can find it here.