Posts Tagged ‘openrefine’
[An old post, rescued from the list of previously unpublished posts…]
Although I use OpenRefine from time to time, one thing I don't tend to use it for is screenscraping HTML web pages – I tend to write Python scripts in Scraperwiki to do that. Writing code is not for everyone, however, so I've brushed off my searches of the OpenRefine help pages to come up with this recipe for hacking around with various flavours of company data.
The setting actually comes from OpenOil’s Johnny West:
1) given the companies in a particular spreadsheet… for example “Bayerngas Norge AS” (row 6)
2) plug them into the Norwegian govt's company registry — http://www.brreg.no/ (second search box down, in the nav bar on the left) – this gives us a corporate identifier… so for example… 989490168
3) plug that into purehelp.no — so http://www.purehelp.no/company/details/bayerngasnorgeas/989490168
4) the Aksjonærer at the bottom (the shareholders that hold that company) – their percentages
5) searching OpenCorporates.com with those names to get their corporate identifiers and home jurisdictions
6) mapping that back to the spreadsheet in some way… so for each of the companies with their EITI entry we get their parent companies and home jurisdictions
Let’s see how far we can get…
To start with, I had a look at the two corporate search sites Johnny mentioned. Hacking around with the URLs, there seemed to be a couple of possible simplifications:
- company ID lookups can be constructed around a URL of the form http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS – the link structure has changed since I originally wrote this post; the correct form is now http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS&orgform=0&fylke=0&kommune=0&barebedr=false [h/t Larssen in the comments.]
- http://www.purehelp.no/company/details/989490168 (without the company name in the URL) appears to work ok, so we can get there from the company number alone.
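Hacking the URLs suggests the lookups can be scripted; here's a rough Python sketch (function names mine, URL patterns as observed above):

```python
from urllib.parse import quote_plus

def brreg_search_url(company_name):
    # Search the Norwegian company register by name (URL pattern noted above)
    base = "http://w2.brreg.no/enhet/sok/treffliste.jsp"
    return (base + "?navn=" + quote_plus(company_name)
            + "&orgform=0&fylke=0&kommune=0&barebedr=false")

def purehelp_url(orgnr):
    # purehelp.no works with just the organisation number, no name needed
    return "http://www.purehelp.no/company/details/" + str(orgnr)

print(brreg_search_url("Bayerngas Norge AS"))
print(purehelp_url(989490168))
```
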
Loading the original spreadsheet data into OpenRefine gives us a spreadsheet that looks like this:
So that’s step 1…
We can run step 2 as follows* – create a new column from the company column:
* see the end of the post for an alternative way of obtaining company identifiers using the OpenCorporates reconciliation API…
Here’s how we construct the URL:
The HTML is a bit of a mess, but by Viewing Source on an example page, we can find a crib that leads us close to the data we require, specifically the fragment detalj.jsp?orgnr= in the URL of the first of the href attributes of the result links.
Using that crib, we can pull out the company ID and the company name for the first result, constructing a name/id pair as follows:
[value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() , value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlText() ].join('::')
The first part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() – pulls out the company ID from the first search result, extracting it from the URL fragment.
The second part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]").htmlText() – pulls out the company name from the first search result.
We place these two parts into an array and then join them with two colons: .join('::')
This keeps things tidy and allows us to check by eye that sensible company names have been found from the original search strings.
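For anyone who'd rather see the logic in Python than GREL, here's a rough equivalent of that id::name extraction, run against a simplified stand-in for the brreg results HTML (the real page is messier):

```python
import re

# Simplified stand-in for a brreg search results link
html = '<a href="detalj.jsp?orgnr=989490168">BAYERNGAS NORGE AS</a>'

# Pull out the company ID from the href and the company name from the link text
m = re.search(r'href="detalj\.jsp\?orgnr=(\d+)"[^>]*>([^<]+)</a>', html)
pair = m.group(1) + "::" + m.group(2)
print(pair)  # 989490168::BAYERNGAS NORGE AS
```
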
We can now split the name/ID pair column into two separate columns:
And the result:
The next step, step 3, requires looking up the company IDs on purehelp. We've already seen how a new column can be created from a source column by URL, so we just repeat that approach with a new URL pattern:
(We could probably reduce the throttle time by an order of magnitude!)
The next step, step 4, is to pull out shareholders and their percentages.
The first step is to grab the shareholder table and each of the rows, which in the original looked like this:
The following hack seems to get us the rows we require:
BAH – crappy page sometimes has TWO companyOwnership IDs, when the company has shareholdings in other companies as well as when it has shareholders:-(
So much for unique IDs… ****** ******* *** ***** (i.e. not happy:-(
Need to search into the table where "Shareholders" is specified in the top bar of the table, and I don't know offhand how to do that using the GREL recipe I was using, because the HTML of the page is really horrible. Bah…. #ffs:-(
Question, in GREL, how do I get the rows in this not-a-table? I need to specify the companyOwnership id in the parent div, and check for the Shareholders text() value in the first child, then ideally miss the title row, then get all the shareholder companies (in this case, there’s just one; better example):
<div id="companyOwnership" class="box">
  <div class="boxHeader">Shareholders:</div>
  <div class="boxContent">
    <div class="row rowHeading">
      <label class="fl" style="width: 70%;">Company name:</label>
      <label class="fl" style="width: 30%;">Percentage share (%)</label>
      <div class="cb"></div>
    </div>
    <div class="row odd">
      <label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label>
      <div class="fr" style="width: 30%;">100.00%</div>
      <div class="cb"></div>
    </div>
  </div>
</div>
For now I’m going to take a risky shortcut and assume that the Shareholders (are there always shareholders?) are the last companyOwnership ID on the page:
We can then generate one row for each shareholder in OpenRefine:
(We'll need to do some filling in later to cope with the gaps, but no need right now. We also picked up the table header, which has been given its own row, which we'll have to cope with at some point. But again, no need right now.)
For some reason, I couldn’t parse the string for each row (it was late, I was confused!) so I hacked this piecemeal approach to try to take them by surprise…
value.replace(/\s/,' ').replace('<div class="row odd">','').replace('<div class="row even">','').replace('<form>','').replace('<label class="fl" style="width: 70%;">','').replace('<div class="cb"></div>','').replace('</form> </div>','').split('</label>').join('::')
Using the trick we previously applied to the combined name/ID column, we can split these into two separate columns, one for the shareholder and the other for their percentage holding (I used possibly misleading column names below – it should say "Shareholder name", for example, rather than "shareholding 1"):
We then need to tidy the two columns:
Note that some of the shareholder companies have identifiers in the website we scraped the data from, and some don't. We're going to be wasteful and throw away the info that links to the company record where it is there…
value.replace('<div class="fr" style="width: 30%;">','').replace('</div>','').strip()
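Putting the row-splitting and tidying steps together, a Python rendering of the same cleanup (the sample row is reconstructed from the HTML fragment shown earlier):

```python
import re

# Sample shareholder row, as per the page markup shown above
row = ('<div class="row odd"> <label class="fl" style="width: 70%;">'
      'Shell Exploration And Production Holdings</label> '
      '<div class="fr" style="width: 30%;">100.00%</div> '
      '<div class="cb"></div> </div>')

# Mirror the piecemeal GREL replaces: normalise whitespace, strip wrappers
tmp = re.sub(r'\s+', ' ', row)
for junk in ['<div class="row odd">', '<div class="row even">', '<form>',
             '<label class="fl" style="width: 70%;">',
             '<div class="cb"></div>', '</form> </div>']:
    tmp = tmp.replace(junk, '')

# Split at the closing label tag to get name::percentage parts
pair = '::'.join(tmp.split('</label>'))
name = pair.split('::')[0].strip()
pct = (pair.split('::')[1]
       .replace('<div class="fr" style="width: 30%;">', '')
       .replace('</div>', '').strip())
print(name + ' | ' + pct)  # Shell Exploration And Production Holdings | 100.00%
```
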
We now need to do a bit more tidying – fill down on the empty columns in the shareholder company column and also in the original company name and ID [actually – this is not right, as we can see below for the line Altinex Oil Norway AS…? Maybe we can get away with it though?], and filter out the rows that were generated as headers (text facet then select out blank and Fimanavn).
This is what we get:
We can then export this file, before considering steps 5 and 6, using the custom exporter:
Select the columns – including the check column of the name of the company we discovered by searching on the names given in the original spreadsheet… these are the names that the shareholders actually refer to…
And then select the export format:
Here's the file: shareholder data (one of the names at least appears not to have worked – Altinex Oil Norway AS). Looking at the data, I think we also need to take the precaution of using .strip() on the shareholder names.
Here’s the OpenRefine project file to this point [note the broken link pattern for brreg noted at the top of the post and in the comments… The original link will be the one used in the OpenRefine project…]
Maybe export on a filtered version where Shareholding 1 is not empty. Also remove the percentage sign (%) in the shareholding 2 column? Also note that Andre is "Other"… maybe replace this too?
In order to get the OpenCorporates identifiers, we should be able to just run company names through the OpenCorporates reconciliation service.
Hmmm.. I wonder – do we even have to go that far? From the Norwegian company number, is the OpenCorporates identifier just that number in the Norwegian namespace? So for BAYERNGAS NORGE AS, which has Norwegian company number 989490168, can we look it up directly on OpenCorporates as http://opencorporates.com/companies/no/989490168? It seems like we can…
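If that holds, the OpenCorporates lookup may reduce to simple URL construction (helper name mine; the no jurisdiction code for Norway is as in the example above):

```python
def opencorporates_url(jurisdiction, company_number):
    # e.g. Norwegian companies appear to live under the 'no' namespace
    return "http://opencorporates.com/companies/%s/%s" % (jurisdiction, company_number)

print(opencorporates_url("no", 989490168))
# http://opencorporates.com/companies/no/989490168
```
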
This means we possibly have an alternative to step 2 – rather than picking up company numbers by searching into and scraping the Norwegian company register, we can reconcile names against the OpenCorporates reconciliation API and then pick up the company numbers from there?
A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:
With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such a view may soon feature on that site? An example of how it might work is described in James Smith's ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.
* Though not column sortable? I guess that would detract from Github's view of showing files as is…? For more discussion on the rationale for a "Github for data", see for example Rufus Pollock's posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.
So far, so esoteric, perhaps. You may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare "original" data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.
One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:
Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in this case might be the diff view – that is, a view that highlights the differences between the version of the data file as it stands at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff'ed data file at the current step compared to the original datafile) using a simple toggle selector.
A more elaborate approach might allow us to view diffs between the data file at the current step and the previous step, or between the current data file and an arbitrary previous step.
Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:
This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?
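As a toy illustration of the idea – a data file plus an ordered change log is enough to reconstruct the edited file – consider the following sketch (the operation format here is invented purely for illustration; OpenRefine's actual JSON operation history is much richer):

```python
# Original data rows, plus an ordered log of change operations ("chops").
# Replaying the log over the original reconstructs the edited file.
rows = [{"name": " Bayerngas Norge AS "}, {"name": "Altinex Oil Norway AS"}]

chops = [
    {"op": "strip", "column": "name"},   # trim whitespace
    {"op": "upper", "column": "name"},   # normalise case
]

for chop in chops:
    for row in rows:
        if chop["op"] == "strip":
            row[chop["column"]] = row[chop["column"]].strip()
        elif chop["op"] == "upper":
            row[chop["column"]] = row[chop["column"]].upper()

print(rows[0]["name"])  # BAYERNGAS NORGE AS
```
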
For example, OpenRefine already integrates with Google Docs to allow users to import and export file from that service.
So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file)? Note that we might also have to extend the JSON representation, or add another file fragment to the check-in, that associates a particular chops file with a particular checked-out version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files and their associated chops files?)
For OpenRefine to identify which file or files are the actual data files to be pulled from a particular Github repository may at first sight appear problematic, but again the ecosystem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don't change, the datapackage file might be annotated with a reference to the appropriate chops file?)
And as far as ecosystems go, there are other pieces of the jigsaw already in place, such as James Smith's Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.
Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone, helping to reduce journey times and create 10,400 new jobs. Really? 10,400 new jobs?
In Critiquing Data Stories: Working LibDems Job Creation Data Map with OpenRefine I had a little poke around some of the data that was used to power a map on a Lib Dems’ website, A Million Jobs:
Liberal Democrats have helped businesses create over 1 million new private sector jobs. Click on the map below to find out what we’ve done where you live.
And then there was the map…
One thing we might take away from this as an assumption is that the markers correspond to locations or environs where jobs were created, and that by adding up the number of jobs created at those locations, we would get to a number over a million.
Whilst I was poking through the data that powers the map, I started to think this might be an unwarranted assumption. I also started to wonder about how the “a million jobs” figure was actually calculated?
Using a recipe described in the Critiquing Data Stories post, I pulled out marker descriptions containing the phrase “helping to reduce journey” along with the number of jobs created (?!) associated with those claims, where a number was specified.
Claims were along the lines of:
Summary: Lib Dems in Government have allocated £2,600,000 to fund the A38 Markeaton improvements , helping to reduce journey times and create 12,300 new jobs. The project will also help build 3,300 new homes.
Note that as well as claims about jobs, we can also pull out claims about homes.
If we use OpenRefine’s Custom Tabular Exporter to upload the data to a Google spreadsheet (here) we can use the Google Spreadsheet-as-a-database query tool (as described in Asking Questions of Data – Garment Factories Data Expedition) to sum the total number of jobs “created” by road improvements (from the OpenRefine treatment, I had observed the rows were all distinct – the count of each text facet was 1).
The sum of jobs "created"? 468,184. A corresponding sum for the number of homes gives 203,976.
Looking at the refrain through the descriptions, we also notice that the claim is along the lines of: "Lib Dems in Government have allocated £X to fund [road improvement], helping to reduce journey times and create Y new jobs. The project will also help build Z new homes." Has allocated. So it's not been spent yet? [T]o create Y new jobs. So they haven't been created yet? And if those jobs are the result of other schemes made possible by road improvements, numbers will be double counted? [W]ill also help build. So the homes haven't been built yet, but may well be being claimed as achievements elsewhere?
Note that the numbers I calculated are lower bounds, based on scheme descriptions that contained the specified search phrase ("helping to reduce journey") and a job number specified according to the pattern detected by the following Jython regular expression:
tmp=re.sub(r'.* creat(e|ing) ([0-9,\.]*) new jobs.*',r'\2',tmp)
In addition, the housing numbers were extracted only from rows where a number of jobs was identified by that regular expression, and where they were described in a way that could be extracted using the following Jython regular expression: re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*',r'\1',tmp)
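Trying those two Jython expressions against the example claim quoted earlier:

```python
import re

# The A38 Markeaton example description from the map data
desc = ("Summary: Lib Dems in Government have allocated £2,600,000 to fund the "
        "A38 Markeaton improvements , helping to reduce journey times and create "
        "12,300 new jobs. The project will also help build 3,300 new homes.")

jobs = re.sub(r'.* creat(e|ing) ([0-9,\.]*) new jobs.*', r'\2', desc)
homes = re.sub(r'.* The project will also help build ([0-9,\.]*) new homes.*', r'\1', desc)
print(jobs, homes)  # 12,300 3,300
```
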
PS I’m reading The Smartest Guys in the Room at the moment, learning about the double counting and accounting creativity employed by Enron, and how confusing publicly reported figures often went unchallenged…
It also makes me wonder about phrases like “up to” providing numbers that are then used when calculating totals?
So there’s another phrase to look for, maybe? have agreed a new ‘City Deal’ with …
As well as creating data stories, should the role of a data journalist be to critique data stories put out by governments, companies, and political parties?
Via a tweet yesterday I saw a link to a data powered map from the Lib Dems (A Million Jobs), which claimed to illustrate how, through a variety of schemes, they had contributed to the creation of a million private sector jobs across the UK. Markers presumably identify where the jobs were created, and a text description pop up provides information about the corresponding scheme or initiative.
If we view source on the page, we can see where the map – and maybe the data being used to power it, comes from…
Ah ha – it’s an embedded map from a Google Fusion Table…
We can view the table itself by grabbing the key – 1whG2X7lpAT5_nfAfuRPUc146f0RVOpETXOwB8sQ – and popping it into a standard URL (grabbed from viewing another Fusion Table within Fusion Tables itself) of the form:
The description data is curtailed, but we can see the full description on the card view:
Unfortunately, downloads of the data have been disabled, but with a tiny bit of thought we can easily come up with a tractable, if crude, way of getting the data… You may be able to work out how when you see what it looks like when I load it into OpenRefine.
This repeating pattern of rows is one that we might often encounter in data sets pulled from reports or things like PDF documents. To be able to usefully work with this data, it would be far easier if it was arranged by column, with the groups-of-three row records arranged instead as a single row spread across three columns.
Looking through the OpenRefine column tools menu, we find a transpose tool that looks as if it may help with that:
And as if by magic, we have recreated a workable table:-)
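The effect of that transpose can be sketched in Python: fold a single column whose rows repeat in groups of three into rows of three columns (cell values invented here):

```python
# One column whose rows repeat in groups of three...
cells = ["name1", "postcode1", "desc1",
         "name2", "postcode2", "desc2"]

# ...folded into one three-column row per group, as the transpose tool does
rows = [cells[i:i + 3] for i in range(0, len(cells), 3)]
print(rows)  # [['name1', 'postcode1', 'desc1'], ['name2', 'postcode2', 'desc2']]
```
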
If we generate a text facet on the descriptions, we can look to see how many markers map onto the same description (presumably, the same scheme?)
If we peer a bit more closely, we see that some of the numbers relating to job site locations as referred to in the description don’t seem to tally with the number of markers? So what do the markers represent, and how do they relate to the descriptions? And furthermore – what do the actual postcodes relate to? And where are the links to formal descriptions of the schemes referred to?
What this "example" of data journalistic practice by the Lib Dems shows is how it can generate a whole wealth of additional questions, both from a critical reading just of the data itself (for example, trying to match mentions of job locations with the number of markers on the map or rows referring to that scheme in the table), as well as questions that lead on from the data – where can we find more details about the local cycling and green travel scheme that was awarded £590,000, for example?
Using similar text processing techniques to those described in Analysing UK Lobbying Data Using OpenRefine, we can also start trying to pull out some more detail from the data. For example, by observation we notice that the phrase Summary: Lib Dems in Government have given a £ starts many of the descriptions:
Using a regular expression, we can pull out the amounts that are referred to in this way and create a new column containing these values:
tmp = re.sub(r'Summary: Lib Dems in Government have given a £([0-9,\.]*).*', r'\1', tmp)
if value==tmp: tmp=''
tmp = tmp.replace(',','')
Note that there may be other text conventions describing amounts awarded that we could also try to extract as part of this column creation.
If we cast these values to a number:
we can then use a numeric facet to help us explore the amounts.
In this case, we notice that there weren't that many descriptions containing the text construction we parsed, so we may need to do a little more work there to see what else we can extract. For example:
- Summary: Lib Dems in Government have secured a £73,000 grant for …
- Summary: Lib Dems in Government have secured a share of a £23,000,000 grant for … – we might not want to pull this into a “full value” column if they only got a share of the grant?
- Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant …
- Summary: Lib Dems in Government have given £982,000 to …
Here’s an improved regular expression for parsing out some more of these amounts:
tmp=re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*',r'\2',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have secured a £([0-9,\.]*).*',r'\1',tmp)
tmp=re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*',r'\2',tmp)
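Wrapping those three expressions into a little function and trying it against the description patterns listed above (descriptions abbreviated; helper name mine; note the £ sign needs to be included in the "secured a" pattern for the £73,000 example to parse):

```python
import re

def amount(desc):
    # Apply the three extraction patterns in turn; whichever matches
    # replaces the description with just the amount
    tmp = desc
    tmp = re.sub(r'Summary: Lib Dems in Government have given (a )?£([0-9,\.]*).*', r'\2', tmp)
    tmp = re.sub(r'Summary: Lib Dems in Government have secured a £([0-9,\.]*).*', r'\1', tmp)
    tmp = re.sub(r'Summary: Lib Dems in Government have given ([^a]).* a £([0-9,\.]*) grant.*', r'\2', tmp)
    return tmp

print(amount("Summary: Lib Dems in Government have given £982,000 to improve something."))
print(amount("Summary: Lib Dems in Government have secured a £73,000 grant for a project."))
print(amount("Summary: Lib Dems in Government have given local business AJ Woods Engineering Ltd a £850,000 grant."))
```
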
So now we can start to identify some of the bigger grants…
More to add? eg around:
– ...have secured a £150,000 grant...
– Summary: Lib Dems have given a £1,571,000 grant...
– Summary: Lib Dems in Government are giving £10,000,000 to... (though maybe this should go in an ‘are giving’ column, rather than ‘have given’, cf. “will give” also…?)
– Here’s another for a ‘possible spend’ column? Summary: Lib Dems in Government have allocated £300,000 to...
Note: once you start poking around at these descriptions, you find a wealth of things like: "Summary: Lib Dems in Government have allocated £300,000 to fund the M20 Junctions 6 to 7 improvement, Maidstone , helping to reduce journey times and create 10,400 new jobs. The project will also help build 8,400 new homes." Leading us to ask the question: how many of the "one million jobs" arise from improvements to road junctions…?
In order to address this question, we might start to have a go at pulling out the number of jobs that it is claimed various schemes will create, as this column generator starts to explore:
tmp = re.sub(r'.* creat(e|ing) ([0-9,\.]*) jobs.*', r'\2', tmp)
If we start to think analytically about the text, we start to see there may be other structures we can attack… For example:
- £23,000,000 grant for local business ADS Group. … – here we might be able to pull out what an amount was awarded for, or to whom it was given.
- £950,000 to local business/project A45 Northampton to Daventry Development Link – Interim Solution A45/A5 Weedon Crossroad Improvements to improve local infastructure, creating jobs and growth – here we not only have the recipient but also the reason for the grant
But that’s for another day…
If you want to play with the data yourself, you can find it here.
I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…
So here's one example of a data-related question we might ask (where in this case by "data" I mean information that is a) available in electronic form, and b) representable in a structured way): are the candidates standing in different seats local to that ward/electoral division? By "local", I mean – can they vote in that ward by virtue of having a home address that lays within that ward?
Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):
Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:
So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – "local"? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect of living within, or outside of, the electoral division a candidate is standing in?
In this post, I'll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some of the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in this post: Mashing up electoral data. Thanks, Bruce:-)
Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:
In OpenRefine, we can Create a New Project and then import the data directly:
The data is in comma-separated (CSV) format, so let's specify that:
We can then name and create the project and we’re ready to start…
…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.
Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.
And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:
Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT
Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT
The %20 is a character encoding for a space. In this case, we can also use a +.
So – to get information about the electoral division an address lays in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)
Hmmm…I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.)
To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.
What we want to do then is to create a new, derived column based on the address:
And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:
Here’s the result…
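The extraction logic itself is a one-liner in Python terms (the address here is invented for illustration; the postcode is the MapIt example one):

```python
# Made-up address in the format described: postcode as the last comma-separated chunk
address = "The Old Farmhouse, Main Road, Somewhere, Isle of Wight, PO36 0JT"

# Split at commas, take the last chunk, strip surrounding whitespace
postcode = address.split(",")[-1].strip()
print(postcode)  # PO36 0JT
```
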
Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:
When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:
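In Python terms, the escaping and URL construction might look like this (helper name mine; MapIt URL pattern as shown above):

```python
from urllib.parse import quote

def mapit_url(postcode):
    # Spaces become %20 (a + would also work for this service)
    return "http://mapit.mysociety.org/postcode/" + quote(postcode)

print(mapit_url("PO36 0JT"))  # http://mapit.mysociety.org/postcode/PO36%200JT
```
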
The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.
When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…
We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:
We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:
filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )['name']
to create a new column containing the electoral division:
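In Python, where the JSON parses straight into a dictionary keyed by area ID, the same filter is rather more direct – which is why the GREL needs that bracket-stripping trick (sample data shaped like the MapIt response; area names invented):

```python
import json

# Sample of the shape the GREL expression above is unpicking: MapIt's
# "areas" object is keyed by numeric area IDs
mapit = json.loads('''{"areas": {
    "123456": {"type": "UTE", "name": "Example Electoral Division"},
    "123457": {"type": "WMC", "name": "Example Constituency"}}}''')

# Keep just the unitary authority electoral division (type UTE) names
division = [a["name"] for a in mapit["areas"].values() if a["type"] == "UTE"]
print(division)  # ['Example Electoral Division']
```
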
Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:
If we collapse down the spare columns, we get a clearer picture:
If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):
We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…
Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)
There aren't too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn't be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived "outside". This data could reasonably be displayed using a stacked bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson's Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?
One more thing we could try to do is to geocode the postcode of the address of each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into forms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:
'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')
The data, as imported from the Ordnance Survey, looks something like this:
As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.
Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart Harrison/@pezholio, ex of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.
The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:
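In Python terms, the URL construction amounts to a simple string template (a sketch only; the `easting` and `northing` parameter names come from the URL form above, but how you pull the values out of the Ordnance Survey response is up to you):

```python
def eastnorth_url(easting, northing):
    # Build the uk-postcodes.com conversion URL from an easting/northing pair.
    return ("http://www.uk-postcodes.com/eastingnorthing.php"
            "?easting={}&northing={}".format(easting, northing))

# e.g. eastnorth_url(464447, 89495)
```

In OpenRefine itself, the equivalent would be an “Add column by fetching URLs” step whose GREL expression concatenates the base URL with the easting and northing cell values.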
Here’s what the returned lat/long data looks like:
A similar trick can be used to generate a column containing just the longitude data.
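The extraction itself is just a JSON lookup. Here’s a sketch of the idea in Python; note that the payload shape shown (a `geo` object with `lat` and `lng` keys) is an assumption for illustration, and the real service’s key names may differ:

```python
import json

# Hypothetical response shape - the actual key names are an assumption.
payload = '{"postcode": "PO30 1XX", "geo": {"lat": 50.69, "lng": -1.29}}'

def extract_latlng(raw):
    # Parse the JSON response and pull out the co-ordinate pair.
    data = json.loads(raw)
    return data["geo"]["lat"], data["geo"]["lng"]
```

The GREL equivalent would chain `value.parseJson()` with the appropriate key lookups in a “Add column based on this column” step.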
We can then export a view over the data to a CSV file, or direct to Google Fusion tables.
With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:
Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:
We can inspect the location data using a more convenient “natural” view over it…
By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:
(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Tables lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)
We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:
Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “non-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:
Ah – oops… It seems we have a “data quality” issue, albeit a minor one…
So, what do we learn from all this? One takeaway for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possibly! caveat emptor applies!) that might confirm, help develop, or even cause us to rethink, the story we are working towards telling, based on the support the data gives us.
As well as being a great tool for cleaning data, OpenRefine can also be used to good effect when you need to wrangle a dataset into another shape. Take this set of local election results published by the Isle of Wight local online news blog, onthewight.com:
There’s lots of information in there (rank of each candidate for each electoral division, votes cast per candidate, size of electorate for the division, and hence percentage turnout, and so on), and it’s very nearly available in a ready data format – that is, a data format that is ready for reuse… Something like this, for example:
Or how about something like this, that shows the size of the electorate for each ward:
So how can we get from the OnTheWight results into a ready data format?
Let’s start by copying all the data from OnTheWight (click into the results frame, select all (ctrl-A) and copy (ctrl-C); I’ve also posted a copy of the data I grabbed here*), then paste the data into a new OpenRefine project:
* there were a couple of data quality issues (now resolved in the sheet published by OnTheWight) which relate to the archived data file/data used in this walkthrough. Here are the change notes from @onTheWight:
_Corrected vote numbers
Totland - Winning votes wrong - missed zero off end - 420 not 42
Brading, St Helens & Bembridge - Mike Tarrant (UKIP) got 741 not 714
_Votes won by figures - filled in
Lots of the 'Votes won by figures' had the wrong number in them. It's one of the few figures that needed a manual formula update and in the rush of results (you heard how fast they come), it just wasn't possible.
'Postal votes (inc)' line inserted between 'Total votes cast' and 'Papers spoilt'
Deleted an empty row from Ventnor West
The data format is “tab separated”, so we can import it as such. We might as well get rid of the blank lines at the same time.
Here’s what we end up with:
The data format I want has a column specifying the ward each candidate stood in. Let’s start by creating a new column that is a copy of the column that has the Electoral Division names in it:
Let’s define the new column as having exactly the same value as the original column:
Now we start puzzling based on what we want to achieve bearing in mind what we can do with OpenRefine. (Sometimes there are many ways of solving a problem, sometimes there is only one, sometimes there may not be any obvious route…)
The Electoral Division column contains the names of the Electoral Divisions on some rows, and numbers (highlighted green) on others. If we identify the rows containing numbers in that column, we can blank them out… The Numeric facet will let us do that:
Select just the rows containing a numeric value in the Electoral Division column, and then replace those values with blanks.
Then remove the numeric facet filter:
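What the numeric facet plus blanking achieves can be sketched in Python terms (a toy model of the column, not OpenRefine’s own machinery; the ward names are illustrative):

```python
def blank_numerics(column):
    # Replace purely numeric cell values with None (i.e. a blank cell),
    # leaving the Electoral Division names untouched.
    out = []
    for cell in column:
        out.append(None if str(cell).strip().isdigit() else cell)
    return out

# blank_numerics(["Arreton & Newchurch", "1", "2", "Ashey"])
# → ["Arreton & Newchurch", None, None, "Ashey"]
```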
Here’s the result, much tidier:
Before we fill in the blanks with the Electoral Division names, let’s just note that there is at least one “messy” row in there corresponding to Winning Margin. We don’t really need that row – we can always calculate it – so let’s remove it. One way of doing this is to display just the rows containing the “Winning margin” string in column three, and then delete them. We can use the Text filter to highlight the rows:
Simply state the value you want to filter on and blitz the matching rows…
…then remove the filter:
We can now fill down the blanks in the Electoral Division column:
Fill down starts at the top of the column then works its way down, filling in blank cells in that column with whatever was in the cell immediately above.
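The fill-down behaviour just described can be sketched like this in Python (a toy model for the sake of the explanation, not how OpenRefine implements it):

```python
def fill_down(column):
    # Walk down the column, copying the most recent non-blank value
    # into any blank cells encountered along the way.
    last = None
    out = []
    for cell in column:
        if cell is None or cell == "":
            out.append(last)
        else:
            last = cell
            out.append(cell)
    return out

# fill_down(["Arreton", None, None, "Ashey", None])
# → ["Arreton", "Arreton", "Arreton", "Ashey", "Ashey"]
```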
Looking at the data, I notice the first row is also “unwanted”. If we flag it, we can then facet/filter on that row from the All menu:
Then we can Remove all matching rows from the cell menu as we did above, then remove the facet.
Now we can turn to just getting the data relating to votes cast per candidate (we could also leave in the other returns). Let’s use a trick we’ve used before – facet by numeric:
And then this time just retain the non-numeric rows.
Hmmm… before we remove it, this data could be worth keeping in its own right. Let’s rename the columns:
Now let’s just make those comma-mangled numbers into numbers, by transforming them:
The transform we’re going to use is to replace the comma by nothing:
Then convert the values to a number type.
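Taken together, the two steps (strip the comma, then coerce to a number) might be sketched as follows – in GREL you would chain a replace with a number conversion; here is the same idea as a Python toy:

```python
def to_number(cell):
    # Strip thousands separators, then convert the string to an int.
    return int(cell.replace(",", ""))

# to_number("1,234") → 1234
```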
We can then do the same thing for the Number on Roll column:
We seem to have a rogue row in there too – a Labour candidate with a 0% poll. We can flag that row and delete it as we did above.
There also seem to be a couple of other scrappy rows – the overall count and another rogue percentage-bearing line – so again we can flag these, do an All facet on them, remove all matching rows and then remove the flag facet.
Having done that, we can take the opportunity to export the data.
Using the custom tabular exporter, we can select the columns we wish to export.
Then we can export the data to the desktop as a file in a variety of formats:
Or we can upload it to a Google document store, such as Google Spreadsheets or Google Fusion Tables:
Here’s the data I uploaded.
If we go back to the results for candidates by ward, we can export that data too, although I’d be tempted to do a little bit more tidying, for example by removing the “Votes won by” rows, and maybe also the Total Votes Cast column. I’d probably also rename what is now the Candidates column to something more meaningful! (Can you work out how?!;-)
When we upload the data, we can tweak the column ordering first so that the data makes a little more sense at first glance:
Here’s what I uploaded to a Google spreadsheet:
So – there you have it… another OpenRefine walkthrough. Part conversation with the data, part puzzle. As with most puzzles, once you start to learn the tricks, it becomes ever easier… Or you can start taking on ever more complex puzzles…
Although you may not realise it, most of the work related to generating raw graphics has now been done. Once the data has a reasonable shape to it, it becomes “oven ready” – that is, data ready – and relatively easy to work with.
Take a minute or two to try to get your head round how this data is structured… What do you see? I see different groups of addresses, one per line, separated by blank lines and grouped by “section headings” (ward names perhaps?). The ward names (if that’s what they are) are uniquely identified by the colon that ends the line they’re on. None of the actual address lines contain a colon.
Here’s how I want the data to look after I’ve cleaned it:
Can you see what needs to be done? Somehow, we need to:
- remove the blank lines;
- generate a second column containing the name of the ward each address applies to;
- remove the colon from the ward name;
- remove the rows that contained the original ward names.
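The steps above can be sketched end to end in Python (a toy model of the paste-and-clean process, not OpenRefine itself; the ward names and addresses are made up for illustration):

```python
def wardify(lines):
    # Turn a list of pasted lines into (address, ward) pairs:
    # skip blank lines, treat colon-terminated lines as ward headers,
    # and tag each following address with the current ward name.
    ward = None
    rows = []
    for line in lines:
        line = line.strip()
        if not line:
            continue  # drop blank lines
        if line.endswith(":"):
            ward = line.rstrip(":")  # remember the ward name...
            continue  # ...and drop the header row itself
        rows.append((line, ward))
    return rows

# wardify(["Arreton:", "", "1 High St", "2 Main Rd", "Ashey:", "3 Lane End"])
# → [("1 High St", "Arreton"), ("2 Main Rd", "Arreton"), ("3 Lane End", "Ashey")]
```

In OpenRefine we get the same result from a sequence of interactive steps, as the walkthrough below shows.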
If we highlight the data in the web page, copy it and paste it into a text editor, it looks like this:
We can also paste the data into a new OpenRefine Project:
We can use OpenRefine’s import data tools to clean the blank lines out of the original pasted data:
But how do we get rid of the section headings, and use them as second column entries so we can see which area each address applies to?
Let’s start by filtering the data to show only the rows containing the headers, which we can identify because they are the only rows to contain a colon character. Then we can create a second column that duplicates these values.
Here’s how we create the new column, which we’ll call “Wards”; the cell contents are simply a duplicate of the original column.
If we delete the filter that was selecting rows where the Column 1 value included a colon, we get the original data back along with a second column.
Starting at the top of the column, the “Fill Down” cell operation will fill empty cells with the value of the cell above.
If we now add the “colon filter” back to Column 1, to just show the area rows, we can highlight all those rows, then delete them. We’ll then be presented with the two column data set without the area rows.
Let’s just tidy up the Wards column too, by getting rid of the colon. To do that, we can transform the cell…
…by replacing the colon with nothing (an empty string).
Here’s the data – neat and tidy:-)
To finish, let’s export the data.
How about sending it to a Google Fusion table (you may be asked to authenticate or verify the request).
And here it is:-)
So – that’s a quick example of some of the data cleaning tricks and operations that OpenRefine supports. There are many, many more, of course…;-)