OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘schoolofdata

ScreenScraping HTML Web Pages With OpenRefine – Norwegian Oil Company Data

[An old post, rescued from the list of previously unpublished posts...]

Although I use OpenRefine from time time, one thing I don’t tend to use it for is screenscraping HTML web pages – I tend to write Python scripts in Scraperwiki to do this. Writing code is not for everyone, however, so I’ve brushed off my searches of the OpenRefine help pages to come up with this recipe for hacking around with various flavours of company data.

The setting actually comes from OpenOil’s Johnny West:

1) given the companies in a particular spreadsheet… for example “Bayerngas Norge AS” (row 6)
2) plug them into the Norwegian govt’s company registry — http://www.brreg.no/ (second search box down nav bar on the left) – this gives us corporate identifier… so for example… 989490168
3) plug that into purehelp.no — so http://www.purehelp.no/company/details/bayerngasnorgeas/989490168
4) the Aksjonærer at the bottom (the shareholders that hold that company) – their percentages
5) searching OpenCorporates.com with those names to get their corporate identifiers and home jurisdictions
6) mapping that back to the spreadsheet in some way… so for each of the companies with their EITI entry we get their parent companies and home jurisdictions

Let’s see how far we can get…

To start with, I had a look at the two corporate search sites Johnny mentioned. Hacking around with the URLs, there seemed to be a couple of possible simplifications:

- looking up company ID can be constructed around http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS – the link structure has changed since I originally wrote this post, correct form is now http://w2.brreg.no/enhet/sok/treffliste.jsp?navn=Bayerngas+Norge+AS&orgform=0&fylke=0&kommune=0&barebedr=false [h/t/ Larssen in the comments.]

- http://www.purehelp.no/company/details/989490168 (without company name in URL) appears to work ok, so can get there from company number.

Loading the original spreadsheet data into OpenRefine gives us a spreadsheet that looks like this:

openRefine xls import

So that’s step 1…

We can run step 2 as follows* – create a new column from the company column:

* see the end of the post for an alternative way of obtaining company identifiers using the OpenCorporates reconciliation API…

openRefine add new col

Here’s how we construct the URL:

OpenRefine - get new col by URL

The HTML is a bit of a mess, but by Viewing Source on an example page, we can find a crib that leads us close to the data we require, specifically the fragment detalj.jsp?orgnr= in the URL of the first of the href attributes of the result links.

table to scrape - crib

Using that crib, we can pull out the company ID and the company name for the first result, constructing a name/id pair as follows:

[value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() , value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() ].join('::')

The first part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlAttr("href").replace('detalj.jsp?orgnr=','').toString() – pulls out the company ID from the first search result, extracting it from the URL fragment.

The second part – value.parseHtml().select("a[href^=detalj.jsp?orgnr=]")[0].htmlText() – pulls out the company name from the first search result.

We place these two parts into an array and then join them with two colons: [].join('::')

This keeps thing tidy and allows us to check by eye that sensible company names have been found from the original search strings.

open refine - compare names

We can now split the name/ID pair column into two separate columns:

openRefine spilt column into cols

And the result:

openrefne  cols now split

The next step, step 3, requires looking up the company IDs on purehelp. We’ve already see how a new column can be created from a source column by URL, so we just repeat that approach with a new URL pattern:

openrefine add another col by URL

(We could probably reduce the throttle time by an order of magnitude!)

The next step, step 4, is to pull out shareholders and their percentages.

The first step is to grab the shareholder table and each of the rows, which in the original looked like this:

shareholders table

The following hack seems to get us the rows we require:

[REMOVED]

BAH – crappy page sometimes has TWO companyOwnership IDs, when the company has shareholdings in other companies as well as when it has shareholders:-(

fckwt

So much for unique IDs… ****** ******* *** ***** (i.e. not happy:-(

Need to search into table where “Shareholders” is specified in top bar of the table, and I don’t know offhand how to do that using the GREL recipe I was taking because the HTML of the page is really horrible. Bah…. #ffs:-(

Question, in GREL, how do I get the rows in this not-a-table? I need to specify the companyOwnership id in the parent div, and check for the Shareholders text() value in the first child, then ideally miss the title row, then get all the shareholder companies (in this case, there’s just one; better example):

<div id="companyOwnership" class="box">
	<div class="boxHeader">Shareholders:</div>
	<div class="boxContent">
		<div class="row rowHeading">
			<label class="fl" style="width: 70%;">Company name:</label>
			<label class="fl" style="width: 30%;">Percentage share (%)</label>
			<div class="cb"></div>
		</div>
		<div class="row odd">
			<label class="fl" style="width: 70%;">Shell Exploration And Production Holdings</label>
			<div class="fr" style="width: 30%;">100.00%</div>
			<div class="cb"></div>
		</div>
	</div>

For now I’m going to take a risky shortcut and assume that the Shareholders (are there always shareholders?) are the last companyOwnership ID on the page:

forEach(value.parseHtml().select('div[id=companyOwnership]')[-1].select('div.row'),e,e).join('::')

openrefine last company ownership

We can then generate one row for each shareholder in OpenRefine:

open refine - spilt

(We’ll need to do some filling in later to cope with the gaps, but no need right now. We also picked up the table header, which has been given it’s own row, which we’ll have to cope with at some point. But again, no need right now.)

For some reason, I couldn’t parse the string for each row (it was late, I was confused!) so I hacked this piecemeal approach to try to take them by surprise…

value.replace(/\s/,' ').replace('<div class="row odd">','').replace('<div class="row even">','').replace('<form>','').replace('<label class="fl" style="width: 70%;">','').replace('<div class="cb"></div>','').replace('</form> </div>','').split('</label>').join('::')

horrible hack openrefine

Using the trick we previously applied to the combined name/ID column, we can split these into two separate columns, one for the shareholder and the other for their percentage holding (I used possibly misleading column names below – should say “Shareholder name”, for example, rather than shareholding 1?):

openrefine column split

We then need to tidy the two columns:

value.replace("<\/a>",'').replace(/.*>/,'')

Note that some of the shareholder companies have identifiers in the website we scraped the data from, and some don’t. We’re going to be wasteful and throw the info away that links the company if it’s there…

value.replace('<div class="fr" style="width: 30%;">','').replace('</div>','').strip()

We now need to do a bit more tidying – fill down on the empty columns in the shareholder company column and also in the original company name and ID [actually - this is not right, as we can see below for the line Altinex Oil Norway AS...? Maybe we can get away with it though?], and filter out the rows that were generated as headers (text facet then select out blank and Fimanavn).

This is what we get:

COmpany ownership

We can then export this file, before considering steps 5 and 6, using the custom exporter:

open refine exporter

Select the columns – including the check column of the name of the company we discovered by searching on the names given in the original spreadsheet… these are the names that the shareholders actually refer to…

column export

And then select the export format:

column export format

Here’s the file: shareholder data (one of the names at least appears not to have worked – Altinex Oil Norway AS). LOoking at the data, I think we also need to take the precaution of using .strip() on the shareholder names.

Here’s the OpenRefine project file to this point [note the broken link pattern for brreg noted at the top of the post and in the comments... The original link will be the one used in the OpenRefine project...]

Maybe export on a filtered version where Shareholding 1 is not empty. Also remove the percentage sign (%) in the shareholding 2 column? ALso note that Andre is “Other”… maybe replace this too?

In order to get the OpenCorporates identifiers, we should be able to just run company names through the OpenCorporates reconciliation service.

Hmmm.. I wonder – do we even have to go that far? From the Norwegian company number, is the OpenCorporates identifier just that number in the Norwegian namespace? So for BAYERNGAS NORGE AS, which has Norwegian company number 989490168, can we look it up directly on OpenCorporates as http://opencorporates.com/companies/no/989490168? It seems like we can…

This means we possibily have an alternative to step 2 – rather than picking up company numbers by searching into and scraping the Norwegian company register, we can reconcile names against the OpenCorporates reconciliation API and then pick up the company numbers from there?

Written by Tony Hirst

October 10, 2013 at 11:14 pm

To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info

In More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other? I described a visual route to finding out which local council candidates had supported each other on their nomination papers. There is also a thirty second route to that data that I should probably have mentioned;-)

From the Scraperwiki database, we need to interrogate the API:

scraperwiki api

To do this, we’ll use a database query language – SQL.

What we need to ask the database is which of the assentors (members of the support column) are also candidates (members of the candinit column, and just return those rows. The SQL command is simply this:

select * from support where support in (select candinit from support)

Note that “support” refers to two things here – these are columns:

select * from support where support in (select candinit from support)

and these are the table the columns are being pulled from:

select * from support where support in (select candinit from support)

Here’s the result of Runing the query:

sql select on scraperwiki

We can also get a direct link to a tabular view of the data (or generate a link to a CSV output etc from the format selector).

candidates mutual table

There are 15 rows in this result compared to the 15 edges/connecting lines discovered in the Gephi approach, so each method corroborates the other:

Tidier intra-candidate support map

Simples:-)

Written by Tony Hirst

May 8, 2013 at 10:50 am

More Storyhunting Around Local Elections Data Using Gephi – To What Extent Do Candidates Support Each Other?

In Questioning Election Data to See if It Has a Story to Tell I started to explore various ways in which we could start to search for stories in a dataset finessed out of a set of poll notices announcing the recent Isle of Wight Council elections. In this post, I’ll do a little more questioning, especially around the assentors (proposers, seconders etc) who supported each candidate, looking to see whether there are any social structures in there resulting from candidates supporting each others’ applications. The essence of what we’re doing is some simple social network analysis around the candidate/assentor network. (For an alternative route to the result, see To What Extent Do Candidates Support Each Other Redux – A One-Liner, Thirty Second Route to the Info.)

This is what we’ll be working towards:

Tidier intra-candidate support map

If you want to play along, you can get the data from my IW poll notices scrape on ScraperWiki, specifically the support table.

scraperwiki council elections - assentors

Here’s a reminder of what the original PDF doc looked like (archive copy):

IW poll notice assentors

Checking the extent to which candidates supported each other is something we could do by hand, looking down each candidate’s list of assentors for names of other candidates, but it would be a laborious job. It’s far easier(?!;-) to automate it…

When we want to compare names using a computer programme or script, the simplest approach is to do an exact string match (a string is a list of characters). Two strings match if they are exactly the same, so for example: This string is the same as This string, but not this string (they differ in their first character – upper case T in the first example as compared with lower case t in the last. We’ll be using exact string matching to identify whether a candidate has the same name as any of the assentors, so on the scraper, I did a little fiddling around with the names, in particular generating a new column that recasts the name of the candidate into the same presentation form used to identify the assentors (Firstname I. Lastname).

We can download a CSV representation of the data from the scraper directly:

Scraperwiki CSV download

The first thing I want to explore is the extent to which candidates support other candidates to see if we can identify any political groupings. The tool I’m going to use to visualise the data is Gephi, an open-source cross-platform application (requires Java) that you can download for free from gephi.org.

Gephi.org

To view the data in Gephi, it’s easiest if we rename a couple of columns so that Gephi can recognise relations between supporters and candidates; if we open the CSV download file in a text editor, we can rename the candinit as target and the column as Source to represent an arrow going from an assentor to a candidate, where the arrow reads something along the lines of “is a supporter of”.

csv rename

Start Gephi, select Data Laboratory tab and then New Project from the File menu.

geohi data lab new project

You should now see a toolbar that includes an “Import Spreadsheet option”:

gephi import spreadsheet

Import the CSV file as such, identifying it as an Edges Table:

import data into gephi data laboaratory

You should notice that the Source and Target columns have been identified as such and we have the choice to import the other column or not – let’s bring them in…

SOurce and Target recognised

You should now see the data has been loaded in to Gephi…

Data loaded in

If you click on the Overview tab button, you should see a mass of nodes/circles representing candidates and assentors with arrows going from assentors to candidates.

mess...

Let’s see how they connect – we can Run the Force Atlas 2 Layout algorithm for starters. I tweaked the Scaling value and ticked on Stronger Gravity to help shape the resulting layout:

force layout tweaks

If you look closely, you’ll be able to see that there are many separate groupings of connected circles – this represent candidates who are supported by folk who are not also candidates (sometimes a node sits on top of a line so it looks as if two noes are connected when in fact they aren’t…)

Close up simple patterns

However, there are also other groupings in which one candidate may support another:

candidate support

These connections may allow us to see grouping of candidates supporting each other along party lines.

One of the powerful things about Gephi is that it allows us to construct quite complex, nested filters that we can apply to the data based on the properties of the network the data describes so that we can focus on particular aspects of the network I’m going to filter the network so that it shows only those individuals who are supported by at least one person (in-degree 1 or more) and who support at least one person (out-degree one or more) – that is, folk who are candidates (in-degree 1 or more) who also supported (out degree 1 or more) another candidate. Let’s also turn labels on to see which candidates the filter identifies, and colour the edges along party lines. We can now see some information about the connectedness a little more clearly:

lots going on

Hmmm.. how about if we extend out filter to see who’s connected to these nodes (this might include other candidates who do not themselves assent to another candidate), and also rezise the nodes/labels so we can better see the candidates’ names. The Neigbours Network filter takes the nodes we have and then also finds the nodes that are connected to them to depth 2 in this case (that is, it brings in nodes connected to the candidates who are also supporters (depth 1), and the nodes connected to those nodes (depth two). Which is to say, it will being in the candidates who are supported by candidates, and their supporters:

A few more tweaks

That’s a bit clearer, but there are still overlapping lines, so it may make sense to layout the network again:

improve the layout

We can also experiment with other colourings – if we go to the Statistics panel, we can run a Connected Components filter that tries to find nodes that are connected into distinct groups. We can then colour each of the separate groups uniquely:

colour the groups

Let’s reset the colours and go back to colourings along party lines:

Gephi reset colours

If we go to the Preview view, we can generate a prettified view of the network:

Preview layout

In it, we can clearly see groupings along party lines (inside the blue boxes). There is something odd, though? There appears to be a connection between UKIP and Independent groupings? Let’s zoom in:

this is odd

Going back to the Graph view and zooming in, we see that Paul G. taylor appears to be supporting two candidates of different parties… Hmm – I wonder: are there actually two Paul G. Taylors, I wonder, with different political preferences? (Note to self: check on Electoral Commission website what regulations there are about assenting. Can you only assent to one person, and then only within the ward in which you are registered to vote? For local elections, could you be registered to vote in more than one electoral division within the same council area?)

To check that there are no other names that support more than one candidate, we can create another, simple filter that just selects nodes with out-degree 2 or more – that is, who support 2 or more other nodes:

Filter on nodes out degree 2

Just that one then…

Looking at the fuller chart, it’s still rather scruffy. We could tidy it by removing assentors who are not themselves candidates (that is, there are no arrows pointing in to them). The way Gephi filters work support chaining. If you look at the filters, you will see they are nested, much like a nested comment thread in a forum. Filters at the bottom of the tree act on the graph and pass the filtereed network to date up the tree to the next filter. This means we can pass the network as shown above into another filter layer that removes folk who are “just” assentors and not candidates.

nested filters

Here’s the result:

Nesting filters in gephi

And again we can go into Preview mode to generate a nice vectorised version of the graph:

Tidier intra-candidate support map

This quite clearly shows several mutual support networks between Labour candidates (red edges), Conservative candidates (blue edges), independents (black edges) and a large grouping of UKIP candidates (purple edges).

So there we have it a quick tour of how to use Gephi to look at the co-support structure of group of local election candidates. Were the highlighted candidates to be successful in their election, it could signify possible factions or groupings within the council, particular amongst the independents? Along the way we saw how to make use of filters, and spotted something we need to check (whether the same person supported two candidates (if that isn’t allowed?) or whether they are two different people sharing the same name.

If this all seems like too much effort, remembers that there’s always the One-Liner, Thirty Second Route to the Info.

PS by the by, a recent FOI request on WhatDoTheyKnow suggests another possible line of enquiry around possible candidates – if they have been elected to the council before, how good was their attendance record? (I don’t think OpenlyLocal scrapes this information? Presumably it is available somewhere on the council website?)

Written by Tony Hirst

May 8, 2013 at 9:05 am

Posted in Tinkering

Tagged with , ,

Questioning Election Data to See if It Has a Story to Tell

I know, I know, the local elections are old news now, but elections come round again and again, which means building up a set of case examples of what we might be able to do – data wise – around elections in the future could be handy…

So here’s one example of a data-related question we might ask (where in this case by data I mean “information available in: a) electronic form, that b) can be represented in a structured way): are the candidates standing in different seats local to that ward/electoral division?. By “local”, I mean – can they vote in that ward by virtue of having a home address that lays within that ward?

Here’s what the original data for my own local council (the Isle of Wight council, a unitary authority) looked like – a multi-page PDF document collating the Notice of polls for each electoral division (archive copy):

IW council - notice of poll

Although it’s a PDF, the document is reasonably nicely structured for scraping (I’ll do a post on this over the next week or two) – you can find a Scraperwiki scraper here. I pull out three sorts of data – information about the polling stations (the table at the bottom of the page), information about the signatories (of which, more in a later post…;-), and information about the candidates, including the electoral division in which they were standing (the “ward” column) and a home address for them, as shown here:

scraperwiki candidates

So what might we be able to do with this information? Does the home address take us anywhere interesting? Maybe. If we can easily look up the electoral division the home addresses fall in, we have a handful of news story search opportunities: 1) to what extent are candidates – and election winners – “local”? 2) do any of the parties appear to favour standing in/out of ward candidates? 3) if candidates are standing out of their home ward, why? If we complement the data with information about the number of votes cast for each candidate, might we be able to find any patterns suggestive of a beneficial or detrimental effect living within, or outside of, the electoral division a candidate is standing in, and so on.

In this post, I’ll describe a way of having a conversation with the data using OpenRefine and Google Fusion Tables as a way of starting to explore some the stories we may be able to tell with, and around, the data. (Bruce Mcphereson/Excel Liberation blog has also posted an Excel version of the methods described in the post: Mashing up electoral data. Thanks, Bruce:-)

Let’s get the data into OpenRefine so we can start to work it. Scraperwiki provides a CSV output format for each scraper table, so we can get a URL for it that we can then use to pull the data into OpenRefine:

scraperwiki CSV export

In OpenRefine, we can Create a New Project and then import the data directly:

openrefine import from URL

The data is in comma separated CSV format, so let’s specify that:

import as csv comma separated

We can then name and create the project and we’re ready to start…

…but start what? If we want to find out if a candidate lives in ward or out of ward, we either need to know whether their address is in ward or out of ward, or we need to find out which ward their address is in and then see if it is the same as the one they are standing in.

Now it just so happens (:-) that MySociety run a service called MapIt that lets you submit a postcode and it tells you a whole host of things about what administrative areas that postcode is in, including (in this case) the unitary authority electoral division.

mapit postcode lookup

And what’s more, MapIt also makes the data available in a format that’s data ready for OpenRefine to be able to read at a web address (aka a URL) that we can construct from a postcode:

mapit json

Here’s an example of just such a web address: http://mapit.mysociety.org/postcode/PO36%200JT

Can you see the postcode in there? http://mapit.mysociety.org/postcode/PO36%200JT

The %20 is a character encoding for a space. In this case, we can also use a +.

So – to get information about the electoral division an address lays in, we need to get the postcode, construct a URL to pull down corresponding data from MapIt, and then figure out some way to get the electoral division name out of the data. But one step at a time, eh?!;-)

Hmmm…I wonder if postcode areas necessarily fall within electoral divisions? I can imagine (though it may be incorrect to do so!) a situation where a division boundary falls within a postcode area, so we need to be suspicious about the result, or at least bear in mind that an address falling near a division boundary may be wrongly classified. (I guess if we plot postcodes on a map, we could look to see how close to the boundary line they are, because we already know how to plot boundary lines.

To grab the postcode, a quick skim of the addresses suggests that they are written in a standard way – the postcode always seems to appear at the end of the string preceded by a comma. We can use this information to extract the postcode, by splitting the address at each comma into an ordered list of chunks, then picking the last item in the list. Because the postcode might be preceded by a space character, it’s often convenient for us to strip() any white space surrounding it.

What we want to do then is to create a new, derived column based on the address:

Add derived column

And we do this by creating a list of comma separated chunks from the address, picking the last one (by counting backwards from the end of the list), and then stripping off any whitespace/space characters that surround it:

grab a postcode

Here’s the result…

postcodes...

Having got the postcode, we can now generate a URL from it and then pull down the data from each URL:

col from URL

When constructing the web address, we need to remember to encode the postcode by escaping it so as not to break the URL:

get data from URL

The throttle value slows down the rate at which OpenRefine loads in data from the URLs. If we set it to 500 milliseconds, it will load one page every half a second.

When it’s loaded in all the data, we get a new column, filled with data from the MapIt service…

lots of data

We now need to parse this data (which is in a JSON format) to pull out the electoral division. There’s a bit of jiggery pokery required to do this, and I couldn’t work it out myself at first, but Stack Overflow came to the rescue:

that's handy...

We need to tweak that expression slightly by first grabbing the areas data from the full set of MapIt data. Here’s the expression I used:

filter(('[' + (value.parseJson()['areas'].replace( /"[0-9]+":/,""))[1,-1] + ']' ).parseJson(), v, v['type']=='UTE' )[0]['name']

to create a new column containing the electoral division:

parse out the electroal division

Now we can create another column, this time based on the new Electoral Division column, that compares the value against the corresponding original “ward” column value (i.e. the electoral division the candidate was standing in) and prints a message saying whether they were standing in ward or out:

inward or out

If we collapse down the spare columns, we get a clearer picture:

collapse...

Like this:

summary data

If we generate a text facet on the In/Out column, and increase the number of rows displayed, we can filter the results to show just the candidates who stood in their local electoral division (or conversely, those who stood outside it):

facet on inout

We can also start to get investigative, and ask some more questions of the data. For example, we could apply a text facet on the party/desc column to let us filter the results even more…

inout facet filter

Hmmm… were most of the Labour Party candidates standing outside their home division (and hence unable to vote for themselves?!)

Hmm.. labour out

There aren’t too many parties represented across the Island elections (a text facet on the desc/party description column should reveal them all), so it wouldn’t be too hard to treat the data as a source, get paper and pen in hand, and write down the in/out counts for each party describing the extent to which they fielded candidates who lived in the electoral divisions they were standing in (and as such, could vote for themselves!) versus those who lived “outside”. This data could reasonably be displayed using a staggered bar chart (the data collection and plotting are left as an exercise for the reader [See Bruce Mcphereson's Mashing up electoral data post for a stacked bar chart view.];-) Another possible questioning line is how do the different electoral divisions fare in terms of in-vs-out resident candidates. If we pull in affluence/poverty data, might it tell us anything about the likelihood of candidates living in area, or even tell us something about the likely socio-economic standing of the candidates?

One more thing we could try to do is to geocode the postcode of the address of the each candidate rather more exactly. A blog post by Ordnance Survey blogger John Goodwin (@gothwin) shows how we might do this (note: copying the code from John’s post won’t necessarily work; WordPress has a tendency to replace single quotes with all manner of exotic punctuation marks that f**k things up when you copy and paste them into froms for use in other contexts). When we “Add column by fetching URLs”, we should use something along the lines of the following:

'http://beta.data.ordnancesurvey.co.uk/datasets/code-point-open/apis/search?output=json&query=' + escape(value,'url')

os postcode lookup

The data, as imported from the Ordnance Survey, looks something like this:

o:sdata

As is the way of national services, the Ordnance Survey returns a data format that is all well and good but isn’t the one that mortals use. Many of my geo-recipes rely on latitude and longitude co-ordinates, but the call to the Ordnance Survey API returns Eastings and Northings.

Fortunately, Paul Bradshaw had come across this problem before (How to: Convert Easting/Northing into Lat/Long for an Interactive Map) and bludgeoned(?!;-) Stuart harrison/@pezholio, ex- of Lichfield Council, now of the Open Data Institute, to produce a pop-up service that returns lat/long co-ordinates in exchange for a Northing/Easting pair.

The service relies on URLs of the form http://www.uk-postcodes.com/eastingnorthing.php?easting=EASTING&northing=NORTHING, which we can construct from data returned from the Ordnance Survey API:

easting northing lat -long

Here’s what the returned lat/long data looks like:

lat-long json

We can then create a new column derived from this JSON data by parsing it as follows
parse latlong to lat

A similar trick can be used to generate a column containing just the longitude data.

We can then export a view over the data to a CSV file, or direct to Google Fusion tables.

postcode lat long export

With the data in Google Fusion Tables, we can let Fusion Tables know that the Postcode lat and Postcode long columns define a location:2222

Fusion table edit column

Specifically, we pick either the lat or the long column and use it to cast a two column latitude and longitude location type:

fusion table config cols to location type

We can inspect the location data using a more convenient “natural” view over it…

fusion table add map

By applying a filter, we can look to see where the candidates for a particular ward have declared their home address to be:

havenstreet candidates

(Note – it would be more useful to plot these markers over a boundary line defined region corresponding to the area covered by the corresponding electoral ward. I don’t think Fusion Table lets you do this directly (or if it does, I don’t know how to do it..!). This workaround – FusionTablesLayer Wizard – on merging outputs from Fusion Tables as separate layers on a Google Map is the closest I’ve found following a not very thorough search;-)

We can go back to the tabular view in Fusion Tables to run a filter to see who the candidates were in a particular electoral division, or we can go back to OpenRefine and run a filter (or a facet) on the ward column to see who the candidates were:

refine filter by division

Filtering on some of the other wards using local knowledge (i.e. using the filter to check/corroborate things I knew), I spotted a couple of missing markers. Going back to the OpenRefine view of the data, I ran a facetted view on the postcode to see if there were any “none-postcodes” there that would in turn break the Ordnance Survey postcode geocoding/lookup:

postcode missing...

Ah – oops… It seems we have a “data quality” issue, although albeit a minor one…

So, what do we learn from all this? One take away for me is that data is a source we can ask questions of. If we have a story or angle in mind, we can tune our questions to tease out corroborating facts (possbily! caveat emptor applies!) that might confirm, helpdevelop, or even cause us to rethink, the story we are working towards telling based on the support the data gives us.

Written by Tony Hirst

May 5, 2013 at 11:38 pm

A Quick Peek at Some Charities Data…

As part of the Maximising Your Data Impact launch event of the Civil Society Data Network (reviewed in part by the Guardian here: Open data and the charity sector: a perfect fit), a series of mini-data expeditions provided participants with the opportunity to explore a range of data related questions relevant to the third sector. This was my first data exploration, and rather than deep dive into the data (which I probably should have done!) I was rather more interested in getting a feel for how potential data users actually discussed data related questions.

The topic my group was exploring was how to decide whether it made more sense to make a donation to a small or large charity. Unpacking this question a little forces a consideration of what the donor perceives to be a good use of their donation. To try to make the question more tractable, we also focussed on a particular scenario: how might a donor wishing to contribute to a charity related in some way to the hospice movement draw comparisons between them in order to inform their decision.

A scan of the Charity Commission website reveals that information is available for many charities relating to the size of the organisation as given by the number of staff or number of volunteers, as well as a certain amount of financial reporting:

charity commission comparison

Unfortunately, there doesn’t seem to be a way of comparing charities across a sector, nor does the data appear to be available as such. However, OpenCharities does make the data available in a rather more convenient form (that is, in a machine readable form) at web addresses/URLs of the form http://opencharities.org/charities/OPENCHARITIESID.json.

opencharities json

OpenCharities also publish a CSV file (a simple text based tabular data file) that contains crude summary information about UK registered charities, including such things as the charity name, its OpenCharities ID number, a summary description of each charity’s major activities, its registered address, and its social media profile information. If we grab this data, and pull from it the charities that are of interest to us, we can then use the OpenCharities IDs to create URLs from which we can pull the more detailed data.

Grabbing the CSV data file as charities.csv, we can filter out the rows containing items relating to hospices. From a Linux/Mac terminal command line, we can use the grep tool to grab rows that mention “hospice” somewhere in the line, and then create a new file using those rows appended to the column header row from the original file, putting the result into the file hospices.csv:

grep hospice charities.csv > tmp.csv
head -1 charities.csv | cat tmp.csv > hospice_charities.csv

and then upload the resulting filtered file to a Google spreadsheet.

Alternatively, we could load the whole OpenCharities CSV file into a tool such as OpenRefine and then filter the rows using the text filter on an appropriate column to select just rows mentioning “hospice” within a given column. Using the Custom Tabular Export, we could then upload the data directly to Google Spreadsheet.

Having got a list of hospice related charities into a Google Spreadsheet, we can Share the spreadsheet and also publish it (form the File menu). Publishing Google Spreadsheets makes the data available in a variety of formats, such as CSV data, via a web location/URL.

Our recipe so far is as follows:

- get CSV file of charities on OpenCharities;
- filter the file to get charities associated with hospices;
- upload the filtered file to Google Spreadsheets;
- publish the spreadsheet so we have it available as CSV at known URL.

The next step is to grab the more detailed data down from OpenCharities using the OpenCharities ID to construct the web addresses for where we can find that data for each hospice. We could use a tool such as OpenRefine to do this, but instead I’m going to write a short programme on Scraperwiki (OpenCharities Scraper) using the Python programming language to do the task.

#We need to load in some programme libraries to make life easier
import scraperwiki,csv,urllib2,json

#Here's where we load in the spreadsheet data published on Google Spreadsheets as a CSV file
charities = csv.DictReader((urllib2.urlopen('https://docs.google.com/spreadsheet/pub?key=0AirrQecc6H_vdFVlV0pyd3RVTktuR0xmTTlKY1gwZ3c&single=true&gid=1&output=csv')))

#This function will grab data about a charity from OpenCharities given its OpenCharities ID
def opencharitiesLookup(id):
    url = 'http://opencharities.org/charities/'+id+'.json'
    jsondata = json.load(urllib2.urlopen(url))
    return jsondata

#This routine cycles through each row of data/charity pulled from the spreadsheet
for charity in charities:
    #print charity

    #For each charity, I'm going to pull out several data fields that we'll then save to a database of our own
    #We start with some info from the spreadsheet - charity ID, name and activity summary
    data={'cid':charity['charity_number']}
    for tmp in ['title','activities']:
        data[tmp]=charity[tmp]

    #Here's where we pull in the more detailed data for each charity
    jdata = opencharitiesLookup(charity['charity_number'])

    #Then we start to go hunting for the data...
    chdata = jdata['charity']
    fdata = chdata['financial_breakdown']

    #The data will include employee and volunteer numbers..
    for tmp in ['volunteers','employees']:
        data[tmp] = chdata[tmp]

    #...as well as financial information
    for tmp in ['assets','spending','income']:
        if fdata != None and tmp in fdata:
            for tmp2 in fdata[tmp]:
                data[tmp+'_'+tmp2] = fdata[tmp][tmp2]
    #print data

    #Here's where we save all the data for a charity to a database
    scraperwiki.sqlite.save(unique_keys=['cid'], table_name='hospices', data=data)

When we run the scraper, it looks up the hospice data in the Google Spreadsheet, gets the richer data from OpenCharities, and pops it into a local database on Scraperwiki from where we can download the data:

scraperwiki opencharities

We can now download this enriched data as a CSV file and then, from Google Drive:

open a fusion table from drive

upload it to Google Fusion tables.

The recipe has now been extended as follows:

- pull the list of hospices into Scraperwiki from the Google Spreadsheet
- for each hospice, create an OpenCharities URL that points to the detailed data for that charity
- for each hospice, grab the corresponding data from that OpenCharities URL
- for each hospice, pull out the data elements we want and pop it into a Scraperwiki database
- download the database as a CSV file that now contains detailed data for each charity
- upload the detailed data CSV file to Google Fusion Tables

With the data in Google Fusion Tables, we can now start to analyse it and see what stories it might have to tell:

import csv to fusion table

Here it is…

data in fusion tables

We can then start to build charts around the data:

fusion tables add chart

Using the chart builder, you can specify which quantities to plot against each axis:

building up a fusion chart

(?I don’t know how to associate the title column (name of each charity) with points on the chart, so that when we hover over a point we can see which charity it relates to?)

We can also define out own calculations, much as you would in a spreadsheet.

fusion table add formula

For example, we might be interesting in knowing the ratio of income received from voluntary contributions versus the amount spent soliciting voluntary contributions. If this ratio is greater than 1, more money is spent soliciting these contributions than is received as a result. If the ratio is small (close to zero) then either the money spent soliciting voluntary contributions is effective, or voluntary income is being generated by other other means…

fusion table calc formula

We can then use one or more filters to explore which hospices meet different criteria. For example, how about organisations that seem to get a good voluntary income return versus spend on generating voluntary income, with the proviso that the voluntary income is less than £250,000:

build up filter

Along similar lines, we might want to calculate the ratio of volunteers to employees, and then then view the data for organisations with a certain number of employees (10 to 50, for example), or a certain number of volunteers (less than 500, say).

This is now the point at which story making – story generation, and data story telling – can start to kick in, based on the questions we want to ask of the data, or the stories we want to try to get it to tell us…

Written by Tony Hirst

May 1, 2013 at 11:06 am

More Open Data Frustrations – Unreadable Documentation from the DfE

One of the many things I’d like to spend my time doing is tinkering with data journalism doodles relating to local news stories. For example, via our local hyperlocal blog, I saw this post announced today: Isle of Wight has highest percentage of secondary school absentee rates in country. The post included a link to a Department for Education page (Pupil absence in schools in England, including pupil characteristics) containing links to the statistical release and the associated data sets:

dfes absentee data

Here’s what we get in the zipped datafile:

downloaded data

The school level dataset had the following column headings:

Year, country_code, country, GOR, GOR_code, LA, new_LA_code, LA_Name, URN, Estab, LAEstab, School_name, School_type, Academy_Flag, Academy_open_date, enrol_sum, SessionsPossible_sum, OverallAbsence_sum, AuthorisedAbsence_sum, UnauthorisedAbsence_sum, overall_absence_percent, auth_absence_percent, unauth_absence_percent, PA_15_sum, possible_sessions_pa_15_sum, overall_abs_pa_15_sum, auth_abs_pa_15_sum, unauth_abs_pa_15_sum, overall_absence_percent_PA_15, auth_absence_percent_PA_15, unauth_absence_percent_PA_15, sess_auth_illness, sess_auth_appointments, sess_auth_religious, sess_auth_study, sess_auth_traveller, sess_auth_holiday, sess_auth_ext_holiday, sess_auth_excluded, sess_auth_other, sess_auth_totalreasons, sess_auth_unclass, sess_unauth_holiday, sess_unauth_late, sess_unauth_other, sess_unauth_noyet, sess_unauth_totalreasons, sess_unauth_unclass, sess_overall_totalreasons

We can guess at what some of these refer to, but what, for example, do the “PA 15″ columns refer to? In this case, what we really should do is look up the actual definitions, which are described in the metadata description document; a document that just happens to be a Microsoft Word 2007 formatted document…

…a document that doesn’t play nicely either with the copy of Word I have on my Mac:

SOmetimesIPreferPDF

…or the converter that the Google docs uploader uses:

Google docs struggled too

In cases such as this, particularly where there are mathematical equations that often have very specific layout requirements, it can be “safer” to use a document format such as PDF that more reliably captures the appearance of the original page. (If we were really keen on reproducibility, we might also suggest that the equations were made available in an executable form, such as programme code or even as a spreadsheet (I’m not sure “Microsoft equations” are executable?).)

I gave myself a couple of hours to have a quick look through some of the data, but as it is I’ve spent an hour or so looking for ways of reading the metadata description document along with writing up my frustration around not being able to do so… Which is time spent not making sense of the data, or, indeed, its metadata…

PS in passing, I note the publication of the parliamentary Public Accounts Committee 37th report, Whole of Government Accounts 2010-11 again picks up on the way in which government data releases often fall short in terms of their usability (for example, this week MPs call for greater use of Whole of Government Accounts; see also last August Government must do better on transparency, say MPs).

PPS Here’s the solution I used in the end – Skydrive, Microsoft’s online storage/doc viewing play:

equations pt 1

equations pt 2

As it turns out, the equations could easily have been written using simple text strings…

PPPS as to the “15″ columns, the metadata files describes them along the following lines:

PA_15_sum Number of enrolments classed as persistent absentees (threshold of 15 per cent)

possible_session_pa_15_sum Sessions possible for persistent absentees (threshold of 15 per cent)

Which means what exactly?!

Written by Tony Hirst

April 11, 2013 at 5:48 pm

Posted in Anything you want, opengov

Tagged with

Splitting a Large CSV File into Separate Smaller Files Based on Values Within a Specific Column

One of the problems with working with data files containing tens of thousands (or more) rows is that they can become unwieldy, if not impossible, to use with “everyday” desktop tools. When I was Revisiting MPs’ Expenses, the expenses data I downloaded from IPSA (the Independent Parliamentary Standards Authority) came in one large CSV file per year containing expense items for all the sitting MPs.

In many cases, however, we might want to look at the expenses for a specific MP. So how can we easily split the large data file containing expense items for all the MPs into separate files containing expense items for each individual MP? Here’s one way using a handy little R script in RStudio

Load the full expenses data CSV file into RStudio (for example, calling the dataframe it is loaded into mpExpenses2012. Previewing it we see there is a column MP.s.Name identifying which MP each expense claim line item corresponds to:

mpexpenses2012R

We can easily pull out the unique values of the MP names using the levels command, and then for each name take a subset of the data containing just the items related to that MP and print it out to a new CSV file in a pre-existing directory:

mpExpenses2012 = read.csv("~/Downloads/DataDownload_2012.csv")
#mpExpenses2012 is the large dataframe containing data for each MP
#Get the list of unique MP names
for (name in levels(mpExpenses2012$MP.s.Name)){
  #Subset the data by MP
  tmp=subset(mpExpenses2012,MP.s.Name==name)
  #Create a new filename for each MP - the folder 'mpExpenses2012' should already exist
  fn=paste('mpExpenses2012/',gsub(' ','',name),sep='')
  #Save the CSV file containing separate expenses data for each MP
  write.csv(tmp,fn,row.names=FALSE)
}

Simples:-)

This technique can be used to split any CSV file into multiple CSV files based on the unique values contained within a particular, specified column.

Written by Tony Hirst

April 3, 2013 at 8:54 am

Posted in Rstats

Tagged with ,

Co-Director Network Data Files in GEXF and JSON from OpenCorporates Data via Scraperwiki and networkx

I’ve been tinkering with OpenCorporates data again, tidying up the co-director scraper described in Corporate Sprawl Sketch Trawls Using OpenCorporates (the new scraper is here: Scraperwiki: opencorporates trawler) and thinking a little about working with the data as a graph/network.

What I’ve been focussing on for now are networks that show connections between directors and companies, something we might imagine as follows:

comapny director netwrok

In this network, the blue circles represent companies and the red circles directors. A line connecting a director with a company says that the director is a director of that company. So for example, company C1 has just one director, specifically D1; and director D2 is director of companies C2 and C3, along with director D3.

It’s easy enough to build up a graph like this from a list of “company-director” pairs (or “relations”). These can be described using a simple two column data format, such as you might find in a simple database table or CSV (comma separated value) text file, where each row defines a separate connection:

Company Director
C1 D1
C2 D1
C2 D2
C2 D2
C3 D3
C3 D3

This is (sort of) how I’m representing data I’ve pulled from OpenCorporates, data that starts with a company seed, grabs the current directors of that target company, searches for other companies those people are directors of (using an undocumented OpenCorporates search feature – exact string matching on the director search (put the direction name in double quotes…;-), and then captures which of those companies share are least two directors with the original company.

In order to turn the data, which looks like this:

OpenCorporates data

into a map that resembles something like this (this is actually a view over a different dataset):

care uk sprawl

we need to do a couple of things. Working backwards, these are:

  1. use some sort of tool to generate a pretty picture from the data;
  2. get the data out of the table into the tool using an appropriate exchange format.

Tools include desktop tools such as Gephi (which can import data directly from a CSV file or database table), or graph viewers such as the sigma.js javascript library, or d3.js with an appropriate plugin.

Note that the positioning of the nodes in the visualisation may be handled in a couple of ways:

  • either the visualisation tool uses a layout algorithm to work out the co-ordinates for each of the nodes; or

  • the visualisation tool is passed a graph file that contains the co-ordinates saying where each node should be placed; the visualisation tool then simply lays out the graph using those provided co-ordinates.

The dataflow I’m working towards looks something like this:

opencorporates graphflow

networkx is a Python library (available on Scraperwiki) that makes it easy to build up representations of graphs simply by adding nodes and edges to a graph data structure. networkx can also publish data in a variety of handy exchange formats, including gexf (as used by Gephi and sigma.js), and a JSON graph representation (as used by d3.js and maybe sigma.js (example plugin?).

As a quick demo, I’ve built a scraperwiki view (opencorporates trawler gexf) that pulls on a directors_ table from my opencorporates trawler and then publishes the information either as gexf file (default) or as a JSON file using URLs of the form:

https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2 (gexf default)
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&output=json
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&output=gexf

This view can therefore be used to easily export data from my OpenCorporates trawler as a gexf file that can be used to easily import data into the Gephi desktop tool, or provide a URL to some JSON data that can be visualised using a Javscript library within a web page (I started doodling the mechanics of one example here: sigmajs test; better examples of what’s possible can be seen at Exploring Data and on the Oxford Internet Institute – Visualisations blog. If anyone would like to explore building a nice GUI to my OpenCorporates trawl data, feel free:-).

We can also use networks to publish data based on processing the network. The example graph above shows a netwrok with two sorts of nodes, connected by edges: company nodes and director nodes. This is a special sort of graph in that companies are only ever connected to directors, and directors are only ever connected to companies. That is, the nodes fall into one of two sorts – company or director – and they only ever connect “across” node type lines. If you look at this sort of graph (sometimes referred to as a bipartite or bimodal graph) for a while, you might be able to spot how you can fiddle with it (technical term;-) to get a different view over the data, such as those directors connected to other directors by virtue of being directors of the same company:

Director network

or those companies that are connected by virtue of sharing common directors:

company network

(Note that the lines/edges can be “weighted” to show the number of connections relating two companies or directors (that is, the number of companies that two directors are connected by, or the number of directors that two companies are connected by). We can then visually depict this weight using line/edge thickness.)

The networkx library conveniently provides functions for generating such views over the data, which can also be accessed via my scraperwiki view:

As the view is paramaterised via a URL, it can be used as a form of “glue logic” to bring data out of a directors table (which itself was populated by mining data from OpenCorporates in a particular way) and express it in a form that can be directly plugged in to a visualisation toolkit. Simples:-)

PS related: a templating system by Craig Russell for generating XML feeds from Google Spreadsheets – EasyOpenData.

Written by Tony Hirst

February 25, 2013 at 11:21 am

Posted in Data, Infoskills

Tagged with ,

Geocoding Using the Google Maps Geocoder via OpenRefine

On my to do list in the data education game is an open-ended activity based on collecting examples of how to do common tasks using a variety of different tools and techniques. One such task is geo-coding, getting latitude and longitude co-ordinates for a location based on it’s address or postcode. A couple of recent posts on the School of Data blog provide an Introduction to Geocoding and a howto on geocoding in Google spreadsheets. Here’s another approach, geocoding using OpenRefine [download and install OpenRefine] (see also Paul Bradshaw’s related post from a couple of years ago: Getting Full Addresses for Data from an FOI Response (Using APIs)).

First, we need to grab some data. I’m going to use a data file from the Food Standards Agency that contains food standards ratings for food outlets within a particular local council area. This data file is a relatively simple XML format that OpenRefine should have no trouble handling… Let’s get the data directly from the data file’s web address:

GetSomeDataIntoRefine

OpenRefine will now download the file and give us a preview. What we need to do now is highlight an example of a record element – OpenRefine will create one row for each such record in the original file.

Highlight the record element

(For the purposes of this example, it might have been easier to just select the postcode element within each establishment record.)

Click “Create Project” in the top right corner, and the data will be imported into a tabular view:

wehavedataIncluding postcodes

If we look along the columns, we notice a postcode element- we can use that as the basis for geocoding. (Yes, I know, I know, this dataset already has latitude and longitude data, so we don’t really need to get it for ourselves;-) But I thought this datafile was a handy source of postcodes;-) What we’re going to do now is ask OpenRefine to create a new column that will be populated with data from the Google maps geocoding API that gives the latitude and longitude for each postcode:

weshalladdacolumn

We can pass a postcode to the Google maps geocoder using a URL (web address) constructed as follows:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE"

(We could force the locale – that the postcode is in the UK – by using "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE,UK".)

*IF YOU WANT OPEN GEO-CODED DATA, GOOGLE MAPS WON’T CUT IT… *

Note also that we need to “escape” the postcode so that it works in the URL (this essentially means handling spaces in the postcode). To do this, we will construct a URL using the follow GREL programming language statement:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")

or alternatively:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")+",UK"

What this does is build up the URL from the value of the cells in the column from which our new column is derived. The escape(value,"url") command makes sure that the postcode value is correctly encoded for use in a url.

generate the geocoding URL

We can now let OpenRefine construct a URL containing the postcode for each row in the data file and load the requested data in from the URL (the throttle delay sets the rate at which OpenRefine will make repeated calls to the API – many APIs have a limit on both the rate at which you can call them (once a second, for example), as well as a limit on the total number of calls per hour, or day). The returned data will be added to the appropriate row in the new column:

ecample geo

The looks a bit confusing, doesn’t it? However, it does have some structure to it. Here’s how the data unpacks if we lay it out a bit better:

{ "results" : [
   { "address_components" : [
     { "long_name" : "NG4 4DL",
       "short_name" : "NG4 4DL",
       "types" : [ "postal_code" ] 
     },
     { "long_name" : "Gedling",
       "short_name" : "Gedling",
       "types" : [ "locality", "political" ]
     },
     { "long_name" : "Nottinghamshire",
       "short_name" : "Notts",
       "types" : [ "administrative_area_level_2", "political" ]
     },
     { "long_name" : "United Kingdom",
       "short_name" : "GB", "types" : [ "country", "political" ] 
     },
     { "long_name" : "Nottingham",
       "short_name" : "Nottingham",
       "types" : [ "postal_town" ] 
     }
     ],
     "formatted_address" : "Gedling, Nottinghamshire NG4 4DL, UK",
     "geometry" : {
         "bounds" : {
             "northeast" : { "lat" : 52.97819320, "lng" : -1.09096230 },
             "southwest" : { "lat" : 52.9770690, "lng" : -1.0924360 } 
         },
         "location" : {
           "lat" : 52.97764640,
           "lng" : -1.09182150 
         },
         "location_type" : "APPROXIMATE",
         "viewport" : {
           "northeast" : { "lat" : 52.97898008029150, "lng" : -1.090350169708498 },
           "southwest" : { "lat" : 52.97628211970850, "lng" : -1.093048130291502 }
         }
       },
       "types" : [ "postal_code" ] 
    }
 ], "status" : "OK" }

(For a little more insight into the structure of data like this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure.)

So OpenRefine has pulled back some data – we can now generate another column or two to extract the latitude and longitude the Google Maps geocoding API sent back:

colfromcol

We need to find a way of extracting the data element we want from the data we got back from the geocoding API – we can use another GREL programming language statement to do this:

openrefine jsonparse

Here’s how to construct that statement (try to read the following with reference to the laid out data structure shown above…):

  • value.parseJson() parses the data text (the value of the cell from the original column) into a data element we can work with programmatically
  • value.parseJson().results[0] gets the the data in the first “result” (computer programmes generally start a count with a 0 rather than a 1…)
  • value.parseJson().results[0]["geometry"] says get the “geometry” element from the first result;
  • value.parseJson().results[0]["geometry"]["location"] says get the location element from within that geometry element;;
  • value.parseJson().results[0]["geometry"]["location"]["lat"] says get the latitude element from within the corresponding location element.

Here’s what we get – a new column containing the latitude, as provided by the geocoder API from the postcode:

newcolgeo

Hopefully you can see how we would be able to create a new column containing the longitude: go back to the Googlegeo column, select Edit column – Add column based on this column for a second time, and construct the appropriate parsing statement (value.parseJson().results[0]["geometry"]["location"]["lng"]).

If OpenRefine is new to you, it’s worth reviewing what steps are involved in this recipe:

- loading in an XML data file from a web address/URL;
- generating a set of URLs that call a remote API/data source based on the contents of a particular column, and loading data from those URLs into a new column;
- parsing the data returned from the API to extract a particular data element and adding it to a new column.

Many of these steps are reusable in other contexts, of course…;-) (Can you think of any?!)

PS Tom Morris pointed out to me (on the School of Data mailing list) that the Google Maps geocoding API terms of use only allow the use of the API for geocoding points in order to locate them on a Google Map… Which is probably not what you’d want this data for…;-) However, the recipe is a general one – if you can find another geocoding service that accepts an escaped address term as part of a URL and returns geo-data in a JSON format, you should be able to work out how to use that API instead using this self-same recipe… (Just build a different URL and parse the returned JSON appropriately…)

I’ll try to keep a list of example API URLs for more open geocoders here:

http://open.mapquestapi.com/nominatim/v1/?format=json&q=London (no postcodes?)
http://mapit.mysociety.org/postcode/MK7%206AA (UK postcode geocoder)

Written by Tony Hirst

February 20, 2013 at 10:40 am

Reshaping Horse Import/Export Data to Fit a Sankey Diagram

As the food labeling and substituted horsemeat saga rolls on, I’ve been surprised at how little use has been made of “data” to put the structure of the food chain into some sort of context* (or maybe I’ve just missed those stories?). One place that can almost always be guaranteed to post a few related datasets is the Guardian Datastore, who use EU horse import/export data to produce interactive map of the European trade in horsemeat

*One for the to do list – a round up of “#ddj” stories around the episode.)

Guardian datablog - EU trade in horsemeat

(The article describes the source of the data as the Eurpoean Union Unistat statistics website, although I couldn’t find a way of recreating the Guardian spreadsheet from that source. When I asked Simon Rogers how he’d come by the data, he suggested putting questions into the Eurostat press office;-)

The data published by the Guardian datastore is a matrix showing the number of horse imports/exports between EU member countries (as well as major traders outside the EU) in 2012:

Guardian Datablog horsemeat importexport data

One way of viewing this data structure is as an edge weighted adjacency matrix that describes a graph (a network) in which the member countries are nodes and the cells in the matrix define edge weights between country nodes. The weighted edges are also directed, signifying the flow of animals from one country to another.

Thinking about trade as flow suggests a variety of different visualisation types that build on the metaphor of flow, such as a Sankey diagram. In a Sankey diagram, edges of different thicknesses connect different nodes, with the edge thickness dependent on the amount of “stuff” flowing through that connection. (The Guardan map above also uses edge thickness to identify trade volumes.) Here’s an example of a Sankey diagram I created around the horse export data:

Horse exports - EU - Sankey demo

(The layout is a little rough and ready – I was more interested in finding a recipe for creating the base graphic – sans design tweaks;-) – from the data as supplied.)

So how did I get to the diagram from the data?

As already mentioned, the data came supplied as an adjacency matrix. The Sankey diagram depicted above was generated by passing data in an appropriate form to the Sankey diagram plugin to Mike Bostock’s d3.js graphics library. The plugin requires data in a JSON data format that describes a graph. I happen to know that that the Python networkx library can generate an appropriate data object from a graph modeled using networkx, so I know that if I can generate a graph in networkx I can create a basic Sankey diagram “for free”.

So how can we create the graph from the data?

The networkx documentation describes a method – read_weighted_edgelist – for reading in a weighted adjacency matrix from a text file, and creating a network from it. If I used this to read the data in, I would get a directed network with edges going into and out of country nodes showing the number of imports and exports. However, I wanted to create a diagram in which the “import to” and “export from” nodes were distinct so that exports could be seen to flow across the diagram. The approach I took was to transform the two-dimensional adjacency matrix into a weighted edge list in which each row has three columns: exporting country, importing country, amount.

So how can we do that?

One way is to use R. Cutting and pasting the export data of interest from the spreadsheet and into a text file (adding in the missing first column header as I did so) gives a source data file that looks something like this:

horse export source data

In contrast, the edge list looks something like this:
reshaped horse data

So how do we get from one to the other?

Here’s the R script I used – it reads the file in, does a bit of fiddling to remove commas from the numbers and turn the result into integer based numbers, and then uses the melt function from the reshape library to generate the edge list, finally filtering out edges where there were no exports:

#R code

horseexportsEU <- read.delim("~/Downloads/horseexportsEU.txt")
require(reshape)
#Get a "long" edge list
x=melt(horseexportsEU,id='COUNTRY')
#Turn the numbers into numbers by removing the comma, then casting to an integer
x$value2=as.integer(as.character(gsub(",", "", x$value, fixed = TRUE) ))
#If we have an NA (null/empty) value, make it -1
x$value2[ is.na(x$value2) ] = -1
#Column names with countries that originally contained spaces convert spaces dots. Undo that. 
x$variable=gsub(".", " ", x$variable, fixed = TRUE)
#I want to export a subset of the data
xt=subset(x,value2>0,select=c('COUNTRY','variable','value2'))
#Generate a text file containing the edge list
write.table(xt, file="foo.csv", row.names=FALSE, col.names=FALSE, sep=",")

(Another way of getting a directed, weighted edge list from an adjacency table might be to import it into networkx from the weighted adjacency matrix and then export it as weighted edge list. R also has graph libraries available, such as igraph, that can do similar things. But then, I wouldn’t have go to show the “melt” method to reshaping data;-)

Having got the data, I now use a Python script to generate a network, and then export the required JSON representation for use by the d3js Sankey plugin:

#python code
import networkx as nx
import StringIO
import csv

#Bring in the edge list explicitly
#rawdata = '''"SLOVENIA","AUSTRIA",1200
#"AUSTRIA","BELGIUM",134600
#"BULGARIA","BELGIUM",181900
#"CYPRUS","BELGIUM",200600
#... etc
#"ITALY","UNITED KINGDOM",12800
#"POLAND","UNITED KINGDOM",129100'''

#We convert the rawdata string into a filestream
f = StringIO.StringIO(rawdata)
#..and then read it in as if it were a CSV file..
reader = csv.reader(f, delimiter=',')

def gNodeAdd(DG,nodelist,name):
    node=len(nodelist)
    DG.add_node(node,name=name)
    #DG.add_node(node,name=name)
    nodelist.append(name)
    return DG,nodelist

nodelist=[]

DG = nx.DiGraph()

#Here's where we build the graph
for item in reader:
    #Even though import and export countries have the same name, we create a unique version depending on
    # whether the country is the importer or the exporter.
    importTo=item[0]+'.'
    exportFrom=item[1]
    amount=item[2]
    if importTo not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,importTo)
    if exportFrom not in nodelist:
        DG,nodelist=gNodeAdd(DG,nodelist,exportFrom)
    DG.add_edge(nodelist.index(exportFrom),nodelist.index(importTo),value=amount)

json = json.dumps(json_graph.node_link_data(DG))
#The "json" serialisation can then be passed to a d3js containing web page...

Once the JSON object is generated, it can be handed over to d3.js. The whole script is available here: EU Horse imports Sankey Diagram.

What this recipe shows is how we can chain together several different tools and techniques (Google spreadsheets, R, Python, d3.js) to create a visualisation with too much effort (honestly!). Each step is actually quite simple, and with practice can be achieved quite quickly. The trick to producing the visualisation becomes one of decomposing the problem, trying to find a path from the format the data is in to start with, to a form in which it can be passed directly to a visualisation tool such as the d3js Sankey plugin.

PS In passing, as well as the data tables that can be searched on Eurostat, I also found the Eurostat Yearbook, which (for the most recent release at least), includes data tables relating to reported items:

Eurostat Yearbook

So it seems that the more I look, the more and more places seems to making data that appears in reports available as data

Written by Tony Hirst

February 18, 2013 at 10:31 am

Posted in Infoskills, Rstats

Tagged with , ,

Follow

Get every new post delivered to your Inbox.

Join 729 other followers