When I get a chance, I’ll post a (not totally unsympathetic) response to Milo Yiannopoulos’ post The pitiful cult of ‘data journalism’, but in the meantime, here’s a view over some data that was released a couple of days ago – a map of where the New Year Honours went [link]
[Hmm… so WordPress.com doesn’t seem to want to let me embed a Google Fusion Table map iframe, and Google Maps (which are embeddable) just shows an empty folder when I try to view the Fusion Table KML… (the Fusion Table export KML doesn’t seem to include lat/lng data either? Maybe I need to explore some hosting elsewhere this year…]
Note that I wouldn’t make the claim that this represents an example of data journalism. It’s a sketch map showing which parts of the country various recipients of honours this time round presumably live. Just by posting the map, I’m not reporting any particular story. Instead, I’m trying to find a way of looking at the day to see whether or not there may be any interesting stories that are suggested by viewing the data in this way.
There was a small element of work involved in generating the map view, though… Working backwards, when I used Google Fusion tables to geocode the locations of the honoured, some of the points were incorrectly located:
(It would be nice to be able to force a locale to the geocoder, maybe telling it to use maps.google.co.uk as the base, rather than (presumably) maps.google.com?)
The approach I took to tidying these was rather clunky, first going into the table view and filtering on the mispositioned locations:
Then correcting them:
What would be really handy would be if Google Fusion Tables let you see a tabular view of data within a particular map view – so for example, if I could zoom in to the US map and then get a tabular view of the records displayed on that particular local map view… (If it does already support this and I just missed it, please let me know via the comments..;-)
…so I used Scraperwiki to preview and read through the PDF and extract the honours list data (my scraper is a little clunky and doesnlt pull out 100% of the data, missing the occasional name and contribution details when it’s split over several lines; but I think it does a reasonable enough job for now, particularly as I am currently more interested in focussing on the possible high level process for extracting and manipulating the data, rather than the correctness of it…!;-)
Here’s the scraper (feel free to improve upon it….:-): Scraperwiki: New Year Honours 2012
I then did a little bit of tweaking in Google Refine, normalising some of the facets and crudely attempting to separate out each person’s role and the contribution for which the award was made.
For example, in the case of Dr Glenis Carole Basiro DAVEY, given column data of the form “The Open University, Science Faculty and Health Education and Training Programme, Africa. For services to Higher and Health Education.“, we can use the following expressions to generate new sub-columns:
– value.match(/.*(For .*)/) to pull out things like “For services to Higher and Health Education.”
– value.match(/(.*)For .*/) to pull out things like “The Open University, Science Faculty and Health Education and Training Programme, Africa.”
I also ran each person’s record through Reuters Open Calais service using Google Refine’s ability to augment data with data from a URL (“Add column by fetching URLs”), pulling the data back as JSON. Here’s the URL format I used (polling once every 500ms in order to stay with the max. 4 calls per limit threshold mandated by the API.)
"http://api.opencalais.com/enlighten/rest/?licenseID=<strong>MY_LICENSE_KEY</strong>&content=" + escape(value,'url') + "¶msXML=%3Cc%3Aparams%20xmlns%3Ac%3D%22http%3A%2F%2Fs.opencalais.com%2F1%2Fpred%2F%22%20xmlns%3Ardf%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%22%3E%20%20%3Cc%3AprocessingDirectives%20c%3AcontentType%3D%22TEXT%2FRAW%22%20c%3AoutputFormat%3D%22Application%2FJSON%22%20%20%3E%20%20%3C%2Fc%3AprocessingDirectives%3E%20%20%3Cc%3AuserDirectives%3E%20%20%3C%2Fc%3AuserDirectives%3E%20%20%3Cc%3AexternalMetadata%3E%20%20%3C%2Fc%3AexternalMetadata%3E%20%20%3C%2Fc%3Aparams%3E"
Unpicking this a little:
– licenseID is set to my license key value
– content is the URL escaped version of the text I wanted to process (in this case, I created a new column from the name column that also pulled in data from a second column (the contribution column). The GREL formula I used to join the columns took the form: value+', '+cells["contribution"].value)
– paramsXML is the URL encoded version of the following parameters, which set the content encoding for the result to be JSON (the default is XML):
<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#"> <c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="Application/JSON" > </c:processingDirectives> <c:userDirectives> </c:userDirectives> <c:externalMetadata> </c:externalMetadata> </c:params>
So much for process – now where are the stories? That’s left, for now, as an exercise for the reader. An obvious starting point is just to see who received honours in your locale. Remember, Google Fusion Tables lets you generate all sorts of filtered views, so it’s not too hard to map where the MBEs vs OBEs are based, for example, or have a stab at where awards relating to services to Higher Education went. Some awards also have a high correspondence with a particular location, as for example in the case of Enfield…
If you do generate any interesting views from the New Year Honours 2012 Fusion Table, please post a link in the comments. And if you find a problem with/fix for the data or the scraper, please post that info in a comment too:-)