Mapping the New Year Honours List – Where Did the Honours Go?

When I get a chance, I’ll post a (not totally unsympathetic) response to Milo Yiannopoulos’ post The pitiful cult of ‘data journalism’, but in the meantime, here’s a view over some data that was released a couple of days ago – a map of where the New Year Honours went [link]

[Hmm… so WordPress.com doesn’t seem to want to let me embed a Google Fusion Table map iframe, and Google Maps (which is embeddable) just shows an empty folder when I try to view the Fusion Table KML… (The Fusion Table export KML doesn’t seem to include lat/lng data either?) Maybe I need to explore some hosting elsewhere this year…]

Note that I wouldn’t make the claim that this represents an example of data journalism. It’s a sketch map showing the parts of the country in which various recipients of honours this time round presumably live. Just by posting the map, I’m not reporting any particular story. Instead, I’m trying to find a way of looking at the data to see whether or not there may be any interesting stories that are suggested by viewing the data in this way.

There was a small element of work involved in generating the map view, though… Working backwards, when I used Google Fusion Tables to geocode the locations of the honoured, some of the points were incorrectly located:

(It would be nice to be able to force a locale to the geocoder, maybe telling it to use maps.google.co.uk as the base, rather than (presumably) maps.google.com?)

The approach I took to tidying these was rather clunky, first going into the table view and filtering on the mispositioned locations:

Then correcting them:

What would be really handy would be if Google Fusion Tables let you see a tabular view of data within a particular map view – so for example, if I could zoom in to the US map and then get a tabular view of the records displayed on that particular local map view… (If it does already support this and I just missed it, please let me know via the comments..;-)
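In the absence of that feature, one rough way to approximate a "records within this map view" table outside Fusion Tables is a simple bounding-box filter over the geocoded rows. The sketch below is illustrative only: the row layout (`lat`/`lng` keys), the sample rows and the approximate UK bounding box are all assumptions of mine rather than anything taken from the actual table.

```python
# Approximate bounding box for the UK (illustrative values, not authoritative).
UK_BOUNDS = {"south": 49.8, "north": 60.9, "west": -8.7, "east": 1.8}


def in_bounds(lat, lng, bounds):
    """Return True if the point falls inside the bounding box."""
    return (bounds["south"] <= lat <= bounds["north"]
            and bounds["west"] <= lng <= bounds["east"])


def rows_outside(rows, bounds):
    """Return the rows whose geocoded point lies outside the box."""
    return [row for row in rows
            if not in_bounds(row["lat"], row["lng"], bounds)]


# Hypothetical sample rows: the same place name geocoded to two countries.
rows = [
    {"name": "Example A", "location": "Enfield", "lat": 51.65, "lng": -0.08},
    {"name": "Example B", "location": "Enfield", "lat": 41.98, "lng": -72.59},
]

# Rows that probably need their location correcting by hand.
suspects = rows_outside(rows, UK_BOUNDS)
for row in suspects:
    print(row["name"], row["location"], row["lat"], row["lng"])
```

Swapping in a different bounding box (for example, one covering just the US) would give the tabular view of whatever landed in that part of the map.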

So how did I get the data into Google Fusion Tables? The original data was posted as a PDF on the DirectGov website (New Year Honours List 2012 – in detail)…:

…so I used Scraperwiki to preview and read through the PDF and extract the honours list data (my scraper is a little clunky and doesn’t pull out 100% of the data, missing the occasional name and contribution details when it’s split over several lines; but I think it does a reasonable enough job for now, particularly as I am currently more interested in focussing on the possible high level process for extracting and manipulating the data, rather than the correctness of it…!;-)

Here’s the scraper (feel free to improve upon it….:-): Scraperwiki: New Year Honours 2012

I then did a little bit of tweaking in Google Refine, normalising some of the facets and crudely attempting to separate out each person’s role and the contribution for which the award was made.

For example, in the case of Dr Glenis Carole Basiro DAVEY, given column data of the form “The Open University, Science Faculty and Health Education and Training Programme, Africa. For services to Higher and Health Education.“, we can use the following expressions to generate new sub-columns:

– value.match(/.*(For .*)/)[0] to pull out things like “For services to Higher and Health Education.”
– value.match(/(.*)For .*/)[0] to pull out things like “The Open University, Science Faculty and Health Education and Training Programme, Africa.”
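For anyone who would rather do the same split outside Google Refine, here is a minimal Python sketch of the idea. It is not what I actually ran; the function name is made up for illustration, and it assumes (as the GREL expressions do) that the contribution starts at the last occurrence of “For ” in the text.

```python
import re


def split_role_and_contribution(text):
    """Split an honours citation into (role, contribution).

    The greedy (.*) means the split happens at the LAST "For " in the
    text, mirroring the greedy match in the GREL expressions.
    """
    match = re.match(r"(.*)(For .*)", text, flags=re.DOTALL)
    if match is None:
        # No "For ..." clause found: treat the whole text as the role.
        return text.strip(), ""
    return match.group(1).strip(), match.group(2).strip()


citation = ("The Open University, Science Faculty and Health Education and "
            "Training Programme, Africa. For services to Higher and Health "
            "Education.")
role, contribution = split_role_and_contribution(citation)
print(role)
print(contribution)
```

Rows with no “For …” clause come back with an empty contribution rather than raising an error, which makes them easy to facet on and fix by hand.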

I also ran each person’s record through Reuters’ Open Calais service using Google Refine’s ability to augment data with data from a URL (“Add column by fetching URLs”), pulling the data back as JSON. Here’s the URL format I used (polling once every 500ms in order to stay within the max. 4 calls per second threshold mandated by the API):

"http://api.opencalais.com/enlighten/rest/?licenseID=MY_LICENSE_KEY&content=" + escape(value,'url') + "&paramsXML=%3Cc%3Aparams%20xmlns%3Ac%3D%22http%3A%2F%2Fs.opencalais.com%2F1%2Fpred%2F%22%20xmlns%3Ardf%3D%22http%3A%2F%2Fwww.w3.org%2F1999%2F02%2F22-rdf-syntax-ns%23%22%3E%20%20%3Cc%3AprocessingDirectives%20c%3AcontentType%3D%22TEXT%2FRAW%22%20c%3AoutputFormat%3D%22Application%2FJSON%22%20%20%3E%20%20%3C%2Fc%3AprocessingDirectives%3E%20%20%3Cc%3AuserDirectives%3E%20%20%3C%2Fc%3AuserDirectives%3E%20%20%3Cc%3AexternalMetadata%3E%20%20%3C%2Fc%3AexternalMetadata%3E%20%20%3C%2Fc%3Aparams%3E"

Unpicking this a little:

– licenseID is set to my license key value
– content is the URL escaped version of the text I wanted to process (in this case, I created a new column from the name column that also pulled in data from a second column (the contribution column). The GREL formula I used to join the columns took the form: value+', '+cells["contribution"].value)
– paramsXML is the URL encoded version of the following parameters, which set the output format of the result to JSON (the default is XML):

<c:params xmlns:c="http://s.opencalais.com/1/pred/" xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">
<c:processingDirectives c:contentType="TEXT/RAW" c:outputFormat="Application/JSON"  >
</c:processingDirectives>
<c:userDirectives>
</c:userDirectives>
<c:externalMetadata>
</c:externalMetadata>
</c:params>
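The same URL can be put together outside Google Refine. The following Python sketch only builds the URL string and makes no request; the function name is hypothetical, the XML is the parameter block shown above with the whitespace between elements dropped, and the endpoint is the one used in this post, which may not remain available.

```python
from urllib.parse import quote

# Endpoint as used in this post; it may no longer be available.
CALAIS_ENDPOINT = "http://api.opencalais.com/enlighten/rest/"

# Processing directives asking for JSON output (the default is XML).
PARAMS_XML = (
    '<c:params xmlns:c="http://s.opencalais.com/1/pred/" '
    'xmlns:rdf="http://www.w3.org/1999/02/22-rdf-syntax-ns#">'
    '<c:processingDirectives c:contentType="TEXT/RAW" '
    'c:outputFormat="Application/JSON"></c:processingDirectives>'
    '<c:userDirectives></c:userDirectives>'
    '<c:externalMetadata></c:externalMetadata>'
    '</c:params>'
)


def build_calais_url(name, contribution, license_key):
    """Build the request URL for one record (no request is sent)."""
    # Same join as the GREL formula: value + ', ' + contribution.
    content = name + ", " + contribution
    # safe="" percent-encodes every reserved character, including "/".
    return (CALAIS_ENDPOINT
            + "?licenseID=" + quote(license_key, safe="")
            + "&content=" + quote(content, safe="")
            + "&paramsXML=" + quote(PARAMS_XML, safe=""))


url = build_calais_url(
    "Dr Glenis Carole Basiro DAVEY",
    "For services to Higher and Health Education.",
    "MY_LICENSE_KEY",
)
print(url)
```

If you do go on to fetch these URLs in a loop, pausing for half a second between calls (for example with time.sleep(0.5)) would mirror the 500ms throttle delay I set in Google Refine.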

So much for process – now where are the stories? That’s left, for now, as an exercise for the reader. An obvious starting point is just to see who received honours in your locale. Remember, Google Fusion Tables lets you generate all sorts of filtered views, so it’s not too hard to map where the MBEs vs OBEs are based, for example, or have a stab at where awards relating to services to Higher Education went. Some awards also have a high correspondence with a particular location, as for example in the case of Enfield…

If you do generate any interesting views from the New Year Honours 2012 Fusion Table, please post a link in the comments. And if you find a problem with/fix for the data or the scraper, please post that info in a comment too:-)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

3 thoughts on “Mapping the New Year Honours List – Where Did the Honours Go?”

I thought about responding to that post myself but decided there were better things to do than give links to a linkbaiter who doesn’t bother to actually present any facts to respond to (the irony of criticising the quality of journalism while not actually doing anything beyond looking at a few pages of the Datablog and talking generally about his own perceptions). There’s perhaps a broader point about threatened journalists playing down the importance of interrogating data, or the dangers of letting the public loose on data, but I’m still waiting for that to rear its head.

Tony Hirst says:

January 3, 2012 at 10:44 am

@Paul Re: the linkbaiting, I did consider a rel=”nofollow” on that link, but in the end let it pass. I should have considered the downstream link implications from content I syndicate, so thanks for, by implication, giving me that to consider. (A lot of posts I pass downstream heavily link back to this blog, which is not an intentional SEO strategy, more a feature of my posting style. When you syndicate posts on OJB, could/do you automatically no-follow any links they contain? What might the implications of that neutering approach be in terms of the “weight” or authority of posts on OJB, and the way it contributes to search engine rankings?)

I’ll go into reasons why I’m not totally unsympathetic to @nero’s post in another post; suffice it to say I think it taps into: confusion between presentation, process and the perception of what “data driven journalism” actually amounts to; confusion about infographics vs. statistical graphics; confusion about ‘process’ rather than ‘presentation’ graphics (e.g. the generation of charts for use in an iterative visual analysis process rather than the generation of fixed summary presentation/report charts that illustrate something that was discovered through the ddj process); the fact that a lot of data-powered interactive news graphics are more shiny than useful to the reader (e.g. Michael Blastland at the OU Stats Conference last year – https://blog.ouseful.info/2011/05/18/quick-summary-of-opening-session-of-visualisation-and-presentation-in-statistics/ ), but tied with that the fact that generating insight from visual stats is often the result of a dynamic process, the interplay of question and visual answer as we represent and re-view data from various points of view; and so on.

It also hints at something that intrigues me – how folk without statistical training (and I include myself in that group) can make use of powerful statistical techniques, in particular using visual methods/statistical graphics, to make sense of data and extract meaningful pattern and structure from it, as well as that journalistic imperative of identifying outliers. A wide variety of stats techniques are often applicable only under certain conditions, so there is a very real danger of misapplying such techniques. (I may do this myself…) So one thing I think we need to work on is visual checks and balances, where we might apply visual techniques in tandem to cross-validate the fact that we are applying them appropriately.

Amanda Cox made an observation that really struck me regarding the use of visual methods with the intention of throwing charts and graphics away/the importance of sketching (3-4 mins in to “R at The New York Times Graphics Department, pt. 1” on http://infosthetics.com/archives/2011/12/amanda_cox_talks_about_developing_infographics_at_the_new_york_times_graphics.html). Most of the stuff I produce on this blog I’d class as sketches, snapshots grabbed within a process that I never take as far as the creation of a finished presentation (info)graphic. But then, it’s the process of discovery that intrigues me, the tools and techniques that can be used to help us spot the anomaly or the storyful feature. And then maybe it’s another technique that we need to use to communicate that story.

Enough of that for now… I need to ponder what it means for me to add in a “nofollow” and what the mores of syndicating (auto)link-rich content actually are;-)