Picking up on Political Representation on BBC Political Q&A Programmes – Stub, the quickest of hacks…
In OpenRefine, create a new project by importing data from a couple of URLs – data from the BBC detailing episode IDs for Any Questions and Question Time:
Import the data as XML, highlighting a single programme code row as the import element.
The data we get looks like this – /programmes/b007ck8s#programme – so we can add a column by fetching URLs, using an expression like 'http://www.bbc.co.uk'+value.split('#')[0]+'.json', to get JSON data back for each row.
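For anyone wanting to do the same thing outside OpenRefine, here is a minimal Python sketch of the URL construction. The function name programme_json_url is made up for illustration; the only inputs are the programme code format and the base URL shown above.

```python
def programme_json_url(code: str) -> str:
    """Build the JSON URL for a code such as '/programmes/b007ck8s#programme'."""
    # Keep only the path in front of the '#' fragment.
    path = code.split("#")[0]
    return "http://www.bbc.co.uk" + path + ".json"


print(programme_json_url("/programmes/b007ck8s#programme"))
# prints http://www.bbc.co.uk/programmes/b007ck8s.json
```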
Parse the JSON that comes back using something like value.parseJson()['programme']['medium_synopsis'] to create a new column containing the medium synopsis information.
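The same lookup can be sketched in Python. The sample response below is a hypothetical, heavily trimmed stand-in that only contains the keys used in this post (the real feed will carry much more), and the helper name medium_synopsis is my own.

```python
import json

# Hypothetical, trimmed response: only the keys used in this post are present.
sample_response = json.dumps({
    "programme": {
        "medium_synopsis": "Topical debate from Colchester, with David Dimbleby."
    }
})


def medium_synopsis(raw_json):
    """Return the medium synopsis, or None if the JSON or the keys are missing."""
    try:
        return json.loads(raw_json)["programme"]["medium_synopsis"]
    except (ValueError, KeyError, TypeError):
        # ValueError covers malformed JSON; KeyError/TypeError cover missing keys
        # or a blank cell where the fetch failed.
        return None


print(medium_synopsis(sample_response))
```

Returning None rather than raising roughly matches what you see in OpenRefine, where a failed fetch just leaves a blank cell.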
The medium synopsis elements typically look like: “Topical debate from Colchester, with David Dimbleby. On the panel are Peter Hain, Sir Menzies Campbell, Francis Maude, singer Beverley Knight and journalist Cristina Odone.” Which is to say, they often contain the names of the panellists.
We can try to extract the names contained within each synopsis using the Zemanta API (key required) accessed via the Named-Entity Recognition extension for Google Refine / OpenRefine.
These seem to come back in reconciliation API form with the name set to a name and the id to a URL. We can get a concatenated list of the URLs that are returned by creating a column around something like this: forEach(cell.recon.candidates,v,v.id).sort().join('||') but I’m not sure that’s useful.
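To make clear what that expression does, here is a rough Python equivalent. The candidates list is a hypothetical stand-in for cell.recon.candidates, and the example.org URLs are placeholders rather than anything the extension actually returned.

```python
# Hypothetical stand-in for cell.recon.candidates: each candidate has a name
# and an id, where the id is a URL. The URLs here are placeholders.
candidates = [
    {"name": "Peter Hain", "id": "http://example.org/entity/Peter_Hain"},
    {"name": "Francis Maude", "id": "http://example.org/entity/Francis_Maude"},
]


def joined_candidate_ids(candidates, sep="||"):
    """Collect every candidate id, sort them, and join with the separator."""
    return sep.join(sorted(c["id"] for c in candidates))


print(joined_candidate_ids(candidates))
```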
We can create a column based just on the matched entity's name using cell.recon.match.name.
Let’s use the row view and fill down on programme IDs, then have a look at a duplicate facet and view only rows that are duplicated (that is, where an extracted named entity appears more than once). We can also use a text facet to see which names appear in multiple episodes of Question Time and/or Any Questions.
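As a sketch of what the fill down and duplicate facet steps are doing, here is the same logic in plain Python. The rows and programme IDs are invented placeholders; the names are borrowed from the example synopsis.

```python
from collections import Counter

# Hypothetical rows as they appear in row view: the programme ID is only
# present on the first row of each record, and is blank (None) below it.
rows = [
    ("b0000001", "Peter Hain"),
    (None, "Francis Maude"),
    ("b0000002", "Peter Hain"),
    (None, "Cristina Odone"),
]


def fill_down(rows):
    """Copy the last seen programme ID into the blank cells beneath it."""
    filled, current = [], None
    for programme_id, name in rows:
        if programme_id is not None:
            current = programme_id
        filled.append((current, name))
    return filled


filled = fill_down(rows)

# Duplicate facet: keep only rows whose name occurs more than once.
counts = Counter(name for _, name in filled)
duplicated = [row for row in filled if counts[row[1]] > 1]
print(duplicated)
```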
Selecting a single name allows us to see the programmes that person appeared on. If we pull out the time of first broadcast (value.parseJson()['programme']['first_broadcast_date']) and Edit Cells-Common Transforms-To date, we can also use a date facet to select out programmes first broadcast within a particular date range.
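The date facet step amounts to parsing the timestamp and testing it against a range. A minimal sketch in Python, assuming the first_broadcast_date values are ISO 8601 strings (I haven't checked the exact format the BBC feed uses, and the timestamps below are made up):

```python
from datetime import datetime, timezone


def parse_broadcast(value):
    """Parse an ISO 8601 timestamp; a trailing 'Z' is treated as UTC."""
    parsed = datetime.fromisoformat(value.replace("Z", "+00:00"))
    # Assume UTC when no offset is given, so comparisons never mix
    # naive and timezone-aware values.
    if parsed.tzinfo is None:
        parsed = parsed.replace(tzinfo=timezone.utc)
    return parsed


def broadcast_between(value, start, end):
    """True if the broadcast time falls in the half-open range [start, end)."""
    return start <= parse_broadcast(value) < end


# Hypothetical date range and timestamp, for illustration only.
start = datetime(2012, 1, 1, tzinfo=timezone.utc)
end = datetime(2013, 1, 1, tzinfo=timezone.utc)
print(broadcast_between("2012-06-14T22:35:00Z", start, end))
```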
We can also run a text filter to limit records to episodes including a particular person and then use the Date facet to highlight the episodes in which they appeared on the timeline:
What this suggests is that we can use OpenRefine as some sort of ‘application shell’ for creating information tools around a particular dataset without actually having to build UI components ourselves?
If we custom export a table using programme IDs and matched names, and then rename the columns Source and Target, we can visualise them in something like Gephi (you can use the recipe described in the second part of this walkthrough: An Introduction to Mapping Company Networks Using Gephi and OpenCorporates, via OpenRefine).
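The file Gephi needs is just a two column edge list. As a sketch, assuming programme IDs go in the Source column and names in the Target column (my reading of the step above; the IDs are placeholders):

```python
import csv
import io


def edge_list_csv(pairs):
    """Serialise (programme_id, name) pairs as a Source,Target edge list."""
    buffer = io.StringIO()
    writer = csv.writer(buffer, lineterminator="\n")
    writer.writerow(["Source", "Target"])
    writer.writerows(pairs)
    return buffer.getvalue()


# Hypothetical pairs: placeholder programme IDs with matched names.
print(edge_list_csv([("b0000001", "Peter Hain"), ("b0000001", "Francis Maude")]))
```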
The directed graph we load into Gephi connects entities (participant names, location names) with programme IDs. There is a handy tool – Multimode Networks Projection – that can collapse the graph so that entities are connected to other entities that they shared a programme ID with.
(If you forget to remove the programme nodes, a degree range filter to select only nodes with degree greater than 2 tidies the graph up.)
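To show the idea behind the projection (this is a sketch of the general technique, not how the Gephi plugin is implemented), we can group entities by programme and then link every pair that shares one. The function name and the tiny edge list are invented for illustration.

```python
from collections import defaultdict
from itertools import combinations


def project_entities(edges):
    """Collapse (programme, entity) edges into weighted entity-entity edges.

    The weight of an edge is the number of programmes the two entities share.
    """
    by_programme = defaultdict(set)
    for programme, entity in edges:
        by_programme[programme].add(entity)

    weights = defaultdict(int)
    for entities in by_programme.values():
        # Sorting gives each undirected pair a single canonical key.
        for a, b in combinations(sorted(entities), 2):
            weights[(a, b)] += 1
    return dict(weights)


# Hypothetical bipartite edges: programmes p1 and p2, entities A, B and C.
edges = [("p1", "A"), ("p1", "B"), ("p2", "A"), ("p2", "B"), ("p2", "C")]
print(project_entities(edges))
```

Because the programme nodes never make it into the output, there is nothing left to filter out afterwards.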
If we run PageRank on the graph (now as an undirected graph), lay it out using ForceAtlas2 and size nodes according to PageRank, we can look into the heart of the UK political establishment as evidenced by appearances on Question Time and Any Questions.
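For a feel of what the PageRank score is rewarding, here is a bare-bones power iteration over an undirected graph. It is a sketch of the textbook algorithm, not Gephi's implementation; the damping factor of 0.85 is the commonly used default, the iteration count is arbitrary, and the star-shaped example graph is invented.

```python
def pagerank(adjacency, damping=0.85, iterations=50):
    """Power-iteration PageRank.

    adjacency maps each node to the set of its neighbours; every neighbour
    must also appear as a key. For an undirected graph, list each edge in
    both directions.
    """
    nodes = list(adjacency)
    n = len(nodes)
    rank = {node: 1.0 / n for node in nodes}
    for _ in range(iterations):
        # Every node starts each round with the "random jump" share.
        new_rank = {node: (1.0 - damping) / n for node in nodes}
        for node in nodes:
            neighbours = adjacency[node]
            if neighbours:
                share = damping * rank[node] / len(neighbours)
                for neighbour in neighbours:
                    new_rank[neighbour] += share
            else:
                # Isolated node: spread its rank evenly over all nodes.
                for other in nodes:
                    new_rank[other] += damping * rank[node] / n
        rank = new_rank
    return rank


# Hypothetical star graph: A shared a programme with B, C and D.
star = {"A": {"B", "C", "D"}, "B": {"A"}, "C": {"A"}, "D": {"A"}}
ranks = pagerank(star)
print(max(ranks, key=ranks.get))
```

On an undirected graph the scores end up tracking degree fairly closely, so the best connected panellists come out on top.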
The next step would probably be to try to pull info about each recognised entity from DBpedia (forEach(cell.recon.candidates,v,v.id).sort() seems to pull out DBpedia URIs) but grabbing data from DBpedia seems to be borked in my version of OpenRefine atm:-(
Anyway – a quick hack that took longer to write up than it did to do…
OpenRefine project file here.