Although not an OU/BBC co-pro, the “get some academics in to chat to Melvyn” format of BBC Radio 4’s In Our Time means that the OU has, over the years, had a handful of academics appearing on the programme. As I’ve been mulling over opportunities for playing with the BBC programmes linked data (no RDF required), I wondered how easy it would be to grab the programmes that OU academics have appeared on. For example, it’s increasingly possible to see programmes associated with particular places (h/t to @gothwin for that; see his post on A Crude BBC Places Linked Data mashup for an application of that data), although the organisations listing is still a bit sparse.
Looking through the programme data, the participants in a programme are listed separately, but not their affiliations. However, in the free text that is used in the long synopsis of the programme, a convention exists to identify the guests, with affiliation or short bio, who appeared on that particular programme:
In the post Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags, I described how the Thomson Reuters’ OpenCalais entity extraction/semantic tagging service could be used to augment the BBC programme data with additional data fields based on analysis of the supplied text. One of the extraction services identifies a set of related fields termed PersonCareer, which detail (where possible) the name of a person, their role, and the organisation they work for. The convention used to list the guests on each programme is appropriate for the extraction of PersonCareer data, at least in some cases.
Rather more reliable is the extraction of University names as Facility data types. What this means is that we can tag each programme with a list of Facilities relating to the universities represented by guests on the programme, and then, where a PersonCareer is extracted, attempt to text match the PersonCareer/Organization name with the extracted Facility name. (Sample code is available here. I had “issues” with character encodings, so there is an element of hackery involved:-( In order to aggregate data from across programmes in the series, I built up a network of programmes and participating institutions using a NetworkX representation, which then gets dumped to output files in a variety of graph formats.)
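To make the matching step a bit more concrete, here is a minimal sketch rather than the actual sample code. It assumes the Calais results have already been parsed into a list of Facility names and a list of PersonCareer dicts; the `person` and `organization` keys, the function names and the example data are all made up for illustration. It uses only the standard library and produces a plain edge list, which could then be loaded into a NetworkX graph (for example with `Graph.add_edges_from`).

```python
def normalise(name):
    """Lower-case and collapse whitespace so trivially different names compare equal."""
    return " ".join(name.lower().split())


def programme_edges(programme, facilities, person_careers):
    """Return (source, target) edges for a single programme.

    facilities: Facility names extracted for the programme.
    person_careers: dicts with hypothetical 'person' and 'organization' keys.
    """
    edges = [(programme, facility) for facility in facilities]
    known = {normalise(f): f for f in facilities}
    for career in person_careers:
        org = career.get("organization")
        if not org:
            continue
        # Use the extracted Facility name where the organisation text matches;
        # otherwise keep the organisation as an institution node of its own.
        institution = known.get(normalise(org), org)
        if institution not in facilities:
            edges.append((programme, institution))
        edges.append((career["person"], institution))
    return edges


# Made-up example data, not taken from the real programme feed.
edges = programme_edges(
    "Example Programme",
    ["Open University", "University of Oxford"],
    [
        {"person": "A. N. Academic", "organization": "open  university"},
        {"person": "B. Scholar", "organization": "Some College"},
    ],
)
for source, target in edges:
    print(source, "--", target)
```

The exact-match-after-normalisation rule is deliberately crude; anything it misses ends up as a separate institution node, which is one source of the mismatches discussed later in the post.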
Here’s an example of the output, filtered to show programmes and programme tags (from the BBC data, rather than Calais extracted tags) that had some sort of association with the Open University:
The above diagram is actually a filtered view over the whole programme’n’university representation network using the Gephi ego network filter:
Node sizing is related to degree in this sub-network, and nodes are coloured according to node type (person, institution, tag, programme). The graph shows programmes that an OU academic appeared on, and (where possible) which OU academic, by name. Programme tags from the BBC programme data are also shown, as are other institutions that appeared on the same programmes as the OU.
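For anyone who would rather do the ego network filtering in code than in Gephi, the idea can be sketched in a few lines. This is an illustration only: the depth of 2 is my assumption (the OU, then its programmes, then whatever else is attached to those programmes), and the edge list and function names are invented for the example.

```python
from collections import defaultdict


def build_adjacency(edges):
    """Turn an undirected edge list into a node -> set-of-neighbours mapping."""
    adjacency = defaultdict(set)
    for a, b in edges:
        adjacency[a].add(b)
        adjacency[b].add(a)
    return adjacency


def ego_network(adjacency, ego, depth=1):
    """Return the sub-network of nodes within `depth` hops of `ego`."""
    keep = {ego}
    frontier = {ego}
    for _ in range(depth):
        # Expand one hop outwards, ignoring nodes we have already kept.
        frontier = {n for node in frontier for n in adjacency.get(node, set())} - keep
        keep |= frontier
    # Only keep edges whose two ends are both inside the sub-network.
    return {node: adjacency.get(node, set()) & keep for node in keep}


# Made-up edges: programmes linked to institutions and tags.
example_edges = [
    ("Programme A", "Open University"),
    ("Programme A", "University X"),
    ("Programme A", "tag: history"),
    ("Programme B", "University X"),
    ("Programme B", "University Y"),
]

ou_ego = ego_network(build_adjacency(example_edges), "Open University", depth=2)

# Degree within the sub-network, which is what the node sizing is based on.
degrees = {node: len(neighbours) for node, neighbours in ou_ego.items()}
print(degrees)
```

Note that the degree is computed inside the filtered sub-network, so an institution that appears on many programmes overall can still show up as a small node here if it only shared one programme with the ego.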
Here’s a snapshot of the full graph – you’ll notice there is some mismatch* in references between the universities mentioned that could possibly be reconciled using a string similarity technique or maybe running the data through Google Refine and using one or more of its string similarity/reconciliation tools.
* things are actually even more pathological: in some cases, I think that Oxbridge Colleges may be identified in PersonCareer metadata as the career organisation, rather than the university affiliation, which may well have been recognised as a Facility. If an organisation identified in a PersonCareer is not one of the Facilities that has already been identified and added to the graph, the organisation is also added. The question we’re left with is: do the errors, such as they are, make this graph, such as it is, completely useless, or is it better than nothing and something we can work with and improve incrementally as and how we can? [UPDATE: related maybe? Making Linked Data work isn’t the problem]
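As a rough idea of what the string similarity route might look like, here is a small sketch using `difflib` from the Python standard library. It is not what Google Refine does, just one simple way of folding near-duplicate names together; the 0.85 threshold and the example names are assumptions picked for illustration.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    """Case-insensitive similarity ratio between two names (0.0 to 1.0)."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def reconcile(names, threshold=0.85):
    """Map each name to the first previously seen name it closely resembles."""
    canonical = []
    mapping = {}
    for name in names:
        match = next(
            (c for c in canonical if similarity(name, c) >= threshold), None
        )
        if match is None:
            # Nothing similar enough so far: treat this as a new institution.
            canonical.append(name)
            match = name
        mapping[name] = match
    return mapping


# Made-up variants of the kind that turn up as separate nodes in the graph.
names = [
    "University of Oxford",
    "The University of Oxford",
    "University of Oxford.",
    "Open University",
]
mapping = reconcile(names)
for name, canonical_name in mapping.items():
    print(repr(name), "->", repr(canonical_name))
```

A character-based measure like this only helps with small spelling and punctuation differences. Reordered forms such as "Oxford University" would probably need a token-based comparison, and the college-versus-university problem described above is not a string matching problem at all.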
I’m not sure what the next step should be, but linking the OU ego-graph into the OU Linked Data would be one way forward. For example, displaying papers in ORO authored by appearing academics, or trying to relate programmes to related courses on the OU course catalogue (or, even though they are not indexed in the OU Linked Data store, courses on OpenLearn). A big problem with brokering the Linked Data connections is that I’d have to do free text/regular expression searches on the OU Linked Data store using terms from the BBC/OpenCalais data. That is, there are no common unique identifiers/URIs that can be used as “proper” linking terms:-(