
Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags

For what it’s worth, I’ve been looking over some of the programmes that the OU co-produces with the BBC to see what sorts of things we might be able to do in Linked Data space to make appropriate resources usefully discoverable for our students and alumni.

With a flurry of recent activity appearing on the wires relating to the OU Business School Alumni group on LinkedIn, the OU’s involvement with business related programming seemed an appropriate place to start: the repeating Radio 4 series The Bottom Line has a comprehensive archive of previous programmes available via iPlayer, and every so often a Money Programme special turns up on BBC2. Though not an OU/BBC co-pro, In Business also has a comprehensive online archive; this may contain the odd case study nugget that could be useful to an MBA student, so provides a handy way of contrasting how we might reuse “pure” BBC resources compared to the OU/BBC co-pros such as The Bottom Line.

Top tip [via Tom Scott/@derivadow]: do you know about the hack whereby [http://.bbc.co.uk]/programmes/$string searches programme titles for $string?

So what to do? Here’s a starter for ten: each radio programme page on BBC /programmes seems to have a long, medium and short synopsis of the programme as structured data (simply add .json to the end of the programme URL to see the JSON representation of the data, .xml for the XML, etc.).

For example, http://www.bbc.co.uk/programmes/b00vy3l1 maps on to http://www.bbc.co.uk/programmes/b00vy3l1.json and http://www.bbc.co.uk/programmes/b00vy3l1.xml
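The suffix convention is easy to wrap in a small helper. This is only a minimal sketch (written for Python 3); the function name `representation_urls` is made up for illustration and is not part of any BBC API:

```python
def representation_urls(programme_url):
    """Return the JSON and XML representation URLs for a /programmes page URL."""
    # Drop any trailing slash so the suffix attaches to the programme ID itself
    base = programme_url.rstrip('/')
    return {'json': base + '.json', 'xml': base + '.xml'}


urls = representation_urls('http://www.bbc.co.uk/programmes/b00vy3l1')
print(urls['json'])
print(urls['xml'])
```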

Looking through the programme descriptions for The Bottom Line, they all seem to mention the names and corporate affiliations of that week’s panel members, along with occasional references to other companies. As the list of company names is to all intents and purposes a controlled vocabulary, and given that personal names are often identifiable from their syntactic structure, it’s no surprise that one of the best developed fields for automated term extraction and semantic tagging is business-related literature. Which means that there are services out there that should be good at extracting high quality metadata from things like the programme descriptions for The Bottom Line.

The one I opted for was Reuters OpenCalais, simply because I’ve been meaning to play with this service for ages. To get a feel for what it can do, try pasting a block of text into this OpenCalais playground: OpenCalais Viewer demo

If you look at the extracted tags in the left hand sidebar, you’ll see that personal names and company names have been extracted, as well as pairings of people with their corporate positions.

Here’s a quick script to grab the data from OpenCalais (free API key required) using the Python-Calais library:

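# Python 2 script (print statements, urllib.urlopen)
from calais import Calais
import urllib
from xml.dom import minidom

# Placeholder: replace with your own (free) OpenCalais API key
CALAIS_KEY = 'YOUR_API_KEY'
calais = Calais(CALAIS_KEY, submitter="python-calais ouseful")

# XML representation of a BBC /programmes page
url = 'http://www.bbc.co.uk/programmes/b00vrxx0.xml'
dom = minidom.parse(urllib.urlopen(url))

# Text of the first <long_synopsis> element
desc = dom.getElementsByTagName('long_synopsis')[0].childNodes[0].nodeValue

print desc

# Send the synopsis to OpenCalais for tagging
result = calais.analyze(desc)

result.print_entities()
result.print_relations()

print result.entities
print result.simplified_response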

(I really need to find a better way of parsing XML in Python…what should I be using..? Or I guess I could have just grabbed the JSON version of the BBC programme page?!)
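One possible answer to the XML question is the standard library’s xml.etree.ElementTree, or simply the json module for the JSON version. The sketch below is written for Python 3 (unlike the Python 2 script above) and runs against inline placeholder strings rather than live BBC data. The function names are made up for illustration, and two things are assumptions rather than confirmed facts: that the XML elements carry no namespace, and that the JSON puts the synopses under a top-level "programme" object.

```python
import json
import xml.etree.ElementTree as ET


def synopsis_from_xml(xml_text, tag='long_synopsis'):
    """Return the text of the first element named `tag`, or None if absent."""
    root = ET.fromstring(xml_text)
    # Assumption: elements are not namespaced, so the bare tag name matches
    for element in root.iter(tag):
        return (element.text or '').strip()
    return None


def synopsis_from_json(json_text, key='long_synopsis'):
    """Return the synopsis stored under `key`, or None if absent."""
    data = json.loads(json_text)
    # Assumption: the synopses live inside a top-level "programme" object
    return data.get('programme', {}).get(key)


# Placeholder documents standing in for the real .xml and .json responses
sample_xml = '<programme><long_synopsis>Placeholder synopsis text.</long_synopsis></programme>'
sample_json = '{"programme": {"long_synopsis": "Placeholder synopsis text."}}'

print(synopsis_from_xml(sample_xml))
print(synopsis_from_json(sample_json))
```

Either function returns a plain string that could be handed to calais.analyze() in place of desc.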

That’s step one, then: grabbing a long synopsis from a BBC radio programme /programmes page, and running it through the OpenCalais tagging service. The next step is to run all the programmes through the tagger, and then have a play. A couple of things come to mind for starters – building a navigation scheme that lets you discover programmes by company name, or sector; and a network map looking at the co-occurrence of companies on different programmes just because…;-)
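To make the two ideas a little more concrete, here is a rough Python 3 sketch of a company-to-programmes index (for navigation) and a count of how often pairs of companies turn up on the same programme (the edge weights for a network map). It assumes the tagger output has already been boiled down to a list of company names per programme; the programme IDs and company names are invented placeholders, not real extracted data, and the function names are hypothetical.

```python
from collections import Counter, defaultdict
from itertools import combinations


def index_by_company(programme_companies):
    """Map each company name to the set of programmes that mention it."""
    index = defaultdict(set)
    for programme_id, companies in programme_companies.items():
        for company in companies:
            index[company].add(programme_id)
    return index


def cooccurrence_counts(programme_companies):
    """Count how many programmes each unordered pair of companies shares."""
    counts = Counter()
    for companies in programme_companies.values():
        # Sort and de-duplicate so each pair has one canonical (a, b) form
        for pair in combinations(sorted(set(companies)), 2):
            counts[pair] += 1
    return counts


# Placeholder data: programme ID -> company names found in its synopsis
tagged = {
    'programme-1': ['Company A', 'Company B', 'Company C'],
    'programme-2': ['Company A', 'Company B'],
    'programme-3': ['Company C'],
}

index = index_by_company(tagged)
edges = cooccurrence_counts(tagged)

print(sorted(index['Company A']))
print(edges.most_common())
```

The pair counts can be written out as a weighted edge list for whatever graph tool takes your fancy.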

See also: Linked Data Without the SPARQL – OU/BBC Programmes on iPlayer