Tagged: semantic tagging

Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags

For what it’s worth, I’ve been looking over some of the programmes that the OU co-produces with the BBC to see what sorts of things we might be able to do in Linked Data space to make appropriate resources usefully discoverable for our students and alumni.

With a flurry of recent activity appearing on the wires relating to the OU Business School Alumni group on LinkedIn, the OU’s involvement with business related programming seemed an appropriate place to start: the repeating Radio 4 series The Bottom Line has a comprehensive archive of previous programmes available via iPlayer, and every so often a Money Programme special turns up on BBC2. Though not an OU/BBC co-pro, In Business also has a comprehensive online archive; this may contain the odd case study nugget that could be useful to an MBA student, so provides a handy way of contrasting how we might reuse “pure” BBC resources compared to the OU/BBC co-pros such as The Bottom Line.

Top tip [via Tom Scott/@derivadow]: do you know about the hack whereby [http://www.bbc.co.uk]/programmes/$string searches programme titles for $string?

So what to do? Here’s a starter for ten: each radio programme page on BBC /programmes seems to have a long, medium and short synopsis of the programme as structured data (simply add .json to the end of the programme URL to see the JSON representation of the data, .xml for the XML, etc.).

For example, http://www.bbc.co.uk/programmes/b00vy3l1 maps on to http://www.bbc.co.uk/programmes/b00vy3l1.json and http://www.bbc.co.uk/programmes/b00vy3l1.xml
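The mapping is mechanical enough to sketch in a couple of lines. The helper name representation_urls is made up for illustration, and only the .json and .xml suffixes mentioned above are assumed:

```python
def representation_urls(programme_url):
    """Return the JSON and XML URLs for a BBC /programmes page URL."""
    # Drop any trailing slash so the suffix attaches to the programme id
    base = programme_url.rstrip('/')
    return {'json': base + '.json', 'xml': base + '.xml'}


urls = representation_urls('http://www.bbc.co.uk/programmes/b00vy3l1')
```

Here urls['json'] and urls['xml'] hold the two machine-readable addresses for the programme page.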

Looking through the programme descriptions for The Bottom Line, they all seem to mention the names and corporate affiliations of that week’s panel members, along with occasional references to other companies. As the list of company names is to all intents and purposes a controlled vocabulary, and given that personal names are often identifiable from their syntactic structure, it’s no surprise that one of the best developed fields for automated term extraction and semantic tagging is business related literature. Which means that there are services out there that should be good at finessing/extracting high quality metadata from things like the programme descriptions for The Bottom Line.

The one I opted for was Reuters OpenCalais, simply because I’ve been meaning to play with this service for ages. To get a feel for what it can do, try pasting a block of text into this OpenCalais playground: OpenCalais Viewer demo

If you look at the extracted tags in the left hand sidebar, you’ll see that personal names and company names have been extracted, as well as the names of people paired with their corporate positions.

Here’s a quick script to grab the data from OpenCalais (free API key required) using the Python-Calais library:

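# Python 2 syntax (print statements, urllib.urlopen)
from calais import Calais
import urllib
from xml.dom import minidom

# Placeholder: substitute your own (free) OpenCalais API key
CALAIS_KEY = 'YOUR_API_KEY'
calais = Calais(CALAIS_KEY, submitter="python-calais ouseful")

# XML representation of a programme page on BBC /programmes
url = 'http://www.bbc.co.uk/programmes/b00vrxx0.xml'
dom = minidom.parse(urllib.urlopen(url))

# Text content of the first long_synopsis element
desc = dom.getElementsByTagName('long_synopsis')[0].childNodes[0].nodeValue

print desc

# Send the synopsis to OpenCalais for tagging
result = calais.analyze(desc)

# Human-readable summaries of what was found
result.print_entities()
result.print_relations()

# Raw data structures, for further processing
print result.entities
print result.simplified_response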

(I really need to find a better way of parsing XML in Python…what should I be using..? Or I guess I could have just grabbed the JSON version of the BBC programme page?!)
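One possible answer to my own question, as a minimal sketch: the standard library’s ElementTree is a bit less fiddly than minidom, and the json module handles the JSON route. The sample documents below are made-up, cut-down stand-ins (the real /programmes responses carry many more fields), and the assumption that the JSON synopsis sits under a top-level "programme" object would need checking against a live response. The function names are mine, purely for illustration:

```python
import json
import xml.etree.ElementTree as ET

# Hypothetical, cut-down stand-ins for the real /programmes responses
SAMPLE_XML = """<programme>
  <pid>b00vrxx0</pid>
  <long_synopsis>A made-up synopsis mentioning Acme Ltd.</long_synopsis>
</programme>"""

SAMPLE_JSON = ('{"programme": {"pid": "b00vrxx0", '
               '"long_synopsis": "A made-up synopsis mentioning Acme Ltd."}}')


def synopsis_from_xml(xml_text):
    """Return the text of the first long_synopsis element, or None."""
    root = ET.fromstring(xml_text)
    # iter() walks the whole tree, so the element is found wherever it sits
    for element in root.iter('long_synopsis'):
        return element.text
    return None


def synopsis_from_json(json_text):
    """Return programme.long_synopsis from the JSON text, or None."""
    # Assumption: the synopsis lives under a top-level "programme" object
    return json.loads(json_text).get('programme', {}).get('long_synopsis')


xml_desc = synopsis_from_xml(SAMPLE_XML)
json_desc = synopsis_from_json(SAMPLE_JSON)
```

Either string could then be handed to calais.analyze() in place of desc.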

That’s step one, then: grabbing a long synopsis from a BBC radio programme /programmes page, and running it through the OpenCalais tagging service. The next step is to run all the programmes through the tagger, and then have a play. A couple of things come to mind for starters – building a navigation scheme that lets you discover programmes by company name, or sector; and a network map looking at the co-occurrence of companies on different programmes just because…;-)
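As a rough sketch of where that might go: once each programme has a list of extracted company names, both the browse-by-company index and the co-occurrence counts for a network map fall out of a couple of loops. The tagged dictionary below is invented data standing in for real tagger output, and the function names are hypothetical:

```python
from collections import Counter, defaultdict
from itertools import combinations

# Invented example: programme id -> company names found in its synopsis
tagged = {
    'prog-1': ['Acme', 'Globex'],
    'prog-2': ['Acme', 'Initech', 'Globex'],
    'prog-3': ['Initech'],
}


def company_index(tagged):
    """Map each company name to the set of programmes that mention it."""
    index = defaultdict(set)
    for pid, companies in tagged.items():
        for company in companies:
            index[company].add(pid)
    return index


def cooccurrence(tagged):
    """Count how often each pair of companies shares a programme."""
    edges = Counter()
    for companies in tagged.values():
        # sorted() so (A, B) and (B, A) are counted as the same edge
        for pair in combinations(sorted(set(companies)), 2):
            edges[pair] += 1
    return edges


index = company_index(tagged)
edges = cooccurrence(tagged)
```

The index gives the navigation scheme (look up a company, get its programmes), while edges is a weighted edge list that could be fed to a graph drawing tool.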

See also: Linked Data Without the SPARQL – OU/BBC Programmes on iPlayer

Where is the Open University Homepage?

Several weeks ago, I was listening to one of the programmes delivered to me every week via my subscription to the IT Conversations podcast feed, when I came across this Technometria episode on Search Engine Marketing (if you have a daily commute, it’s well worth listening to on one of your trips this week…).

One of the comments that resonated quite strongly with me was that, to all intents and purposes, Google is now the de facto homepage for many institutions. It struck a chord in part because I’ve heard several people in the OU comms team asking over the last few months “what’s the point of the OU homepage?”

That is, this is the OU homepage for many people:


rather than this:

(As far as I know, very little of our online marketing sends traffic to the homepage – most campaigns send traffic to a URL deeper in the site more relevant to the particular campaign).

Just in passing, a post on Google Blogoscoped today – What Do People Search For? – picked up on an item from Search Engine Land describing a new tool from Google: Search based Keyword Tool.

What this tool does is to “suggest keywords based on actual Google search queries” that are “matched to specific pages of your website”:

Hmmm…. (and yes, that Savings Interest Rates page is on an OU domain…)

PS this search based keyword tool is also in the ball park of Google Trends, Google Insights for Search, and Google Trends for websites, which I’ve been playing with a lot recently (e.g. Playing with Google Search Data Trends and Recession, What Recession?), as well as the Google Adwords keywords tool:

which looks a lot more reasonable than the Search based Keyword tool?!

PPS Again in passing, and something I intend to pick up on a little more in a later post, Yahoo have just opened up a Key Terms service as part of the BOSS platform that will let you see the keywords that Yahoo has used to index a particular web page (Key Terms provide “an ordered terminological representation of what a document is about. The ordering of terms is based on each term’s frequency and its positional and contextual heuristics.”).

Services like Reuters’ OpenCalais already allow you to do ‘semantic tagging’ of free text, and Yahoo’s Term Extraction service also extracts keywords from text. I’m not sure how the BOSS exposed keywords compare with the keywords identified by the Term Extraction service as applied to a particular web page?

If I get a chance to run some tests, I’ll let you know, unless anyone can provide more info in the meantime?