From Linked Data to Linked Applications?

Pondering how to put together some Docker IPython magic for running arbitrary command line functions in arbitrary docker containers (this is as far as I’ve got so far), I think the commands must include a couple of things:

  1. the name of the container (perhaps rooted in a particular repository): psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example;
  2. the actual command to be called: for example, one of the contentmine commands: getpapers -q {QUERY} -o {OUTPUTDIR} -x

We might also optionally specify mount directories shared between the calling and called containers, using a conventional default otherwise.
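To make that concrete, here's a rough sketch of the sort of thing I have in mind; the %dockerrun magic name, the argument handling and the default mount point are all hypothetical:

import os
import shlex
import subprocess

from IPython.core.magic import register_line_magic

@register_line_magic
def dockerrun(line):
    """Run a one-off command in a named container image, e.g.
    %dockerrun psychemedia/contentmine getpapers -q aardvark -o /data -x"""
    parts = shlex.split(line)
    image, command = parts[0], parts[1:]
    # --rm treats the container as disposable; ./data:/data is a conventional default mount
    args = ['docker', 'run', '--rm', '-v', os.path.abspath('data') + ':/data', image] + command
    return subprocess.check_output(args)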

This got me thinking that the called functions might be viewed as operating in a namespace (psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example). And this in turn got me thinking about “big-L, big-A” Linked Applications.

According to Tim Berners Lee’s four rules of Linked Data, the web of data should:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

So how about a web of containerised applications, that would:

  1. Use URIs as names for container images
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information (in the minimal case, this corresponds to a Dockerhub page, for example; in a user-centric world, this could just return a help file identifying the commands available in the container, along with help for individual commands)
  4. Include a Dockerfile, so that they can discover what the application is built from (it may also link to other Dockerfiles).

Compared with Linked Data, where the idea is about relating data items one to another, the identifying HTTP URI here actually represents the ability to make a call into a functional execution space. Linkage into the world of linked web resources might be provided through Linked Data relations that specify that a particular resource was generated from an instance of a Linked Application, or that the resource can be manipulated by an instance of a particular application.

So for example, files linked to on the web might have a relation that identifies the filetype, and the filetype is linked by another relation that says it can be opened in a particular linked application. Another file might link to a description of the workflow that created it, and the individual steps in the workflow might link to function/command identifiers that are linked to linked applications through relations that associate particular functions with a particular linked application.

Workflows may be defined generically, and then instantiated within a particular experiment. So for example: load file with particular properties, run FFT on particular columns, save output file becomes instantiated within a particular run of an experiment as load file with this URI, run the FFT command from this linked application on particular columns, save output file with this URI.
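By way of illustration only (this isn't any particular workflow standard, and the application names and file URIs are made up), the generic description and its instantiation might look something like:

# Invented structure, just to illustrate the generic vs. instantiated distinction
generic_workflow = [
    {'op': 'load_file', 'params': {'filetype': 'csv'}},
    {'op': 'fft', 'app': 'SOME/LINKED-APP', 'params': {'columns': '?'}},
    {'op': 'save_file', 'params': {}}
]

experiment_run = [
    {'op': 'load_file', 'params': {'uri': 'http://example.org/data/run1.csv'}},
    {'op': 'fft', 'app': 'dockerhub::psychemedia/fft-tools', 'params': {'columns': [2, 3]}},
    {'op': 'save_file', 'params': {'uri': 'http://example.org/data/run1-fft.csv'}}
]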

Hmm… thinks.. there is a huge amount of work already done in the area of automated workflows and workflow execution frameworks/environments for scientific computing. So this is presumably already largely solved? For example, Integrating Containers into Workflows: A Case Study Using Makeflow, Work Queue, and Docker, C. Zheng & D. Thain, 2015 [PDF]?

A handful of other quick points:

  • the model I’m exploring in the Docker magic context is essentially a stateless/serverless computing approach, where a commandline container is created on demand and treated as disposable, just running a particular function before being destroyed (see also the OpenAPI approach).
  • The Linked Application notion extends to other containerised applications, such as ones that expose an HTML user interface over http that can be accessed via a browser. In such cases, things like WSDL (or WADL; remember WADL?) provided a machine readable formalised way of describing functional resource availability.
  • In the sense that commandline containerised Linked Applications are actually services, we can also think about web services publishing an http API in a similar way?
  • services such as Sandstorm, which have the notion of self-running containerised documents, have the potential to bind a specific document to an interactive execution environment for that document.

Hmmm… so how much nonsense is all of the above, then?

Handling RDF on Your Own System – Quick Start

One of the things that I think tends towards being a bit of an elephant in the Linked Data room is the practical difficulty of running a query that links together results from two different datastores, even if they share common identifiers. The solution – at the moment at least – seems to require grabbing a dump of both datastores, uploading them to a common datastore and then querying that…

…which means you need to run your own triple store…

This quick post links out to the work of two others, as much as a placeholder for myself as for anything, describing how to get started doing exactly that…

First up, John Goodwin, aka @gothwin, (a go to person if you ever have dealings with the Ordnance Survey Linked Data) on How can I use the Ordnance Survey Linked Data: a python rdflib example. As John describes it:

[T]his post shows how you just need rdflib and Python to build a simple linked data mashup – no separate triplestore is required! RDF is loaded into a Graph. Triples in this Graph reference postcode URIs. These URIs are de-referenced and the RDF behind them is loaded into the Graph. We have now enhanced the data in the Graph with local authority area information. So as well as knowing the postcode of the organisations taking part in certain projects we now also know which local authority area they are in. Job done! We can now analyse funding data at the level of postcode, local authority area and (as an exercise for the reader) European region.
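(A minimal sketch of the pattern John describes, using rdflib – the data URL is a placeholder, and the postcode test is just a crude string match:)

from rdflib import Graph

g = Graph()
# Load some RDF that references Ordnance Survey postcode URIs (placeholder URL)
g.parse('http://example.org/funding-data.rdf')

# Dereference each postcode URI mentioned in the data and pull its RDF into the same graph
for obj in set(g.objects()):
    if 'postcodeunit' in str(obj):
        g.parse(str(obj))

# The graph now spans both datasets, so a single query can link across them
query = '''PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT ?s ?label WHERE { ?s rdfs:label ?label } LIMIT 10'''
for row in g.query(query):
    print row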

Secondly, if you want to run a fully blown triple store on your own localhost, check out this post from Jeni Tennison, aka @jenit, (a go to person if you’re using the data.gov.uk Linked Datastores, or have an interest in the Linked Data JSON API): Getting Started with RDF and SPARQL Using 4store and RDF.rb, which documents how to get started on the following challenges (via Richard Pope’s Linked Data/RDF/SPARQL Documentation Challenge):

Install an RDF store from a package management system on a computer running either Apple’s OSX or Ubuntu Desktop.
Install a code library (again from a package management system) for talking to the RDF store in either PHP, Ruby or Python.
Programmatically load some real-world data into the RDF datastore using either PHP, Ruby or Python.
Programmatically retrieve data from the datastore with SPARQL using either PHP, Ruby or Python.
Convert retrieved data into an object or datatype that can be used by the chosen programming language (e.g. a Python dictionary).

PS it may also be worth checking out these posts from Kingsley Idehen:
SPARQL Guide for the PHP Developer
SPARQL Guide for Python Developer
SPARQL Guide for the Javascript Developer
SPARQL for the Ruby Developer

Augmenting OU/BBC Co-Pro Programme Data With Semantic Tags

For what it’s worth, I’ve been looking over some of the programmes that the OU co-produces with the BBC to see what sorts of things we might be able to do in Linked Data space to make appropriate resources usefully discoverable for our students and alumni.

With a flurry of recent activity appearing on the wires relating to the OU Business School Alumni group on LinkedIn, the OU’s involvement with business related programming seemed an appropriate place to start: the repeating Radio 4 series The Bottom Line has a comprehensive archive of previous programmes available via iPlayer, and every so often a Money Programme special turns up on BBC2. Though not an OU/BBC co-pro, In Business also has a comprehensive online archive; this may contain the odd case study nugget that could be useful to an MBA student, so provides a handy way of contrasting how we might reuse “pure” BBC resources compared to the OU/BBC co-pros such as The Bottom Line.

Top tip [via Tom Scott/@derivadow]: do you know about the hack whereby http://www.bbc.co.uk/programmes/$string searches programme titles for $string?

So what to do? Here’s a starter for ten: each radio programme page on BBC /programmes seems to have a long, medium and short synopsis of the programme as structured data (simply add .json to the end of the programme URL to see the JSON representation of the data, .xml for the XML, etc.).

For example, http://www.bbc.co.uk/programmes/b00vy3l1 maps on to http://www.bbc.co.uk/programmes/b00vy3l1.json and http://www.bbc.co.uk/programmes/b00vy3l1.xml

Looking through the programme descriptions for The Bottom Line, they all seem to mention the names and corporate affiliations of that week’s panel members, along with occasional references to other companies. As the list of company names is to all intents and purposes a controlled vocabulary, and given that personal names are often identifiable from their syntactic structure, it’s no surprise that one of the best developed fields for automated term extraction and semantic tagging is business related literature. Which means that there are services out there that should be good at finessing/extracting high quality metadata from things like the programme descriptions for The Bottom Line.

The one I opted for was Reuters OpenCalais, simply because I’ve been meaning to play with this service for ages. To get a feel for what it can do, try pasting a block of text into this OpenCalais playground: OpenCalais Viewer demo

If you look at the extracted tags in the left hand sidebar, you’ll see personal names and company names have been extracted, as well as the names of people and their corporate position.

Here’s a quick script to grab the data from Open Calais (free API key required) using the Python-Calais library:

from calais import Calais
import urllib
from xml.dom import minidom

# A (free) OpenCalais API key is needed here
calaisKey='YOUR_CALAIS_API_KEY'
calais = Calais(calaisKey, submitter="python-calais ouseful")

# Grab the XML version of the programme page and pull out the long synopsis
url='http://www.bbc.co.uk/programmes/b00vrxx0.xml'
dom=minidom.parse(urllib.urlopen(url))

desc=dom.getElementsByTagName('long_synopsis')[0].childNodes[0].nodeValue

print desc

# Pass the synopsis text to OpenCalais for entity and relation extraction
result = calais.analyze(desc)

result.print_entities()
result.print_relations()

print result.entities
print result.simplified_response

(I really need to find a better way of parsing XML in Python…what should I be using..? Or I guess I could have just grabbed the JSON version of the BBC programme page?!)
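(For what it's worth, here's roughly what grabbing the JSON flavour of the page might look like instead – I'm assuming the long synopsis sits under a top level programme key:)

import urllib
import simplejson

# Grab the JSON representation of the programme page rather than the XML
url = 'http://www.bbc.co.uk/programmes/b00vrxx0.json'
data = simplejson.load(urllib.urlopen(url))
# Assumed location of the long synopsis in the JSON structure
desc = data['programme']['long_synopsis']
print desc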

That’s step one, then: grabbing a long synopsis from a BBC radio programme /programmes page, and running it through the OpenCalais tagging service. The next step is to run all the programmes through the tagger, and then have a play. A couple of things come to mind for starters – building a navigation scheme that lets you discover programmes by company name, or sector; and a network map looking at the co-occurrence of companies on different programmes just because…;-)

See also: Linked Data Without the SPARQL – OU/BBC Programmes on iPlayer

The Problem With Linked Data is That Things Don’t Quite Link Up

A v. quick post this one, because I have other stuff that really needs to be done, but it’s something I want to record as another couple of observations around the practical difficulties of engaging with Linked Data…

Firstly, identifiers for things most of us would probably call councils. The Guardian Datablog has just published data/details of the local council cuts. The associated Datastore Spreadsheet has a column containing council identifiers, as well as the council names:

Datastore spreadsheet - council cuts

Adding formal identifiers such as these is something I keep hassling Simon Rogers and the @datastore team about, so it’s great to see the use of a presumably standardised identifier there:-) Only – I can’t see how to join it up to any of the other myriad identifiers that seem to exist for council areas?

So for example, looking up Trafford on the National Statistics Linked Data endpoint identifies it as local-authority-district/00BU and Local education authority 358 – I can’t find R342 anywhere? Nor does R342 appear as an identifier on the OpenlyLocal page for Trafford Council, which is another default place I go to look for bridging/linking information (but then, maybe a local authority is not a council?)

(A use case for the data might be taking the codes and using them to colour areas on an Ordnance Survey OpenSpace map (ans. 1.17)… This requires a bridge into the namespaces the OS mapping tools recognise.)

I can google “Trafford R342” and find a couple of other references to this association, but I can’t find a way of linking to entities I know about in the Linked Data world?

But then, maybe the R*** areas don’t match any of the administrative areas that are recorded in any of the other data sources I found…?

So I have an identifier, but I don’t know what it actually refers to/links to, and I don’t know how to make use of it?

And then there’s a second related problem – a mismatch between the popular understanding of a term/concept and its formal use in a defined ontology, which can cause all sorts of problems when naively trying to make use of formally defined data…

Take, for example, the case of counties. Following a brief Twitter exchange this morning with the ever helpful @gothwin, it turns out that if you live in somewhere like Southampton (or another unitary authority or metropolitan district), you don’t live in a county… (for example – compare the Ordnance Survey pages for postcode areas SO16 4GU and EX1 1HD). The notion of counties is apparently just a folk convention now, although the Association of British Counties is trying to “promote awareness of the continuing importance of the 86 historic (or traditional) Counties of Great Britain… contend[ing] that Britain needs a fixed popular geography, one divorced from the ever changing names and areas of local government but, instead, one rooted in history, public understanding and commonly held notions of cultural identity.” Which is why they “seek to fully re-establish the use of the Counties as the standard popular geographical reference frame of Britain and to further encourage their use as a basis for social, sporting and cultural activities”. (@gothwin did hint that OS might be “look[ing] at publishing a ‘people’s geography’ with traditional counties”.)

As it is, for a naive developer, (or random tinkerer, such as myself), struggling to get to grips with the mechanics of Linked Data, it seems that to make any use at all of government Linked Data, you also need a pretty good grasp of the data models before you randomly try hacking together queries or linking stuff together, as the nightmare exposure I had to COINS Linked Data suggests… ;-)

In other words, there are at least two major barriers to entry to using government Linked Data: on the one hand, there’s getting comfortable enough with things like SPARQL to be able to navigate Linked Data datasets and put together sensible queries (the technical problem); on the other hand, there’s understanding the data model and the things it models well enough to articulate even natural language questions that might be asked of a dataset (a domain expertise problem). (And as we try to link across datasets, the domain expertise problem just compounds?) Then all that remains is mapping the natural language query onto the formal query, given the definitions of ontologies being used…

(I know, I know – it’s always rash to query data you don’t understand… but I think a point I’m trying to make is that getting your head round Linked Data is made doubly difficult when things don’t work not because of the way you’ve written the query, but because you don’t understand the way the data has been modeled… (which ends up meaning it is a problem with the way you wrote the query, just not the way you thought…!))

data.open.ac.uk Linked Data Now Exposing Module Information

As HE becomes more and more corporatised, I suspect we’re going to see online supermarkets appearing that help you identify – and register on – degree courses in exchange for an affiliate/referral fee from the university concerned. For those sites to appear, they’ll need access to course catalogues, of course. UCAS currently holds the most comprehensive one that I know of, but it’s a pain to scrape and all but useless as a datasource. But if the universities publish course catalogue information themselves in a clean way (and ideally, a standardised way), it shouldn’t be too hard to construct aggregation sites ourselves…

So it was encouraging to see earlier this week an announcement that the OU’s data.open.ac.uk site has started publishing module data from the course catalogue – that is, data about the modules (as we now call them – they used to be called courses) that you can study with the OU.

The data includes various bits of administrative information about each module, the territories it can be studied in, and (most importantly?!) pricing information;-)

data.open.ac.uk - module data

You may remember that the data.open.ac.uk site itself launched a few weeks ago with the release of Linked Data sets including data about deposits in the open repository, as well as OU podcasts on iTunes (data.open.ac.uk Arrives, With Linked Data Goodness). Where podcasts are associated with a course, the magic of Linked Data means that we can easily get to the podcasts via the course/module identifier:

data.open.ac.uk

It’s also possible to find modules that bear an isSimilarTo relation to the current module, where isSimilarTo means (I think?) “was also studied by students taking this module”.

As an example of how to get at the data, here’s a Python script using the Python YQL library that lets me run a SPARQL query over the data.open.ac.uk course module data (the code includes a couple of example queries):

import yql

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

endpoint='http://data.open.ac.uk/query'

# This query finds the identifiers of postgraduate technology courses that are similar to each other
q1='''
select distinct ?x ?z from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z
} limit 10
'''

# This query finds the names and course codes of 
# postgraduate technology courses that are similar to each other
q2='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

# This query finds the names and course codes of 
# postgraduate courses that are similar to each other
q3='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

result=run_sparql_query(q3, endpoint)

for row in result.rows:
	for r in row['result']:
		print r

I’m not sure what purposes we can put any of this data to yet, but for starters I wondered just how connected the various postgraduate courses are based on the isSimilarTo relation. Using q3 from the code above, I generated a Gephi GDF/network file using the following snippet:

# Generate a Gephi GDF file showing connections between 
# modules that are similar to each other
fname='out2.gdf'
f=open(fname,'w')

f.write('nodedef> name VARCHAR, label VARCHAR, title VARCHAR\n')
ccodes=[]
for row in result.rows:
	for r in row['result']:
		if r['code1']['value'] not in ccodes:
			ccodes.append(r['code1']['value'])
			f.write(r['code1']['value']+','+r['code1']['value']+',"'+r['name1']['value']+'"\n')
		if r['code2']['value'] not in ccodes:
			ccodes.append(r['code2']['value'])
			f.write(r['code2']['value']+','+r['code2']['value']+',"'+r['name2']['value']+'"\n')
		
f.write('edgedef> c1 VARCHAR, c2 VARCHAR\n')
for row in result.rows:
	for r in row['result']:
		#print r
		f.write(r['code1']['value']+','+r['code2']['value']+'\n')

f.close()

to produce the following graph. (Size is out degree, colour is in degree. Edges go from ?x to ?z. Layout: Fruchterman Reingold, followed by Expansion.)

OU postgrad courses in gephi

The layout style is a force directed algorithm, which in this case has had the effect of picking out various clusters of highly connected courses (so for example, the E courses are clustered together, as are the M courses, B courses, T courses and so on.)

If we run the ego filter over this network on a particular module code, we can see which modules were studied alongside it:

ego filter on course codes

Note that in the above diagram, the nodes are sized/coloured according to in-degree/out-degree in the original, complete graph. If we re-calculate those measures on just this partition, we get the following:

Recoloured course network

If we return to the whole network, and run the Modularity class statistic, we can identify several different course clusters:

Modules - modularity class

Here’s one of them expanded:

A module cluster

Here are some more:

Course clusters

I’m not sure what use any of this is, but if nothing else, it shows there’s structure in that data (which is exactly what we’d expect, right?;-)

PS as to how I wrote my first query on this data, I copied the ‘postgraduate modules in computing’ example query from data.open.ac.uk:

http://data.open.ac.uk/query?query=select%20distinct%20%3Fx%20from%20%3Chttp://data.open.ac.uk/context/course%3Ewhere%20{%3Fx%20a%20%3Chttp://purl.org/vocab/aiiso/schema%23Module%3E.%0A%3Fx%20%3Chttp://data.open.ac.uk/saou/ontology%23courseLevel%3E%20%3Chttp://data.open.ac.uk/saou/ontology%23postgraduate%3E.%0A%3Fx%20%3Chttp://purl.org/dc/terms/subject%3E%20%3Chttp://data.open.ac.uk/topic/computing%3E%0A}%0A&limit=200

and pasted it into a tool that “unescapes” encoded URLs, which reveals the SPARQL query:

Unescaping text

I was then able to pull out the example query:
select distinct ?x from <http://data.open.ac.uk/context/course>
where {?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/computing>
}
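(Python will do the unescaping too, if you’d rather not leave the comfort of a script – a quick sketch:)

import urllib

# Paste the encoded query URL from above in place of this truncated example
encoded_url = 'http://data.open.ac.uk/query?query=select%20distinct%20%3Fx%20from%20...'
print urllib.unquote(encoded_url)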

Just by the by, there’s a host of other handy text tools at Text Mechanic.

Accessing Linked Data in Scraperwiki via YQL

A comment from @frabcus earlier today alerted me to the fact that the Scraperwiki team had taken me up on my suggestion that they make the Python YQL library available in the Scraperwiki environment, so I thought I ought to come up with an example of using it…

YQL provides a general purpose standard query interface “to the web”, interfacing with all manner of native APIs and providing a common way of querying them and receiving responses from them. YQL is extensible too – if there isn’t a wrapper for your favourite API, you can write one yourself and submit it to the community. (For a good overview of the rationale for, and philosophy behind, YQL, see Christian Heilmann’s the Why of YQL.)

Browsing through the various community tables, I found one for handling SPARQL queries. The YQL wrapper expects a SPARQL query and an endpoint URL, and will return the results in the YQL standard form. (Here’s an example SPARQL query in the YQL developer console using the data.gov.uk education datastore.)

The YQL query format is:
select * from sparql.search where query="YOUR_SPARQL_QUERY" and service="SPARQL_ENDPOINT_URL"
and can be called in Python YQL in the following way (Python YQL usage):

import yql

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

For a couple of weeks now, I’ve been looking for an opportunity to try to do something – anything – with the newly released Ordnance Survey Linked Data (read @gothwin’s introduction to it for more details: /location /location /location – exploring Ordnance Survey Linked Data – Part 2).

One of the things the OS Linked Data looks exceedingly good for is acting as glue, mapping between different representations for geographical and organisational areas; the data can also return regions that neighbour on a region, which could make for some interesting “next door to each other” ward, district or county level comparisons.

One of the most obvious ways in to the data is via a postcode. The following Linked Data query to the Ordnance Survey SPARQL endpoint (http://api.talis.com/stores/ordnance-survey/services/sparql) returns the OS district ID, ward name and district name that a postcode exists in:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?wardname ?districtname where { <http://data.ordnancesurvey.co.uk/id/postcodeunit/MK76AA>
postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}

Here it is running in the YQL developer console:

OS postcode query in YQL developer console

(Just by the by, we can create a query alias for that query if we want, by changing the postcode (MK76AA in the example) to @postcode. This gives us a URL argument/variable called postcode whose value gets substituted into the query whenever we call it:

http://query.yahooapis.com/v1/public/yql/psychemedia/ospostcodelookupdemo1?postcode=INSERTPOSTCODEHERE&env=http://datatables.org/alltables.env

[Note we manually need to add the environment variable &env=http://datatables.org/alltables.env to the URL created by the query alias generator/wizard.]

YQL query alias for SPARQL query
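(And a quick sketch of calling that query alias from Python – format=json is a standard YQL parameter asking for a JSON response, and the alias URL is the one generated above:)

import urllib
import simplejson

alias = 'http://query.yahooapis.com/v1/public/yql/psychemedia/ospostcodelookupdemo1'
params = urllib.urlencode({'postcode': 'MK76AA',
                           'env': 'http://datatables.org/alltables.env',
                           'format': 'json'})
data = simplejson.load(urllib.urlopen(alias + '?' + params))
print data['query']['results']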

So… that’s SPARQL in YQL – but how can we use it in Scraperwiki? The newly added YQL wrapper makes it easy… here’s an example, based on the above:

import scraperwiki
import yql

os_endpoint='http://api.talis.com/stores/ordnance-survey/services/sparql'

os_query='''
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?wardname ?districtname where {
<http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}
'''
postcode="MK7 6AA"

os_query=os_query.replace('MAGIC_POSTCODE',postcode.replace(' ',''))

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

result=run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print postcode,'is in the',row['result']['wardname']['value'],'ward of',row['result']['districtname']['value']
    record={ "id":postcode, "ward":row['result']['wardname']['value'],"district":row['result']['districtname']['value']}
    scraperwiki.datastore.save(["id"], record) 

I use the MAGIC_POSTCODE substitution to give me the freedom to create a procedure that will take in a postcode argument and add it into the query. Note that I am probably breaking all sorts of Linked Data rules by constructing the URL that uniquely identifies (reifies?) the postcode in the Ordnance Survey URL namespace (that is, I construct something like <http://data.ordnancesurvey.co.uk/id/postcodeunit/MK76AA>, which contravenes the “URIs are opaque” rule that some folk advocate, but I’m a pragmatist;-)

Anyway, here’s a Scraperwiki example that scrapes a postcode from a web page, and looks up some of its details via the OS: simple Ordnance Survey Linked Data postcode lookup

The next thing I wanted to do was use two different Linked Data services. Here’s the setting. Suppose I know a postcode, and I want to lookup all the secondary schools in the council area that postcode exists in. How do I do that?

The data.gov.uk education datastore lets you look up schools in a council area given the council ID. Simon Hume gives some example queries to the education datastore here: Using SPARQL & the data.gov.uk school data. The following is a typical example:

prefix sch-ont: <http://education.data.gov.uk/def/school/>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/00MG> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <http://education.data.gov.uk/def/school/PhaseOfEducation_Secondary> .
}

Here, the secondary schools are being identified according to the district area they are in (00MG in this case).

But all I have is the postcode… Can Linked Data help me get from MK7 6AA to 00MG (or more specifically, from <http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> to <http://statistics.data.gov.uk/id/local-authority-district/00MG>?)

Here’s what the OS knows about a postcode:

What the OS knows about a postcode

If we click on the District link, we can see what the OS knows about a district:

Local authority area code lookup in OS Linked Data

The Census Code corresponds to the local council id code used in the Education datastore (thanks to John Goodwin for pointing that out…). The identifier doesn’t provide a Linked Data URI, but we can construct one out of the code value:
<http://statistics.data.gov.uk/id/local-authority-district/MAGIC_DISTRICTCODE>

(Note that the statistics.data.gov.uk lookup on the district code does include a sameAs URL link back to the OS identifier.)

Here’s how we can get hold of the district code – it’s the admingeo:hasCensusCode you’re looking for:

os_query='''
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX admingeo: <http://data.ordnancesurvey.co.uk/ontology/admingeo/>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?nsdistrict ?wardname ?districtname where {
<http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname .
?district admingeo:hasCensusCode ?nsdistrict.
}
'''

postcode='MK7 6AA'
os_endpoint='http://api.talis.com/stores/ordnance-survey/services/sparql'
os_query=os_query.replace('MAGIC_POSTCODE',postcode.replace(' ',''))

result=run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print row['result']['nsdistrict']['value']
    districtcode=row['result']['nsdistrict']['value']
    print postcode,'is in the',row['result']['wardname']['value'],'ward of',row['result']['districtname']['value']
    record={ "id":postcode, "ward":row['result']['wardname']['value'],"district":row['result']['districtname']['value']} 

So what does that mean? Well, we managed to look up the district code from a postcode using the Ordnance Survey API, which means we can insert that code into a lookup on the education datastore to find schools in that council area:

import scraperwiki
import yql

def run_sparql_query(query, endpoint):
    '''
    # The following string replacement construction may be handy
    query = 'select * from flickr.photos.search where text=@text limit 3';
    y.execute(query, {"text": "panda"})
    '''
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

edu_endpoint='http://services.data.gov.uk/education/sparql'    

edu_query='''
prefix sch-ont:  <http://education.data.gov.uk/def/school/>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/MAGIC_DISTRICTCODE> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <http://education.data.gov.uk/def/school/PhaseOfEducation_Secondary>.
}
'''
districtcode='00MG'
edu_query=edu_query.replace('MAGIC_DISTRICTCODE',districtcode)
result=run_sparql_query(edu_query, edu_endpoint)
for row in result.rows:
    for school in row['result']:
        print school['name']['value'],school['reference']['value'],school['date']['value']
        record={ "id":school['reference']['value'],"name":school['name']['value'],"openingDate":school['date']['value']}
        scraperwiki.datastore.save(["id"], record)

Here’s a Scraperwiki example showing the two separate Linked Data calls chained together (click on the “Edit” tab to see the code).

Linked Data in Scraperwiki

Okay – so that was easy enough (?!;-). We’ve seen how:
– Scraperwiki supports calls to YQL;
– how to make SPARQL/Linked Data queries from Scraperwiki using YQL;
– how to get data from one Linked Data query and use it in another.

A big problem though is how do you know whether there is a linked data path from a data element in one Linked Data store (e.g. from a postcode lookup in the Ordnance Survey data) through to another datastore (e.g. district area codes in the education datastore), when you are a mere mortal and not a Linked Data guru?! Answers on the back of a postcard, please, or via the comments below;-)

PS whilst doing a little digging around, I came across some geo-referencing guidance on the National Statistics website that suggests that postcode areas might change over time (they also publish current and previous postcode info). So what do we assume about the status (currency, validity) of the Ordnance Survey postcode data?

PPS Just by the by, this may be useful to folk looking for Linked Data context around local councils: @pezholio’s First steps to councils publishing their own linked data

Linked Data and the Leaders’ Debate – My Challenge

Over the last few weeks, there has been a smattering of challenges to the Linked Data community baiting them to demonstrate some of the utility of the Linked Data approach (e.g. A Challenge To Linked Data Developers (followed up in Response To My Linked Data Challenge) and Linked Data: my challenge, with some other possibilities here: 10 Ideas For Web of Data Apps).

So here’s my challenge, inspired by the #leadersdebate last night:


Each party should be required to tweet a link to a #datagovuk SPARQL query to justify every “factual” claim they make.

For example – can someone write me a Linked Data query to show how much is spent by the government on UK education quangoes?

PS doing something similar for any claims made in the manifestos should also count…

So What Is It About Linked Data that Makes it Linked Data™?

If you’ve been to any conferences lately where Linked Data has been on the agenda, you’ll probably have seen the four principles of Linked Data (I grabbed the following from Wikipedia…)

1. Use URIs to identify things.
2. Use HTTP URIs so that these things can be referred to and looked up (“dereference”) by people and user agents.
3. Provide useful information (i.e., a structured description — metadata) about the thing when its URI is dereferenced.
4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.

Wot, no RDF? ;-) (For the original statement of the four rules, see Tim Berners-Lee’s Design Issues: Linked Data personal note, which does mention RDF.)

Anyway – here’s my take on what we have… building on my Parliamentary Committees Treemap, I thought I’d do something similar for the US 111th Congress Committees to produce something like this map for the House:

US 111th Congress committees

I reused an algorithm I’d used to produce the UK Parliamentary committee maps:

– grab the list of committees;
– for each committee, grab the membership list for that committee.

That is, I want to annotate one dataset with richer information from another one; I want to link different bits of data together…

The “endpoint” I used to make the queries for the Congress committee map was the New York Times Congress API.

The quickest way (for me) to get the data was to use a couple of Yahoo Pipes. Firstly, here’s one that will get a list of committee members from a 111th Congress House committee given its committee code (it’s left as an exercise for the reader to generalise this pipe to also accept chamber and congress number arguments ;-)

I get the data using a URL. Here’s what one looks like:
http://api.nytimes.com/svc/politics/v3/us/legislative/congress/111/house/committees/HSAG.xml?api-key=MY_KEY

So given a committee code, we can get a list of members. Here’s what a single member’s record looks like:

rank_in_party: 5
name: Neil Abercrombie
begin_date: 2009-01-07
id: A000014
party: D
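(The pipe isn’t doing much more than the following sketch – fetch the committee XML and pull out each member’s details; the XML element names are guessed from the fields shown above, and MY_KEY stands in for a real NYT API key:)

import urllib
from xml.dom import minidom

def committee_members(committee_id, congress=111, chamber='house', key='MY_KEY'):
    # Fetch the committee record and pull out the details of each member
    url = 'http://api.nytimes.com/svc/politics/v3/us/legislative/congress/%s/%s/committees/%s.xml?api-key=%s' % (congress, chamber, committee_id, key)
    dom = minidom.parse(urllib.urlopen(url))
    members = []
    for m in dom.getElementsByTagName('member'):
        member = {}
        for field in ['id', 'name', 'party', 'rank_in_party', 'begin_date']:
            nodes = m.getElementsByTagName(field)
            if nodes and nodes[0].childNodes:
                member[field] = nodes[0].childNodes[0].nodeValue
        members.append(member)
    return members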

If I wanted to annotate these details further, there is also a list of House members that return records of the form:

id: A000014
api_uri: http://api.nytimes.com/svc/politics/v3/us/legislative/congress/members/A000014.json
first_name: Neil
middle_name: null
last_name: Abercrombie
party: D
seniority: 22
state: HI
district: 1
missed_votes_pct: 12.81
votes_with_party_pct: 98.27

I can grab a single member record using a URL of the form:
http://api.nytimes.com/svc/politics/{version}/us/legislative/congress/members/{member-id}[.response-format]?api-key=MY_KEY

Now, where can I get a list of committees?

From a URL like this one
http://api.nytimes.com/svc/politics/v3/us/legislative/congress/111/house/committees.xml?api-key=MY_KEY

The data returned has the form:

chair: P000258
url: http://agriculture.house.gov/
name: Committee on Agriculture
id: HSAG

Here’s how I grab the committee listing and then augment each committee with its members:
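(In code terms, the top level lookup is doing something like the following sketch, reusing the committee_members() helper from the sketch above – again, the committee element names are guesses from the record fields shown earlier:)

import urllib
from xml.dom import minidom

def committees_with_members(congress=111, chamber='house', key='MY_KEY'):
    # Grab the committee listing, then annotate each committee with its membership list
    url = 'http://api.nytimes.com/svc/politics/v3/us/legislative/congress/%s/%s/committees.xml?api-key=%s' % (congress, chamber, key)
    dom = minidom.parse(urllib.urlopen(url))
    committees = []
    for c in dom.getElementsByTagName('committee'):
        cid = c.getElementsByTagName('id')[0].childNodes[0].nodeValue
        name = c.getElementsByTagName('name')[0].childNodes[0].nodeValue
        committees.append({'id': cid, 'name': name,
                           'members': committee_members(cid, congress, chamber, key)})
    return committees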

Although I don’t directly have an identifier in the form of a URL for the membership list of a committee, I know how to generate one from a URL pattern and a committee ID. The pattern generalises around the chamber (House or Senate) and Congress number as well:
http://api.nytimes.com/svc/politics/{version}/us/legislative/congress/{congress-number}/{chamber}/committees[/committee-id][.response-format]?api-key=MY_KEY

So I think this counts as linkable data, and we might even call it linked data. If I work within a closed system, like the pipes environment, then using “local” identifiers, such as committee ID, chamber and congress number, I can generate a URL style identifier that works as a web address.

But can we call the above approach a Linked Data™ approach?

1. Use URIs to identify things.
This works for the committee membership lists, the list of committees and individual members, if required.

2. Use HTTP URIs so that these things can be referred to and looked up (“dereference”) by people and user agents.
Almost – at the moment the views are XML or JSON (no human readable HTML), but at least in the committee list there’s a link to a human audience web page.

3. Provide useful information (i.e., a structured description — metadata) about the thing when its URI is dereferenced.
The members’ records are useful, and the committee records do describe the name of the committee, along with its identifier. But the info that makes a committee record uniquely identifiable exists “above” the individual committee record (e.g. the congress number and the chamber). In a closed pipes environment, such as the one described above, if we can propagate the context (committee id, chamber, congress number), we can uniquely identify resources using dereferenceable HTTP URIs (i.e. things that work as web addresses) using a URI pattern and local context.

4. Include links to other, related URIs in the exposed data to improve discovery of other related information on the Web.
Yes, we have some of that…

So, the starter for ten: do we have an example of Linked Data™ here? Note there is no RDF and no SPARQL endpoint exposed to me as a user. But I’ve had to use connective tissue to annotate one HTTP URI identified resource (the committee list) with results from a family of other HTTP URI identified resources (the membership lists). I could have gone further and annotated each member record with data from the “member’s info” family of HTTP URIs.

The “top level” pipe is a “linking query”. If I had constructed it slightly differently, I could have passed in a chamber and congress number and it would have:
– constructed an HTTP URI to look up a list of committees for that chamber in that Congress; (this was a given in the pipe shown above);
– grabbed the list of committees;
– annotated with them with membership lists.

As it is, the pipe contains “assumed” context (the congress number and chamber), as well as the elephant in the room assumption – that I’m making queries on the NYT Congress API.

On reflection, this is perhaps bad practice. The congress number and chamber are hidden assumptions within the pipe. The URL pattern that the NYT Congress API defines explicitly identifies mutable elements/parameters:

http://api.nytimes.com/svc/politics/{version}/us/legislative/congress/{congress-number}/{chamber}/committees[/committee-id][.response-format]?api-key={your-API-key}

Which suggests that maybe best practice would be to pass local context data via user parameters throughout the pipework to guarantee a shared local context within child pipes?
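(In other words, something like this trivial sketch, where the local context is passed explicitly and the documented URL pattern mints the dereferenceable identifier – the function name is mine:)

NYT_COMMITTEE_PATTERN = 'http://api.nytimes.com/svc/politics/v3/us/legislative/congress/%(congress)s/%(chamber)s/committees/%(committee_id)s.xml?api-key=%(key)s'

def committee_uri(congress, chamber, committee_id, key='MY_KEY'):
    # Local context (congress, chamber, committee id) in, web-addressable identifier out
    return NYT_COMMITTEE_PATTERN % {'congress': congress, 'chamber': chamber,
                                    'committee_id': committee_id, 'key': key}

print committee_uri(111, 'house', 'HSAG')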

So where am I coming from with all this?

I’m happy to admit that I can see how it’s really handy having universal, unique URIs that resolve to web pages or other web content. But I also think that local identifiers can fulfil the same role if you can guarantee the context as in a Yahoo Pipe or a spreadsheet (e.g. Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula)).

So for example, in the OU we have course codes which can play a very powerful role in linking resources together (e.g. OU Course Codes – A Web 2.OU Crown Jewel). I’ve tended to use the phrase “pivot point” to describe the sorts of linking I do around tags, or course codes, or the committee identifiers described in this post and then show how we can use these local or partial identifiers to access resources on other websites that use similar pivot points (or “keys”). (ISBNs are a great one for this, as ISBN Playground shows.)

If Linked Data™ zealots continue to talk about Linked Data solely in terms of RDF and SPARQL, I think they will put off a lot of folk who are really excited about the idea of trying to build services across distributed (linkable) datasets… IMVHO, of course…

My name’s Tony Hirst, I like linking things together, but RDF and SPARQL just don’t cut it for me…

PS this is relevant too: Does ‘Linked Data’ need human readable URIs?

PPS Have you taken my poll yet? Getting Started with data.gov.uk… or not…