Last Week’s Football Reports from the Guardian Content Store API (with a little dash of SPARQL)

A big :-) from me today – at last I think I’ve started to get my head round this mashup malarkey properly… forget the re-presentation stuff, the real power comes from using one information source to enrich another… but as map demos are the sine qua non of mashup demos, I’ll show you what I mean with a map demo…

So to start, here’s a simple query on the Guardian content store API for football match reports:

http://api.guardianapis.com/content/search?
filter=/football&filter=/global/matchreports&after=20090314&api_key=MYSECRETACTIVATEDKEY

It’s easy enough to construct the query URI using a relative date in the Yahoo pipe, so the query will always return the most recent match reports (in this case, match reports since “last Saturday”):
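Out of the Pipes context, here’s a minimal sketch of the same “relative date” trick in Python – the URL and parameters are exactly the ones shown above (note that this early version of the Guardian content API has long since been retired, so treat the sketch as illustrative):

from datetime import date, timedelta

# days back to the most recent Saturday (Monday=0 ... Saturday=5);
# if today is Saturday, go a full week back
today = date.today()
days_back = (today.weekday() - 5) % 7 or 7
last_saturday = today - timedelta(days=days_back)

url = ("http://api.guardianapis.com/content/search?"
       "filter=/football&filter=/global/matchreports"
       "&after=" + last_saturday.strftime("%Y%m%d")
       + "&api_key=MYSECRETACTIVATEDKEY")
print(url)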

It’s easy enough to use these results to generate an RSS feed of the most recent match reports:

Pulling the images in as Media RSS (eg media:group) elements means that things like the Google Ajax slide show control and the Pipes previewer can automatically generate a slideshow for you…

You can also get the straight feed of course:

A little bit of tinkering with the creation of the description element means we can bring the original byline and match score in to the description too:

Inspecting the API query results by eye, you might notice that a lot of the bylines have the form “John Doe at the Oojamaflip Stadium”:

Hmmm…

It’s easy enough to exploit this structural pattern to grab the stadium name using a regular expression or two:
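By way of illustration, here’s a minimal sketch in Python of the sort of pattern matching involved (an approximation, not the actual regex blocks from the pipe):

import re

byline = "John Doe at the Oojamaflip Stadium"
m = re.search(r'\bat (?:the )?(.+)$', byline)
if m:
    print(m.group(1))  # Oojamaflip Stadium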

I then did a little experiment running the names of the stadia, and the names of the stadia plus football ground, UK, through the Yahoo Location Extractor block to try to plot the stories on map locations corresponding to the football ground locations, but the results weren’t that good…

…so I tweeted:

And got a couple of responses…

The XQuery/DBpedia with SPARQL – Stadium locations link looked pretty interesting, so I tweaked the example query on that page to return a list of English football stadia and their locations:

PREFIX p: <http://dbpedia.org/property/>
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{?ground skos:subject <http://dbpedia.org/resource/Category:Football_venues_in_England>.
?ground geo:long ?long.
?ground geo:lat ?lat.
?ground rdfs:label ?groundname.
FILTER (lang(?groundname) ='en').
}

and created a pipe to call DBpedia with that query (dbpedia example – English football stadium location lookup pipe):
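If you’d rather skip Pipes altogether, the same query can be run directly against the public DBpedia SPARQL endpoint. Here’s a rough Python sketch – the endpoint URL and format parameter are standard DBpedia conventions rather than anything from the pipe, and bear in mind that the category predicate used here (skos:subject) has changed on DBpedia over the years:

import json
import urllib.parse, urllib.request

query = '''
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{?ground skos:subject <http://dbpedia.org/resource/Category:Football_venues_in_England>.
?ground geo:long ?long.
?ground geo:lat ?lat.
?ground rdfs:label ?groundname.
FILTER (lang(?groundname) ='en').
}
'''

# run the query via a simple GET request and walk the JSON results
url = "http://dbpedia.org/sparql?" + urllib.parse.urlencode(
    {"query": query, "format": "application/sparql-results+json"})
results = json.load(urllib.request.urlopen(url))
for b in results["results"]["bindings"]:
    print(b["groundname"]["value"], b["lat"]["value"], b["long"]["value"])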

Because I don’t know how to write SPARQL, I wasn’t sure how to tweak the query to just return the record for a given stadium name (feel free to comment telling me how ;-) – so instead I used a pipe filter block to filter the results. (This combination of search and filter can be a very powerful one when you don’t know how to phrase a particular query, or when a query language doesn’t support a search limit you want.)

It was now a simple matter to add this pipe in to geocode the locations of the appropriate stadium for each match report:

So let’s recap – we call the Guardian content API for match reports since “last Saturday” and construct a nice RSS feed from it, with description text that includes the byline and match score, as well as the match report. Then we pull out the name of the stadium each match was played at (relying on the convention, which seems to hold much of the time, that the byline records the stadium) and pass it through another pipe that asks DBpedia for a list of UK football stadium locations, and then filters out the one we want.

Tweak the location data to a form Yahoo Pipes likes (which means it will create a nice GeoRSS or KML feed for us) and what do we get? Map based match reports:

As I’ve shown in this blog many times before, it’s easy enough to grab a KML feed from the More options pipe output and view the results elsewhere:

(Click on a marker on the Google map and it will pop up the match report.)

So what do we learn from this? Hmmm – that I need to learn to speak SPARQL, maybe?!

PS @kitwallace has come up trumps with a tweak to the SPARQL query that will do the query by stadium name in one:
FILTER (lang(?groundname) ='en' && regex(?groundname,'Old Trafford')). Ta, muchly :-)
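For completeness, splicing that FILTER into the query above gives the one-step stadium lookup:

PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
SELECT * WHERE
{?ground skos:subject <http://dbpedia.org/resource/Category:Football_venues_in_England>.
?ground geo:long ?long.
?ground geo:lat ?lat.
?ground rdfs:label ?groundname.
FILTER (lang(?groundname) ='en' && regex(?groundname,'Old Trafford')).
}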

Getting Started with data.gov.uk, Triplr SPARYQL and Yahoo Pipes

RDF, SPARQL and the semantic web are too scary for mortals, right? So here’s a hopefully easier way in for those of us who are put off by the syntactic nightmare that defines the world of formal Linked Data: Yahoo Pipes :-)

A few weeks ago, I was fortunate enough to get one of the developer preview keys for data.gov.uk, the UK answer to Tim Berners-Lee’s call to open up public data.

The UK data.gov.uk solution is currently a hybrid – a growing set of separate Linked Data stores (hosted on the Talis Platform, I think?) covering areas such as education, finance and transport; and a set of links to CSV and Excel spreadsheets available for download on a wide variety of Government department websites. (Some of them have nice URLs, some don’t; if you think they should, how does this sound? Designing URI Sets for the UK Public Sector: Machine- and human-readable formats. Most of the data is public too – it’s just the meta-site – data.gov.uk – that I think is under wraps at the moment?)

I’ve played with online CSV and Excel spreadsheets before (e.g. in the context of the Guardian Datastore), but I’ve always found SPARQL endpoints and RDF a little bit, err, terrifying, so last week I felt it was time to bite the bullet and spend an hour or two trying to do something – anything – with some hardcore Linked Data. Or at least, try to just do something – anything – with some data out of a single data store.

So where to start? Regular readers will know that I try to use free online apps and client side Javascript code wherever possible (I don’t want to have to assume the availability of access to my own web server), so it made sense to look at Yahoo Pipes :-)

I’ve done the odd demo of how to use SPARQL in a Yahoo Pipe before (Last Week’s Football Reports from the Guardian Content Store API (with a little dash of SPARQL), which is not about the football, right?) but a tweet last week tipped me off to a potentially more abstracted way of writing SPARQL queries in a Pipes environment: Triplr’s SPARQL + YQL = SPARYQL.

Ooh…. the idea is that you can wrap a SPARQL query in a YQL query, which in turn suggests two other things…

Firstly, I now have a way I’m already familiar with of generating and debugging bookmarkable RESTful queries to SPARQL endpoints:

Secondly, the Yahoo Pipes YQL block provides a handy container for making the SPARYQL queries and pulling the results back into a pipes environment.

Here’s the pipework… (based on an original pipe by @hapdaniel)

SPARQL + YQL= SPARYQL pipe http://pipes.yahoo.com/ouseful/sparyql

So what can we do with it? Not being particularly fluent in SPARQL, I had a poke around for some examples I could cut, paste, hack and tinker with and found a few nice examples on the [n]^2 blog: SPARQLing data.gov.uk: Edubase Data

So here’s a quick demo – a pipe runs a query that looks for the 10 schools with the latest opening dates on data.gov.uk’s education datastore:

SELECT ?school ?name ?date ?easting ?northing WHERE {?school a sch-ont:School; sch-ont:establishmentName ?name; sch-ont:openDate ?date ; sch-ont:easting ?easting; sch-ont:northing ?northing . } ORDER BY DESC(?date) LIMIT 10
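For reference, wrapped up as a SPARYQL query using the community sparql.search YQL table, and with http://services.data.gov.uk/education/sparql as the education datastore endpoint, the call looks something like this:

select * from sparql.search where query="SELECT ?school ?name ?date ?easting ?northing WHERE {?school a sch-ont:School; sch-ont:establishmentName ?name; sch-ont:openDate ?date ; sch-ont:easting ?easting; sch-ont:northing ?northing . } ORDER BY DESC(?date) LIMIT 10" and service="http://services.data.gov.uk/education/sparql"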

In order to plot the schools on a map, it’s necessary to convert the eastings and northings to latitude and longitude. A cry for help on twitter was quickly responded to by @kitwallace, who gave me a link to a service that did just the job… almost – for some reason, Pipes didn’t like the output, so I had to run the query through a YQL proxy:

(Note that Kit soon came up with a fix, so I could actually just call the service directly via a Data Source block using a call of the form http://www.cems.uwe.ac.uk/xmlwiki/geo/OS2latlong2.xq?easting=527085&northing=185400.)
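These days, the coordinate conversion is easy enough to do for yourself. Here’s a sketch – nothing to do with the original pipework – using the pyproj library to transform OS National Grid eastings/northings (EPSG:27700) to WGS84 latitude/longitude:

from pyproj import Transformer

# OS National Grid easting/northing -> WGS84 longitude/latitude
transformer = Transformer.from_crs("EPSG:27700", "EPSG:4326", always_xy=True)
lon, lat = transformer.transform(527085, 185400)  # the example point above
print(lat, lon)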

Here’s the result:

Note that the KML output from the pipe can be plotted directly in a Google map (simply paste the KML URL into Google Map search box and hit return.)

By writing two or three different queries, and pulling the data separately into a web page via the JSON feed, we can easily create a map that displays the schools that have opened and closed between 1/1/08 and 1/10/09:

If we take the CSV output of the pipe, we can also see how it’s possible to transport the content into a Google spreadsheet (once again thanks to @hapdaniel for pointing out that changing the output switch of a pipe’s RSS feed from rss to csv does the format conversion job nicely):

which gives:

(Note that the CSV import seems to require quite a flat data structure (though it is trying really hard with the more hierarchical data – it’s just not quite managing to catch the data values?), so some renaming within the pipe might be required to make sure that the child attributes of each feed item do not have any children of their own. Empty attributes also need pruning.)
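For reference, the rss-to-csv switch @hapdaniel pointed out amounts to nothing more than changing the _render argument on the pipe’s run URL, along the lines of:

http://pipes.yahoo.com/pipes/pipe.run?_id=PIPE_ID&_render=csv

where PIPE_ID stands in for the ID of your own pipe.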

PS I did try importing the XML output from a RESTful YQL query into a Google spreadsheet with an =importXML formula but it didn’t seem to work. Firstly, the RESTful URI was too long (easily solved by rewriting it as a shortened URI). Secondly, the Google spreadsheet didn’t seem to like the output XML :-(

So near, yet so far… but still, it poses the question: could we write containerised queries/topic specific APIs over data.gov.uk SPARQL endpoints that expose the results in a spreadsheet capable of importing XML?

Mapping Recent School Openings and Closures

Just after I put together the pipework for Getting Started with data.gov.uk, Triplr SPARYQL and Yahoo Pipes, I also cut and pasted some of the code from a previous map based mashup to demo how to make a SPARQL call via a pipe that calls on the UK Gov education Linked Data datastore from within a web page, and then display the geocoded results on a map.

Here’s the demo – School openings and closures in the UK, 1/1/08-1/10/09

If you View Source, you’ll see the code boils down to:

//schools closed between 1/1/08 and 1/10/09
q="SELECT ?school ?name ?opendate ?closedate ?easting ?northing WHERE {?school a sch-ont:School;  sch-ont:establishmentName ?name;sch-ont:easting ?easting; sch-ont:northing ?northing; sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ; sch-ont:closeDate ?closedate ; sch-ont:openDate ?opendate . FILTER (?closedate > '2008-01-01'^^xsd:date && ?closedate < '2009-10-01'^^xsd:date)}"

u=ur+encodeURIComponent(q);
getPipeGeoData(u, 'parseJSON_purple');

In all I make three calls to a pipe that calls on the data.gov.uk education datastore, one for schools opened between 1/1/08 and 1/10/09:
SELECT ?school ?name ?opendate ?easting ?northing WHERE {?school a sch-ont:School; sch-ont:establishmentName ?name;sch-ont:easting ?easting; sch-ont:northing ?northing; sch-ont:openDate ?opendate . FILTER (?opendate > '2008-01-01'^^xsd:date && ?opendate < '2009-10-01'^^xsd:date)}

one for schools closed between 1/1/08 and 1/10/09:
SELECT ?school ?name ?opendate ?closedate ?easting ?northing WHERE {?school a sch-ont:School; sch-ont:establishmentName ?name;sch-ont:easting ?easting; sch-ont:northing ?northing; sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Closed ; sch-ont:closeDate ?closedate ; sch-ont:openDate ?opendate . FILTER (?closedate > '2008-01-01'^^xsd:date && ?closedate < '2009-10-01'^^xsd:date)}

and one for schools proposed to close:
SELECT ?school ?name ?easting ?northing ?opendate WHERE {?school a sch-ont:School; sch-ont:establishmentName ?name;sch-ont:easting ?easting; sch-ont:northing ?northing ; sch-ont:establishmentStatus sch-ont:EstablishmentStatus_Open__but_proposed_to_close; sch-ont:openDate ?opendate . }

(I cribbed how to write these queries from a Talis blog: SPARQLing data.gov.uk: Edubase Data;-)

The results of each call are displayed using different coloured markers.

(The rest of the code is really horrible. Note to self: get round to learning jQuery.)

Viewing SPARQLed data.gov.uk Data in a Google Spreadsheet

This is a stub post as much as anything to help me keep tabs on a technique I’ve not had any time to properly play with, let alone document: consuming Linked Data in a Google spreadsheet.

First up – why might this be useful? In short, I think that many people who might want to make use of data.gov.uk data are probably comfortable with spreadsheets but not with code, so giving them access to SPARQL and RDF is not necessarily useful.

So whilst the recipe shown here is a hacky one, at least it opens up the playground a little to explore what the issues might be – and what things might be facilitated by – providing fluid, live data routes from RDF triple stores into spreadsheets.

So here’s what I’m aiming for – some data from the education datastore in a spreadsheet:

And how did it get there? Via a CSV import using the =importData formula.

And where did the CSV come from? The sparqlproxy webservice:

As well as giving CSV output, the service can also generate HTML and a variety of JSON formats, including the MIT Simile and Google Viz API formats (which means it’s easy to just plug the data into a wide variety of visualisation tools).

To get the data into a Google spreadsheet, simply copy the CSV URI into an =importData(“CSV_URI_HERE”) formula in a spreadsheet cell.

The sparqlproxy service can also pull in and transform queries that have been posted on the web:

So for example, in the above case the query at http://data-gov.tw.rpi.edu/sparql/stat_ten_triples_default.sparql looks like:

What this means is that someone else can write complex queries and mortals can then access the data and display it however they want. (What I’d really like to see is a social site that supports the sharing of endpoint/query pairs for particular queries – I could probably even hack something to do this using delicious? ;-)

Once the data is in the spreadsheet, it can be played with in the normal way of course. So for example, I can query the spreadsheet using my old prototype Guardian datastore explorer:

In the above example, the route is then:

1) a sparql query onto http://services.data.gov.uk/education/sparql is run through
2) the http://data-gov.tw.rpi.edu/ws/sparqlproxy.php sparqlproxy service to produce a CSV output that is
3) pulled into a Google spreadsheet using an =importData() formula, and in turn
4) queried using my Google Datastore explorer using the Google visualisation API query language and then
5) rendered in a Google Visualisation table widget.

Lurvely… :-)

Programming Pipes With Delicious and Sharing data.gov.uk SPARQL Queries As A Result

In the post Sharing Linked Data Queries With Mortals… I described a rather clunky pattern for sharing SPARQL queries onto a data.gov.uk sparql endpoint using delicious, along with a Yahoo pipe to generate a SPARQLProxy URI that identified a CSV formatted output from the query that could be consumed in something like Google spreadsheets (e.g. Viewing SPARQLed data.gov.uk Data in a Google Spreadsheet).

In this post, I’ll refine that pattern a little more and show how to use delicious to bookmark a “processed” form of the output of the query, along with all the ingredients needed to generate that output. In a later post (hopefully before Christmas) I’ll try to show how the pattern can be used to share queries into other datastores, such as Google visualization API queries into a Google spreadsheet.

[The post describes two independent operations – firstly, how to generate a sparqlproxy uri from a set of delicious tags; secondly, how to generate a map from sparqlproxy output.]

In what follows, I’m using delicious as a database, and delicious tags as machine tags that can be used as arguments within a Yahoo pipe. The intention is not to suggest that this is even a good way of sharing SPARQL queries and demo outputs, but it does show an improvised way of sharing them. It also provides just enough raw material to allow UI designers to think about how we might capture, share and display sample use cases that go from SPARQL query to human meaningful output.

So here’s what a “finished” delicious bookmark looks like:

The three “machine tags”:
– endpoint:http://services.data.gov.uk/transport/sparql
– output:csv
– query:http://codepad.org/hPo2XIzx/raw.txt
are used as instructions in a Yahoo pipe that generates a query using SPARQLProxy:

A feed of bookmarked queries is pulled in from delicious, and checked to see that all the programming arguments that are required are there. (The sparlproxy_demo tag requirement is just a convenience.)

If all the arguments are available, their values are captured and passed to a routine to construct the SPARQLProxy URIs:
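By way of illustration, here’s a rough Python sketch of what that pipework is doing. This is my approximation rather than the pipe’s actual internals, and the parameter names here (query-uri for a web-hosted query, service-uri, output) are my best guess at SPARQLProxy’s API rather than something documented in this post:

from urllib.parse import urlencode

# machine tags from the delicious bookmark shown above
tags = [
    "endpoint:http://services.data.gov.uk/transport/sparql",
    "output:csv",
    "query:http://codepad.org/hPo2XIzx/raw.txt",
]
args = dict(tag.split(":", 1) for tag in tags)

# assemble the SPARQLProxy call from the tag values
sparqlproxy = "http://data-gov.tw.rpi.edu/ws/sparqlproxy.php"
uri = sparqlproxy + "?" + urlencode({
    "query-uri": args["query"],      # assumed parameter name for a web-hosted query
    "service-uri": args["endpoint"],
    "output": args["output"],
})
print(uri)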

The query bookmarked by the tags is a query onto the transport database that pulls out the locations of traffic monitoring points on the M5. The bookmarked URI is a demonstration of how to use the output of that query. In the current example, the bookmarked demo URI looks like this:

http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=
&q=http:%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3D1c77faa95919df9dbea678a6bf4881c6%26_render%3Dkml
&sll=37.0625,-95.677068&sspn=44.25371,70.751953&ie=UTF8&z=8

That is, it is a bookmark to a Google map that is displaying a KML feed pulled in from a Yahoo pipe.

This is the Google Map:

This is the Yahoo pipe the KML is pulled from:

And this is a peek inside that pipe:

The URI for the fetched CSV file is the one that was generated from the bookmarked query tags by the first pipe shown above.

So to recap – this bookmark:

links through to a Google map showing traffic monitoring locations on the M5. The map is an ‘end-user’ demo that shows how Government data may be displayed in a meaningful way. In addition, the tags in the bookmark carry enough information for the user to construct a SPARQL query that will generate the data that is displayed by the map. A utility Yahoo pipe (ID 22059ac6f967d7117b1262586c92b371) can take a delicious RSS feed containing the bookmark and generate a SPARQLProxy URI that calls the tagged endpoint with the tagged query and generates the tag-specified output. There is then a break in the chain. A utility pipe capable of handling SPARQLProxy generated CSV from data.gov.uk transport data generates a KML version of the data, which is then passed to a Google map.

So what? So I don’t know what… but it has got me thinking that what might be useful is a quick way of sharing:
– info that is sufficient to generate and run a particular SPARQL query.
– along with a link to an ‘end user’ demo showing how that data might be used or displayed
– all in one handy package…

Which is what the above does… sort of… though it’s still way too complicated for mortals…

PS It occurs to me that it might be possible to define a pipeline using delicious tags too? Maybe something like:
pipeline:22059ac6f967d7117b1262586c92b371?name=psychemedia&tag=transportDemo//
1c77faa95919df9dbea678a6bf4881c6?_render=kml//
gmap

where the first step in the pipeline (22059ac6f967d7117b1262586c92b371?name=psychemedia&tag=transportDemo) says “run the delicious RSS feed from psychemedia/transportDemo (assuming I had recursively bookmarked that record with transportDemo) through the pipe with ID 22059ac6f967d7117b1262586c92b371 to generate the SPARQLProxy URI”, then pass that to the pipe with ID 1c77faa95919df9dbea678a6bf4881c6, which should generate a KML output, which in turn should be sent to a Google map?!

Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula)

Disclaimer: before any Linked Data purists say I’m missing the point about what Linked Data is, does, or whatever, I don’t care, okay? I just don’t care. This is for me, and me is someone who can’t get to grips with writing SPARQL queries, can’t stand the sight of unreadable <rdf> <all:over the=”place”>, can’t even work out how to find things are queryable in a Linked Data triple store, let alone write queries that link data from one data store with data from another data store (or maybe SPARQL can’t do that yet? Not that I care about that either, because I can, in Yahoo Pipes, or Google Spreadsheets, and in a way that’s meaningful to me…)

In Mulling Over =datagovukLookup() in Google Spreadsheets, I started wondering about whether or not it would be useful to be able to write formulae to look up “live facts” in various datastores from within a spreadsheet (you know, that Office app that is used pretty much universally in the workplace whenever there is tabular data to hand. That or Access of course…)

Anyway, I’ve started tinkering with how it might work, so now I can do things like this:

The formulae in columns G, H and I are defined using a Google Apps script that takes a school ID and then returns something linked to it in the data.gov.uk education datastore, such as the name of the school, or its total capacity.

Formulae look like:

  • =datagovuk_education(A2,"name")
  • =datagovuk_education(A2,"opendate")
  • =datagovuk_education(A2,"totcapacity")

and are defined to return a single cell element. (I haven’t worked out how to return several cells worth of content from a Google Apps Script yet!)

At the moment, the script is a little messy, taking the form:

function datagovuk_education(id,typ) {
  var ret=""; var args="";
  switch (typ) {
    case 'totcapacity':
      args= _datagovuk_education_capacity_quri(id);
      break;
    ...
    default:
      //hack something here;
  }
  // POST the URL-encoded query to SPARQLProxy and grab the XML response
  var x=UrlFetchApp.fetch('http://data-gov.tw.rpi.edu/ws/sparqlproxy.php',{method: 'post', payload: args});
  ret=x.getContentText();
  var xmltest=Xml.parse(ret);
  // pull the single literal value out of the SPARQL XML results
  ret=xmltest.sparql.results.result.binding.literal.getText();

  return ret;
}
function _datagovuk_education_capacity_quri(id){
  return "query=prefix+sch-ont%3A+%3Chttp%3A%2F%2Feducation.data.gov.uk%2Fdef%2Fschool%2F%3E%0D%0ASELECT+%3FschoolCapacity+WHERE+{%0D%0A%3Fschool+a+sch-ont%3ASchool%3B%0D%0Asch-ont%3AuniqueReferenceNumber+"+id+"%3B%0D%0Asch-ont%3AschoolCapacity+%3FschoolCapacity.%0D%0A}+ORDER+BY+DESC%28%3Fdate%29+LIMIT+1&output=xml&callback=&tqx=&service-uri=http%3A%2F%2Fservices.data.gov.uk%2Feducation%2Fsparql";
}

The datagovuk_education(id,typ) function takes the school ID and the requested property, uses the case statement to create an appropriate query string, and then fetches the data from the education datastore, returning the result in an XML format like this. The data is pulled from the datastore via Sparqlproxy, and the query string URIs are generated (at the moment) by adding the school ID number into a query string generated by running the desired SPARQL query on Sparqlproxy and then grabbing the appropriate part of the URI. (It’s early days yet on this hack!;-)
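For reference, that URL-encoded args string decodes to something rather more readable (with the school ID spliced in where the id argument goes):

query=prefix sch-ont: <http://education.data.gov.uk/def/school/>
SELECT ?schoolCapacity WHERE {
?school a sch-ont:School;
sch-ont:uniqueReferenceNumber ID;
sch-ont:schoolCapacity ?schoolCapacity.
} ORDER BY DESC(?date) LIMIT 1
&output=xml&callback=&tqx=&service-uri=http://services.data.gov.uk/education/sparql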

By defining appropriate Apps script functions, I can also create formulae to call other datastores, run queries on Google spreadsheets (e.g. in the Guardian datastore) and so on. I assume similar sorts of functionality would be supported using VB Macros in Excel?

Anyway – this is my starter for ten on how to make live datastore data available to the masses. It’ll be interesting to see whether this approach (or one like it) is used in favour of getting temps to write SPARQL queries and RDF parsers… The obvious problem is that my approach can lead to an explosion in the number of formulae and parameters you need to learn; the upside is that I think these could be quite easily documented in a matrix/linked formulae chart. The approach also scales to pulling in data from CSV stores and other online spreadsheets, using spreadsheets as a database via the =QUERY() formula (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula), and so on. There might also be a market for selling prepackaged or custom formulae as script bundles via a script store within a larger Google Apps App store.

PS I’m trying to collect example SPARQL queries that run over the various data.gov.uk end points because: a) I’m really struggling in getting my head round writing my own, not least because I struggle to make sense of the ontologies, if I can find them, and write valid queries off the back of them; even (in fact, especially) really simple training/example queries will do! b) coming up with queries that either have interesting/informative/useful results, or clearly demonstrate an important ‘teaching point’ about the construction of SPARQL queries, is something I haven’t yet got a feel for. If you’ve written any, and you’re willing to share, please post a gist to github and add a link in a comment here.

PPS utility bits, so I don’t lose track of them:
– education datastore ontology
– Apps script XML Element class

PPPS Here’s how to dump a 2D CSV table into a range of cells: Writing 2D Data Arrays to a Google Spreadsheet from Google Apps Script; Making an HTTP POST Request for CSV Data

Getting Started with data.gov.uk… or not…

Go to any of the data.gov.uk SPARQL endpoints (that’s geeky techie scary speak for places where you can run geeky techie datastore query language queries and get back what looks to the eye like a whole jumble of confusing Radical Dance Faction lyrics [in joke;-0]) and you see a search box, of sorts… Like this one on the front of the finance datastore:

So, pop pickers:

One thing that I think would make the SPARQL page easier to use would be to have a list of links, displayed down the left hand side, that would launch one of the last 10 or so queries that had run in a reasonable time and returned at least some results – so n00bs like me could at least have a chance of seeing what a successful query looked like. Appreciating that some folk might want to keep their query secret (more on this another day…;-), there should probably be a ‘tick this box to keep your query out of the demo queries listing’ option when folk submit a query.

(A more adventurous solution, but one that I’m not suggesting at the moment, might allow folk who have run a query from the SPARQL page on the data.gov.uk site to “share this query” to a database of (shared) queries. Or if you’ve logged in to the site, there might be an option of saving it as a private query.)

That is all…

PS if you have some interesting SPARQL queries, please feel free to share them below or e.g. via the link on here: Bookmarking and Sharing Open Data Queries.

PPS from @iand “shouldnt that post link to the similar http://tw.rpi.edu/weblog/2009/10/23/probing-the-sparql-endpoint-of-datagovuk/“; and here’s one from @gothwin: /location /location /location – exploring Ordnance Survey Linked Data.

PPPS for anyone who took the middle way in the vote: if there are any example queries in the comments to this post, do they help you get started at all? If you voted “what are you talking about?” please add a comment below about what you think data.gov.uk, Linked Data and SPARQL might be, and what you’d like to be able to do with them…

Too Much Information, Not Enough Data?

Trying to scope out ideas for a talk I’m due to give in Manchester next week on open civic data, I came across quite a few examples of local councils making map based “data” available – on a map. For example, Milton Keynes Council has a nice collection of links to interactive maps:

The maps look very pretty, when they load… I don’t know if it was my connection that was particularly slow when I tried to grab the screenshot, but I never did manage to load any hi-res tiles…

Milton Keynes interactive map

Manchester City Council also offers a range of map based navigation for identifying local schools:

Manchester schools maps

And if we tunnel down:

Manchester schools map

The point I wanted to make here is that what the councils are doing is more to do with displaying map based information than geocoded data. That is, it’s still hard for me to create my own map based views based on the stuff the councils are publishing.

The valuable thing for me, in the role of a developer, is flexible access to the raw data, so that I can re-present it and make sense of it in a way that I decide.

So for example, in the case of the Manchester schools data, where I might decide to throw caution to the wind and plot all the schools on one map rather than present the information in the form of schools contained in arbitrarily drawn regions, it would be nice to be able to get a raw data feed of schools under the control of the Manchester local authority…

…which I seem to remember is something I can get from the data.gov.uk education datastore.

Looking through my own hacks, I found a description of a Yahoo pipe (that appears to have rotted :-() that will return schools given a local authority code… but what is the code for Manchester?

A quick google turned up a post by Simon Hume entitled Using SPARQL & the data.gov.uk school data which points to a handy service that uses a query of the National Statistics SPARQL endpoint to list council ID codes (I’m not sure if this can be extended to return a Council ID code based on a user-supplied postcode? If you know how to do this, please post a clue in a comment to this post;-)

Simon’s post also contains another rather handy example of a SPARQL query over the education datastore that will “call back all the schools in your local authority”, including the lat/long coordinates, so they can be easily placed onto a map.

This is the data, the useful stuff. The maps the councils have published that are shown above display some useful information, for sure, but it’s not data…

So here’s a thought: suppose that where councils feel they’re adding value by producing maps like the ones shown above (and I think that sort of display can provide a valuable service on a council website), wouldn’t it be great if:

1) the data they used to create the map came from a public datastore, such as one of the datastores on data.gov.uk, or maybe a queryable datastore local to the council; and

2) as a footnote to the page, or more likely on a page linked from it, a description was given of the query used to generate the data rendered on the information page.

Just a thought…

Accessing Linked Data in Scraperwiki via YQL

A comment from @frabcus earlier today alerted me to the fact that the Scraperwiki team had taken me up on my suggestion that they make the Python YQL library available in the Scraperwiki environment, so I thought I ought to come up with an example of using it…

YQL provides a general purpose standard query interface “to the web”, interfacing with all manner of native APIs and providing a common way of querying them, and receiving responses from them. YQL is extensible too – if there isn’t a wrapper for your favourite API, you can write one yourself and submit it to the community. (For a good overview of the rationale for, and philosophy behind, YQL, see Christian Heilmann’s the Why of YQL.)

Browsing through the various community tables, I found one for handling SPARQL queries. The YQL wrapper expects a SPARQL query and an endpoint URL, and will return the results in the YQL standard form. (Here’s an example SPARQL query in the YQL developer console using the data.gov.uk education datastore.)

The YQL query format is:
select * from sparql.search where query="YOUR_SPARQL_QUERY" and service="SPARQL_ENDPOINT_URL"
and can be called in Python YQL in the following way (Python YQL usage):

import yql

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

For a couple of weeks now, I’ve been looking for an opportunity to try to do something – anything – with the newly released Ordnance Survey Linked Data (read @gothwin’s introduction to it for more details: /location /location /location – exploring Ordnance Survey Linked Data – Part 2).

One of the things the OS Linked Data looks exceedingly good for is acting as glue, mapping between different representations for geographical and organisational areas; the data can also return regions that neighbour on a region, which could make for some interesting “next door to each other” ward, district or county level comparisons.

One of the most obvious ways in to the data is via a postcode. The following Linked Data query to the Ordnance Survey SPARQL endpoint (http://api.talis.com/stores/ordnance-survey/services/sparql) returns the OS district ID, ward and district name that a postcode exists in:
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?wardname ?districtname where { <http://data.ordnancesurvey.co.uk/id/postcodeunit/MK76AA>
postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}

Here it is running in the YQL developer console:

OS Postcode query in YQL developer console

(Just by the by, we can create a query alias for that query if we want, by changing the postcode (MK76AA in the example) to @postcode. This gives us a URL argument/variable called postcode whose value gets substituted in to the query whenever we call it:

http://query.yahooapis.com/v1/public/yql/psychemedia/ospostcodelookupdemo1?postcode=INSERTPOSTCODEHERE&env=http://datatables.org/alltables.env

[Note we manually need to add the environment variable &env=http://datatables.org/alltables.env to the URL created by the query alias generator/wizard.]

YQL query alias for SPARQL query

So… that’s SPARQL in YQL – but how can we use it in Scraperwiki? The newly added YQL wrapper makes it easy… here’s an example, based on the above:

import scraperwiki
import yql

os_endpoint='http://api.talis.com/stores/ordnance-survey/services/sparql'

os_query='''
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?wardname ?districtname where {
<http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname
}
'''
postcode="MK7 6AA"

os_query=os_query.replace('MAGIC_POSTCODE',postcode.replace(' ',''))

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

result=run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print postcode,'is in the',row['result']['wardname']['value'],'ward of',row['result']['districtname']['value']
    record={ "id":postcode, "ward":row['result']['wardname']['value'],"district":row['result']['districtname']['value']}
    scraperwiki.datastore.save(["id"], record) 

I use the MAGIC_POSTCODE substitution to give me the freedom to create a procedure that will take in a postcode argument and add it in to the query. Note that I am probably breaking all sorts of Linked Data rules by constructing the URL that uniquely identifies (reifies?) the postcode in the Ordnance Survey URL namespace (that is, I construct something like <http://data.ordnancesurvey.co.uk/id/postcodeunit/MK76AA>, which contravenes the “URIs are opaque” rule that some folk advocate, but I’m a pragmatist;-)

Anyway, here’s a Scraperwiki example that scrapes a postcode from a web page, and looks up some of its details via the OS: simple Ordnance Survey Linked Data postcode lookup

The next thing I wanted to do was use two different Linked Data services. Here’s the setting. Suppose I know a postcode, and I want to lookup all the secondary schools in the council area that postcode exists in. How do I do that?

The data.gov.uk education datastore lets you look up schools in a council area given the council ID. Simon Hume gives some example queries to the education datastore here: Using SPARQL & the data.gov.uk school data. The following is a typical example:

prefix sch-ont: <http://education.data.gov.uk/def/school/>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/00MG> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <http://education.data.gov.uk/def/school/PhaseOfEducation_Secondary> .
}

Here, the secondary schools are being identified according to the district area they are in (00MG in this case).

But all I have is the postcode… Can Linked Data help me get from MK7 6AA to 00MG (or more specifically, from <http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> to <http://statistics.data.gov.uk/id/local-authority-district/00MG>?)

Here’s what the OS knows about a postcode:

What the OS knows about a postcode

If we click on the District link, we can see what the OS knows about a district:

Local authority area code lookup in OS Linked Data

The Census Code corresponds to the local council ID code used in the education datastore (thanks to John Goodwin for pointing that out…). The identifier doesn’t provide a Linked Data URI, but we can construct one out of the code value:
<http://statistics.data.gov.uk/id/local-authority-district/MAGIC_DISTRICTCODE>

(Note that the statistics.data.gov.uk lookup on the district code does include a sameAs link back to the OS identifier.)

Here’s how we can get hold of the district code – it’s the admingeo:hasCensusCode property you’re looking for:

os_query='''
PREFIX skos: <http://www.w3.org/2004/02/skos/core#>
PREFIX admingeo: <http://data.ordnancesurvey.co.uk/ontology/admingeo/>
PREFIX postcode: <http://data.ordnancesurvey.co.uk/ontology/postcode/>

select ?district ?nsdistrict ?wardname ?districtname where {
<http://data.ordnancesurvey.co.uk/id/postcodeunit/MAGIC_POSTCODE> postcode:district ?district; postcode:ward ?ward.
?district skos:prefLabel ?districtname.
?ward skos:prefLabel ?wardname .
?district admingeo:hasCensusCode ?nsdistrict.
}
'''

postcode='MK7 6AA'
os_endpoint='http://api.talis.com/stores/ordnance-survey/services/sparql'
os_query=os_query.replace('MAGIC_POSTCODE',postcode.replace(' ',''))

result=run_sparql_query(os_query, os_endpoint)

for row in result.rows:
    print row['result']['nsdistrict']['value']
    districtcode=row['result']['nsdistrict']['value']
    print postcode,'is in the',row['result']['wardname']['value'],'ward of',row['result']['districtname']['value']
    record={ "id":postcode, "ward":row['result']['wardname']['value'],"district":row['result']['districtname']['value']} 

So what does that mean? Well, we managed to look up the district code from a postcode using the Ordnance Survey API, which means we can insert that code into a lookup on the education datastore to find schools in that council area:

def run_sparql_query(query, endpoint):
    '''
    # The following string replacement construction may be handy
    query = 'select * from flickr.photos.search where text=@text limit 3';
    y.execute(query, {"text": "panda"})
    '''
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

edu_endpoint='http://services.data.gov.uk/education/sparql'    

edu_query='''
prefix sch-ont:  <http://education.data.gov.uk/def/school/>

SELECT ?name ?reference ?date WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:uniqueReferenceNumber ?reference ;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/MAGIC_DISTRICTCODE> ;
sch-ont:openDate ?date ;
sch-ont:phaseOfEducation <http://education.data.gov.uk/def/school/PhaseOfEducation_Secondary>.
}
'''
districtcode='00MG'
edu_query=edu_query.replace('MAGIC_DISTRICTCODE',districtcode)
result=run_sparql_query(edu_query, edu_endpoint)
for row in result.rows:
    for school in row['result']:
        print school['name']['value'],school['reference']['value'],school['date']['value']
        record={ "id":school['reference']['value'],"name":school['name']['value'],"openingDate":school['date']['value']}
        scraperwiki.datastore.save(["id"], record)

Here’s a Scraperwiki example showing the two separate Linked Data calls chained together (click on the “Edit” tab to see the code).

Linked Data in Scraperwiki

Okay – so that was easy enough (?!;-). We’ve seen:
– that Scraperwiki supports calls to YQL;
– how to make SPARQL/Linked Data queries from Scraperwiki using YQL;
– how to get data from one Linked Data query and use it in another.

A big problem though is how do you know whether there is a linked data path from a data element in one Linked Data store (e.g. from a postcode lookup in the Ordnance Survey data) through to another datastore (e.g. district area codes in the education datastore), when you are a mere mortal and not a Linked Data guru?! Answers on the back of a postcard, please, or via the comments below;-)

PS whilst doing a little digging around, I came across some geo-referencing guidance on the National Statistics website that suggests that postcode areas might change over time (they also publish current and previous postcode info). So what do we assume about the status (currency, validity) of the Ordnance Survey postcode data?

PPS Just by the by, this may be useful to folk looking for Linked Data context around local councils: @pezholio’s First steps to councils publishing their own linked data

data.open.ac.uk Linked Data Now Exposing Module Information

As HE becomes more and more corporatised, I suspect we’re going to see online supermarkets appearing that help you identify – and register on – degree courses in exchange for an affiliate/referral fee from the university concerned. For those sites to appear, they’ll need access to course catalogues, of course. UCAS currently holds the most comprehensive one that I know of, but it’s a pain to scrape and all but useless as a datasource. But if the universities publish course catalogue information themselves in a clean way (and ideally, a standardised way), it shouldn’t be too hard to construct aggregation sites ourselves…

So it was encouraging to see earlier this week an announcement that the OU’s data.open.ac.uk site has started publishing module data from the course catalogue – that is, data about the modules (as we now call them – they used to be called courses) that you can study with the OU.

The data includes various bits of administrative information about each module, the territories it can be studied in, and (most importantly?!) pricing information;-)

data.open.ac.uk - module data

You may remember that the data.open.ac.uk site itself launched a few weeks ago with the release of Linked Data sets including data about deposits in the open repository, as well as OU podcasts on iTunes (data.open.ac.uk Arrives, With Linked Data Goodness). Where podcasts are associated with a course, the magic of Linked Data means that we can easily get to the podcasts via the course/module identifier:

data.open.ac.uk

It’s also possible to find modules that bear an isSimilarTo relation to the current module, where isSimilarTo means (I think?) “was also studied by students taking this module”.

As an example of how to get at the data, here’s a Python script using the Python YQL library that lets me run a SPARQL query over the data.open.ac.uk course module data (the code includes a couple of example queries):

import yql

def run_sparql_query(query, endpoint):
    y = yql.Public()
    query='select * from sparql.search where query="'+query+'" and service="'+endpoint+'"'
    env = "http://datatables.org/alltables.env"
    return y.execute(query, env=env)

endpoint='http://data.open.ac.uk/query'

# This query finds the identifiers of postgraduate technology courses that are similar to each other
q1='''
select distinct ?x ?z from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z
} limit 10
'''

# This query finds the names and course codes of 
# postgraduate technology courses that are similar to each other
q2='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/technology>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

# This query finds the names and course codes of 
# postgraduate courses that are similar to each other
q3='''
select distinct ?code1 ?name1 ?code2 ?name2 from <http://data.open.ac.uk/context/course> where {
?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name1.
?x <http://purl.org/goodrelations/v1#isSimilarTo> ?z.
?z <http://courseware.rkbexplorer.com/ontologies/courseware#has-title> ?name2.
?x <http://purl.org/vocab/aiiso/schema#code> ?code1.
?z <http://purl.org/vocab/aiiso/schema#code> ?code2.
}
'''

result=run_sparql_query(q3, endpoint)

for row in result.rows:
	for r in row['result']:
		print r

I’m not sure what purposes we can put any of this data to yet, but for starters I wondered just how connected the various postgraduate courses are based on the isSimilarTo relation. Using q3 from the code above, I generated a Gephi GDF/network file using the following snippet:

# Generate a Gephi GDF file showing connections between 
# modules that are similar to each other
fname='out2.gdf'
f=open(fname,'w')

f.write('nodedef> name VARCHAR, label VARCHAR, title VARCHAR\n')
ccodes=[]
for row in result.rows:
	for r in row['result']:
		if r['code1']['value'] not in ccodes:
			ccodes.append(r['code1']['value'])
			f.write(r['code1']['value']+','+r['code1']['value']+',"'+r['name1']['value']+'"\n')
		if r['code2']['value'] not in ccodes:
			ccodes.append(r['code2']['value'])
			f.write(r['code2']['value']+','+r['code2']['value']+',"'+r['name2']['value']+'"\n')
		
f.write('edgedef> c1 VARCHAR, c2 VARCHAR\n')
for row in result.rows:
	for r in row['result']:
		#print r
		f.write(r['code1']['value']+','+r['code2']['value']+'\n')

f.close()

to produce the following graph. (Size is out degree, colour is in degree. Edges go from ?x to ?z. Layout: Fruchterman Reingold, followed by Expansion.)

OU postgrad courses in gephi
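For reference, the GDF file the script writes is just plain text of this shape – the module codes and titles here are made up for illustration:

nodedef> name VARCHAR, label VARCHAR, title VARCHAR
X801,X801,"Example module one"
X802,X802,"Example module two"
edgedef> c1 VARCHAR, c2 VARCHAR
X801,X802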

The layout style is a force directed algorithm, which in this case has had the effect of picking out various clusters of highly connected courses (so for example, the E courses are clustered together, as are the M courses, B courses, T courses and so on.)

If we run the ego filter over this network on a particular module code, we can see which modules were studied alongside it:

ego filter on course codes

Note that in the above diagram, the nodes are sized/coloured according to in-degree/out-degree in the original, complete graph. If we re-calculate those measures on just this partition, we get the following:

Recoloured course network

If we return to the whole network, and run the Modularity class statistic, we can identify several different course clusters:

Modules - modularity class

Here’s one of them expanded:

A module cluster

Here are some more:

Course clusters

I’m not sure what use any of this is, but if nothing else, it shows there’s structure in that data (which is exactly what we’d expect, right?;-)

PS as to how I wrote my first query on this data, I copied the ‘postgraduate modules in computing’ example query from data.open.ac.uk:

http://data.open.ac.uk/query?query=select%20distinct%20%3Fx%20from%20%3Chttp://data.open.ac.uk/context/course%3Ewhere%20{%3Fx%20a%20%3Chttp://purl.org/vocab/aiiso/schema%23Module%3E.%0A%3Fx%20%3Chttp://data.open.ac.uk/saou/ontology%23courseLevel%3E%20%3Chttp://data.open.ac.uk/saou/ontology%23postgraduate%3E.%0A%3Fx%20%3Chttp://purl.org/dc/terms/subject%3E%20%3Chttp://data.open.ac.uk/topic/computing%3E%0A}%0A&limit=200

and pasted it into a tool that “unescapes” encoded URLs, which decodes the SPARQL query:

Unescaping text
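The same unescaping is a one-liner in Python, if you’d rather not rely on an online tool:

from urllib.parse import unquote

print(unquote("select%20distinct%20%3Fx%20from%20%3Chttp://data.open.ac.uk/context/course%3E"))
# select distinct ?x from <http://data.open.ac.uk/context/course>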

I was then able to pull out the example query:
select distinct ?x from <http://data.open.ac.uk/context/course>
where {?x a <http://purl.org/vocab/aiiso/schema#Module>.
?x <http://data.open.ac.uk/saou/ontology#courseLevel> <http://data.open.ac.uk/saou/ontology#postgraduate>.
?x <http://purl.org/dc/terms/subject> <http://data.open.ac.uk/topic/computing>
}

Just by the by, there’s a host of other handy text tools at Text Mechanic.