Accessing Government Education Data in Scraperwiki via the Edubase/Education Datastore API

There’s lots of education data out there, but do we always need to scrape it from LEA websites? How can we easily access the data that’s in the central government datastore, and bring it into an environment we’re happy working from?

Although lots of school data has been made available for some time now, its publication as Linked Data means there’s a considerable barrier to entry in terms of functional access to, and use of, the data. (For an example of how to interrogate the Edubase Linked Data API in Scraperwiki, see Accessing Linked Data in Scraperwiki via YQL.) As an attempt to mask some of the horrors of SPARQL from mortal users, @jenit et al worked on a simpler webservice that meant you could access the data.gov.uk education datastore via a “friendly”, human readable URL, such as the following:

  • http://services.data.gov.uk/education/api/school/constituency-name/Horsham: list of schools within the constituency of Horsham
  • /education/api/school/local-authority-name/{la}: schools by local authority name
  • /education/api/school/district/{district}: schools by district ID (I think? I’m not sure exactly what IDs these might be)
  • /education/api/school/area/{minLat},{minLong};{maxLat},{maxLong}: schools within a particular geographical area, as described by a latitude/longitude bounding box.

There’s a wealth of data that can be returned about a school, so various views over the data are also supported using a URL variable (for example, ?_view=provision or &_view=performance):

  • short – shows very basic information
  • medium – shows a few more fundamental details about the school, such as its address
  • provision – describes the kind and number of children that they take
  • location – describes where the school is
  • performance – gives information related to their performance
  • admin – gives administrative information
  • all – gives you everything that’s known about each school

If you know any particular data attributes you want to further filter the results on, they can be specified literally. For example, the following (far from complete) list of attributes gives some idea of what’s possible, this time passed via explicit URL args (a sketch showing how views and filters might be combined into a single API call follows this list):

  • ?nurseryProvision=true
  • &gender.label=Girls
  • ofstedSpecialMeasures=true
  • for searching number ranges, the min- and max- prefixes may be applied to certain parameters. For example: &max-statutoryHighAge=10 searches for schools where statutoryHighAge<=10
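
As a quick illustration, here’s a minimal sketch (not part of the original demo) showing how a _view and a couple of filters might be combined into a single API call; the path, view name and filter parameters are all lifted from the examples above, so treat it as a starting point rather than a definitive recipe:

import urllib
import simplejson

base='http://services.data.gov.uk/education/api/'
path='school/constituency-name/Horsham'
# _view selects how much detail comes back; the other args filter the results
params=['_view=provision','nurseryProvision=true','max-statutoryHighAge=10']

url=base+path+'.json?'+'&'.join(params)
data=simplejson.load(urllib.urlopen(url))['result']
for item in data['items']:
    print item.get('label')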

Jeni did a great write up of the API at A Developers’ Guide to the Linked Data APIs – Jeni Tennison (which I cribbed heavily from in the above;-). You can find a full overview of the education API documentation here: Linked Data API Configuration APIs: Edubase API

So… how can we use this in Scraperwiki? Here’s a demo:

import simplejson
import urllib
import scraperwiki

#------- USER SETTINGS ------
# Original API documentation at: http://services.data.gov.uk/education/api/api-config#schools
# Original blog post by @jenit describing the API used: http://data.gov.uk/blog/guest-post-developers-guide-linked-data-apis-jeni-tennison
# Original blog post describing this Scraperwiki page: https://blog.ouseful.info/2010/11/03/accessing-government-education-data-in-scraperwiki-via-the-edubaseeducation-datastore-api/

# The main query
eduPath='school/constituency-name/Horsham'

# Filters, as a list:
eduFilters=['min-statutoryHighAge=7','max-statutoryHighAge=10']

# _views - not considered yet...

# key and label data is displayed in the console for each result, and added to the Scraperwiki database
# keys are the top level attributes we want to display. For a result item, display each item[key]
keys=['establishmentNumber','label']

# labels are used to display labels of top level items, e.g. item[label]['label']
labels=['typeOfEstablishment','phaseOfEducation']
# Note, if you have item[path][wherever][label], or deeper down a path, we don't handle that (yet?!)

# The school ID will always be added to the Scraperwiki database (it's the database ID for a record).
# If latitude/longitude data is available, it will also be added to the database.


# Note that the script doesn't yet handle multiple pages of results either...

#-------------------------- 
  
# This function displays the results, and also adds results to the Scraperwiki database.
# We always look for school ID (this is the table ID) and latlng for mapping, if that data exists
def printDetails(item,keys=['establishmentNumber','label'],labels=[]):
    txt=[]
    record={}
    for key in keys:
        if key in item:
            txt.append(str(item[key]))
            record[key]=item[key]
        else:
            record[key]=''
    if 'establishmentNumber' not in keys:
        record['establishmentNumber']=item['establishmentNumber']
    for attribute in labels:
        if attribute in item:
            txt.append(item[attribute]['label'])
            record[attribute]=item[attribute]['label']
        else:
            record[attribute]=''
    if 'lat' in item:
        latlng=(item['lat'],item['long'])
        scraperwiki.datastore.save(["establishmentNumber"], record,latlng=latlng)
    else:
        scraperwiki.datastore.save(["establishmentNumber"], record)
    print ', '.join(txt)    
    
    
# This is where we construct the Edubase Linked Data API URL, and then call it, returning JSON
# Need to find a way of handling results spread over several results pages
data=simplejson.load(urllib.urlopen('http://services.data.gov.uk/education/api/'+eduPath+'.json'+'?'+'&'.join(eduFilters)))['result']
items=data["items"]

for item in items:
    printDetails(item,keys,labels)
    print item

You can find the code running on Scraperwiki here: ouseful scraperwiki – playing with Education datastore API

Here’s an example of what gets put in the Scraperwiki database:

Example scraperwiki datatable - education datastore API

Hopefully what this demo does is show how you can start exploring the Education datastore in Scraperwiki without having to do too much. More explanation/guidance, or at least further examples, are required in order to demonstrate:
– the construction of valid “eduPath” statements, if possible showing how they can reuse identifier codes from other sources;
– the use of different _views, and maybe handlers for those views that add all the data to the Scraperwiki database automagically;
– how to inspect returned results so you can identify what keys and labels can be used from a result when you want to construct your own Scraperwiki database records;
– handlers for data further down the result item path (i.e. not just handlers for item[key] and item[label]['label'], but also item[here][there], item[here][there][everywhere]['label'] etc.);
– results are only pulled back from the first page of results; we need to find some way of handling results spread over multiple pages, maybe capping the total number of results. (Maybe the tweepy Cursor code could be reused for this? A rough sketch of one possible approach follows this list.)
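
By way of illustration, here’s an untested sketch of one way the paging might be handled, assuming the endpoint supports the generic Linked Data API _page argument (an assumption I haven’t verified against the live service):

# Untested sketch: page through results, assuming a _page URL argument is supported
def fetchAllItems(eduPath,eduFilters,maxPages=10):
    items=[]
    for page in range(maxPages):
        url='http://services.data.gov.uk/education/api/'+eduPath+'.json?'+'&'.join(eduFilters+['_page='+str(page)])
        data=simplejson.load(urllib.urlopen(url))['result']
        pageItems=data.get('items',[])
        if not pageItems: break
        items.extend(pageItems)
    return items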

Library Location Data on data.gov.uk

Where do I find a list of UK public libraries so that I can:
a) plot them on a map;
b) search for them by location (e.g. within five miles of a particular postcode)?

For UK Online Centres, it’s easy enough (5 Minute Hack – UK Centres Online Map); and there’s the start of a hack for plotting out the locations of all HEI libraries via SCONUL data (UK HE Libraries map); but when I asked the lazyweb this question with respect to public libraries at the end of last week, the best response I got back was to use Librarything (Libraries Near Me Map (Courtesy of LibraryThing)).

So how about data.gov.uk? Has any library data been logged there? Searching for library didn’t turn up much at all, though libraries was a little more successful; here’s the result for library OR libraries

Data.gov.uk - libraries

More specifically, at the current time, there are the following relevant datasets:

Libraries
A list of libraries in Warwickshire, including contact details Overview … Tags warwickshire-county-council libraries warwickshire Department … Council Wiki Libraries Resources Other XML …

Sunderland Libraries
List of all Libraries in Sunderland. Overview … Tags local-authority-services libraries sunderland Department … Wiki Sunderland Libraries Resources …

East Staffordshire Libraries
A list of the libraries in East Staffordshire. Overview … – Tags libraries east-staffordshire Department … Wiki East Staffordshire Libraries Resources Other XML …

Bournemouth Libraries
Details of Bournemouth Libraries, including location, opening times Overview … Tags bournemouth-borough-council libraries local-authority bournemouth … Wiki Bournemouth Libraries Resources …

Hampshire County Council Recreation Sites
… gardens reserve arts country-park library nature Department … Hampshire County Council, hampshire-county-council, library, museum, nature, reserve) …

The data is provided in a range of formats (I think I really need to do a set of recipes for handling each type..?):

East Staffordshire Libraries provide a KML file, the URL of which can be pasted directly into a Google Maps search box to generate the corresponding map view (and hence an embeddable version etc.)

East Staffs Libraries

Sunderland libraries use another recognised XML format, the ESRI ArcGIS 9.2 XML representation:

Sunderland libraries - ESRI

Warwickshire libraries look to be using their own brand of XML, giving name, structured address, phone, fax, email, web address, co-ordinates (a lat, long pair) and a photo:

Warwickshire libraries

Bournemouth Libraries publish their data as CSV including name, address, phone, email, an empty A column (A HREF=…? ie a slot for URLs?), and opening times by day of week; the geodata is provided as eastings and northings:

Bournemouth libraries CSV
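
As an aside, to get the Bournemouth data onto the same footing as the lat/long based datasets, the eastings/northings need converting to WGS84 latitude/longitude. Here’s a rough sketch of one way of doing that in Python, assuming the pyproj library is available (the example co-ordinates are made up):

# Rough sketch: convert OSGB36 eastings/northings to WGS84 lat/long using pyproj
from pyproj import Proj, transform

osgb36=Proj(init='epsg:27700') # British National Grid
wgs84=Proj(init='epsg:4326')   # lat/long as used by most mapping tools

def en_to_latlong(easting,northing):
    lon,lat=transform(osgb36,wgs84,easting,northing)
    return lat,lon

print en_to_latlong(411000,91500) # made-up example co-ordinates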

Hampshire County Council list their libraries in the context of recreational sites (cultural and leisure services is another description that often gets used, I think?) as a CSV file, with headers Name,Address,Town,Phone,Postcode,,Lat/long,URL,Short description,Tags. (NB there is an unlabeled column in there – the lat/long column header should be split over two columns – one for lat, one for long); libraries are identified from the description column (Local Library) and from the tag column (library).

Hampshire recreation services
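
To give a flavour of what working with one of these files might look like, here’s a hedged Python sketch that picks out the library rows from the Hampshire CSV; the URL is a placeholder, and the column positions are guesses based on the header row described above (remembering that the Lat/long header actually sits over two data columns):

# Hedged sketch: pull library rows out of the Hampshire recreation sites CSV
import csv
import urllib

CSV_URL='http://example.com/hampshire-recreation-sites.csv' # placeholder URL

reader=csv.reader(urllib.urlopen(CSV_URL))
header=reader.next() # skip the (slightly broken) header row
for row in reader:
    name,tags=row[0],row[-1]
    lat,lng=row[6],row[7] # assumes "Lat/long" spans these two columns
    if 'library' in tags.lower():
        print name,lat,lng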

So – what can we say? Five data sets describing the location of libraries from five councils using five different representations. Which is best? Which is easiest to generate? Which is easier to use, for what, by whom?

I think I need a couple more posts on this topic – one looking at ways of actually getting a quick map based preview of the data in the data sets, one looking at the issues involved with generating and publishing the data on the council side. (If any council developers read this, I’d appreciate your comments about how you would go about generating these data files and what issues there are wrt getting the data out in different formats.)

PS One of the many map related things I haven’t tried yet (the following was one of the things I wanted to do had I made it to today’s Chips and Mash Mashed Library event) is to have a play with the Google (yes, I know… sorry…) Maps data api, which allows you to import a geodata set as KML or CSV data, and then use the API “to filter a large set of features by radius, bounding box, or text attributes, and sort them by their distance from a location”. I’m not sure what the licensing is around the scraped SCONUL library location data, but it’d be a good candidate for a practical, proof-of-concept play with the Google Maps data API… It would also be interesting to see how well it copes with the two flavours of CSV (one using lat/long, one with eastings/northings) described above…

Getting Started with data.gov.uk… or not…

Go to any of the data.gov.uk SPARQL endpoints (that’s geeky techie scary speak for places where you can run geeky techie datastore query language queries and get back what looks to the eye like a whole jumble of confusing Radical Dance Faction lyrics [in joke;-0]) and you see a search box, of sorts… Like this one on the front of the finance datastore:

So, pop pickers:

One thing that I think would make the SPARQL page easier to use would be to have a list of links that would launch one of the last 10 or so queries that had run in a reasonable time and returned at least some results, displayed down the left hand side – so n00bs like me could at least have a chance at seeing what a successful query looked like. Appreciating that some folk might want to keep their query secret (more on this another day…;-), there should probably be a ‘tick this box to keep your query out of the demo queries listing’ option when folk submit a query.

(A more adventurous solution, but one that I’m not suggesting at the moment, might allow folk who have run a query from the SPARQL page on the data.gov.uk site “share this query” to a database of (shared) queries. Or if you’ve logged in to the site, there may be an option of saving it as a private query.)

That is all…

PS if you have some interesting SPARQL queries, please feel free to share them below or e.g. via the link on here: Bookmarking and Sharing Open Data Queries.

PPS from @iand “shouldnt that post link to the similar http://tw.rpi.edu/weblog/2009/10/23/probing-the-sparql-endpoint-of-datagovuk/“; and here’s one from @gothwin: /location /location /location – exploring Ordnance Survey Linked Data.

PPPS for anyone who took the middle way in the vote, then if there are any example queries in the comments to this post, do they help you get started at all? If you voted “what are you talking about?” please add a comment below about what you think data.gov.uk, Linked Data and SPARQL might be, and what you’d like to be able to do with them…

Using Data From Linked Data Datastores the Easy Way (i.e. in a spreadsheet, via a formula)

Disclaimer: before any Linked Data purists say I’m missing the point about what Linked Data is, does, or whatever, I don’t care, okay? I just don’t care. This is for me, and me is someone who can’t get to grips with writing SPARQL queries, can’t stand the sight of unreadable <rdf> <all:over the=”place”>, can’t even work out how to find what things are queryable in a Linked Data triple store, let alone write queries that link data from one data store with data from another data store (or maybe SPARQL can’t do that yet? Not that I care about that either, because I can, in Yahoo Pipes, or Google Spreadsheets, and in a way that’s meaningful to me…)

In Mulling Over =datagovukLookup() in Google Spreadsheets, I started wondering about whether or not it would be useful to be able to write formulae to look up “live facts” in various datastores from within a spreadsheet (you know, that Office app that is used pretty much universally in the workplace whenever there is tabular data to hand. That or Access of course…)

Anyway, I’ve started tinkering with how it might work, so now I can do things like this:

The formulae in columns G, H and I are defined according to a Google Apps script, that takes a school ID and then returns something linked to it in the data.gov.uk education datastore, such as the name of the school, or its total capacity.

Formulae look like:

  • =datagovuk_education(A2,”name”)
  • =datagovuk_education(A2,”opendate”)
  • =datagovuk_education(A2,”totcapacity”)

and are defined to return a single cell element. (I haven’t worked out how to return several cells worth of content from a Google Apps Script yet!)

At the moment, the script is a little messy, taking the form:

function datagovuk_education(id,typ) {
  var ret=""; var args=""
  switch (typ) {
    case 'totcapacity':
      args= _datagovuk_education_capacity_quri(id);
      break;
    ...
    default:
      //hack something here;
  }
  var x=UrlFetchApp.fetch('http://data-gov.tw.rpi.edu/ws/sparqlproxy.php',{method: 'post', payload: args});
  var ret=x.getContentText();
  var xmltest=Xml.parse(ret);
  ret=xmltest.sparql.results.result.binding.literal.getText();

  return ret;
}
function _datagovuk_education_capacity_quri(id){
  return "query=prefix+sch-ont%3A+%3Chttp%3A%2F%2Feducation.data.gov.uk%2Fdef%2Fschool%2F%3E%0D%0ASELECT+%3FschoolCapacity+WHERE+{%0D%0A%3Fschool+a+sch-ont%3ASchool%3B%0D%0Asch-ont%3AuniqueReferenceNumber+"+id+"%3B%0D%0Asch-ont%3AschoolCapacity+%3FschoolCapacity.%0D%0A}+ORDER+BY+DESC%28%3Fdate%29+LIMIT+1&output=xml&callback=&tqx=&service-uri=http%3A%2F%2Fservices.data.gov.uk%2Feducation%2Fsparql";
}

The datagovuk_education(id,typ) function takes the school ID and the requested property, uses the case statement to create an appropriate query string, and then fetches the data from the education datastore, returning the result in an XML format like this. The data is pulled from the datastore via Sparqlproxy, and the query string URIs generated (at the moment) by adding the school ID number into a query string generated by running the desired SPARQL query on Sparqlproxy and then grabbing the appropriate part of the URI. (It’s early days yet on this hack!;-)
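
For readability, URL-decoding the query string built by _datagovuk_education_capacity_quri shows the underlying SPARQL (the school ID passed into the function gets spliced in where ID appears below), with output=xml and a service-uri pointing at http://services.data.gov.uk/education/sparql:

prefix sch-ont: <http://education.data.gov.uk/def/school/>
SELECT ?schoolCapacity WHERE {
?school a sch-ont:School;
sch-ont:uniqueReferenceNumber ID;
sch-ont:schoolCapacity ?schoolCapacity.
} ORDER BY DESC(?date) LIMIT 1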

By defining appropriate Apps script functions, I can also create formulae to call other datastores, run queries on Google spreadsheets (e.g. in the Guardian datastore) and so on. I assume similar sorts of functionality would be supported using VB Macros in Excel?

Anyway – this is my starter for ten on how to make live datastore data available to the masses. It’ll be interesting to see whether this approach (or one like it) is used in favour of getting temps to write SPARQL queries and RDF parsers… The obvious problem is that my approach can lead to an explosion in the number of formulae and parameters you need to learn; the upside is that I think these could be quite easily documented in a matrix/linked formulae chart. The approach also scales to pulling in data from CSV stores and other online spreadsheets, using spreadsheets as a database via the =QUERY() formula (e.g. Using Google Spreadsheets Like a Database – The QUERY Formula), and so on. There might also be a market for selling prepackaged or custom formulae as script bundles via a script store within a larger Google Apps App store.

PS I’m trying to collect example SPARQL queries that run over the various data.gov.uk end points because: a) I’m really struggling in getting my head round writing my own, not least because I struggle to make sense of the ontologies, if I can find them, and write valid queries off the back of them; even (in fact, especially) really simple training/example queries will do! b) coming up with queries that either have interesting/informative/useful results, or clearly demonstrate an important ‘teaching point’ about the construction of SPARQL queries, is something I haven’t yet got a feel for. If you’ve written any, and you’re willing to share, please post a gist to github and add a link in a comment here.

PPS utility bits, so I don’t lose track of them:
– education datastore ontology
– Apps script XML Element class

PPPS Here’s how to dump a 2D CSV table into a range of cells: Writing 2D Data Arrays to a Google Spreadsheet from Google Apps Script Making an HTTP POST Request for CSV Data

Visualising Traffic Count Data from transport.data.gov.uk

In a couple of posts last year (Hackable SPARQL Queries: Parameter Spotting Tutorial and First Dabblings With Pipelinked Linked Data) I started to explore how it might be possible to use Yahoo Pipes as an environment for sharing – and chaining together (albeit inefficiently) – queries to the data.gov.uk open transport data datastore.

Those posts concentrated on querying the datastore in order to find the location of traffic monitoring points according to various search criteria. In this post, I’ll show you one way of visualising traffic count data from a traffic count point using Many Eyes Wikified.

The first thing we need to do is come up with a query that will pull traffic monitoring data back from the transport datastore. My first point of call for finding a query to get me started is usually to search over the data.gov.uk Google group archive in my mailbox. As ever, @jenit had posted a ‘getting started’ solution:-)
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX traffic: <http://transport.data.gov.uk/0/ontology/traffic#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>

SELECT ?direction (SUM(?value) AS ?total)
WHERE {
<http://transport.data.gov.uk/0/id/traffic-count-point/7422>
traffic:count ?count .
?count a traffic:Count ;
traffic:category <http://transport.data.gov.uk/0/category/bus> ;
traffic:direction ?direction ;
traffic:hour ?hour ;
rdf:value ?value .
FILTER (
REGEX(str(?hour), "^2008-")
)
}
GROUP BY ?direction

I tweaked this a little (using guesswork and pattern matching, rather than understanding, Chinese Room style;-) to come up with a tweaked query that appears to pull out traffic count data for different categories of vehicle on a particular day from a particular monitoring point:

SELECT ?vehicle ?direction ?hour (SUM(?value) AS ?total)
WHERE {
<http://transport.data.gov.uk/0/id/traffic-count-point/7422>
traffic:count ?count .
?count a traffic:Count ;
traffic:category ?vehicle ;
traffic:direction ?direction ;
traffic:hour ?hour ;
rdf:value ?value .
FILTER (
REGEX(str(?hour), "^2008-05-02T") )
}
GROUP BY ?vehicle ?direction ?hour

I then turned this into a Yahoo Pipes query block, using the date and traffic monitoring point as input parameters:

Here’s how to do that:

Having got the query into the pipes ecosystem, I knew I should be able to get the data out of the pipe as JSON data, or as a CSV feed, which could then be wired into other web pages or web applications. However, to get the CSV output working, it seemed like I needed to force some column headings by defining attributes within each feed item:

To tidy up the output a little bit more, we can sort it according to time and day and traffic count by vehicle type:

It’s then easy enough to grab the CSV output of the pipe (grab the RSS or JSON URI and just change the &_render=rss or &_render=json part of the URI to &_render=csv) and wire it into somewhere else – such as into Many Eyes Wikified:
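
If you’d rather consume the CSV programmatically than via Many Eyes, here’s a hedged Python sketch of what that might look like; the pipe ID and the input parameter names are placeholders, since they depend on how the pipe was actually set up:

# Hedged sketch: pull the CSV rendering of the pipe's output into Python
import csv
import urllib

PIPE_ID='YOUR_PIPE_ID_HERE' # placeholder - use your own pipe's ID
# parameter names (countpoint, date) are assumptions about the pipe's inputs
url='http://pipes.yahoo.com/pipes/pipe.run?_id='+PIPE_ID+'&_render=csv&countpoint=7422&date=2008-05-02'

for row in csv.reader(urllib.urlopen(url)):
    print row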

Doing a couple of quick views over the data in Many Eyes Wikified, it seemed as if there was some duplication of counting, in that the number of motor vehicles appeared to be the sum of a number of more specific vehicle types:

Data in Many Eyes Wikified

Looking at the car, van, bus, HGV and motorbike figures we get:

Summed totals

So I made the judgement call to remove the possibly duplicate motor-vehicle data from the data feed and reimported the data into Many Eyes Wikified (by adding some nonsense characters (e.g. &bleurghh) to the feed URI so that Many Eyes thought it was a new feed).

It was then easy enough to create some interactive visualisations around the traffic data point. So, for example, here we have a bubble chart:

Do you spot anything about traffic flow going North versus South at 8am compared to 5pm?

Let’s explore that in a little more detail with a matrix chart:

Traffic data matrix chart http://wikified.researchlabs.ibm.com/ousefuldatagovuk/TrafficDataPoints:TrafficBubble

This shows us the relative counts for different vehicle types, again by time of day. Notice the different distribution of van traffic compared to car traffic over the course of the day.

A treemap gives a slightly different take on the same data – again, we can see how there is a difference between North and South flowing volumes at different times of day within each category:

One thing that jumps out at me from the treemap is how symmetrical everything seems to be at noon?!

All the above visualisations are interactive, so click through on any of the images to get to the interactive version (Java required).

As to how to find traffic monitoring point IDs – try this.

PS a disadvantage of the above recipe is that to generate a visualisation for a different traffic point, I’d need to use the desired parameters when grabbing the CSV feed from the pipe, and then create new Many Eyes Wikified data pages and visualisation pages. However, using nothing more than a couple of web tools, I have managed to prototype a working mockup of a visualisation dashboard for traffic count data that could be given to a developer as a reference specification for a “proper” application. And in the meantime, it’s still useful as a recipe… maybe?

PPS While I was playing with this over the weekend, it struck me that if school geography projects ever do traffic monitoring surveys, it’s now possible for them to get hold of “real” data. If there are any school geography teachers out there who’d like to bounce around ways of making this data useful in a school context, please get in touch via a comment below :-)

First Dabblings With Pipelinked Linked Data

One of the promises of the Linked Data lobby is the ability to combine data from different datasets that share common elements, although this ability is not limited to Linked Data (see, for example, Mash/Combining Data from Three Separate Sources Using Dabble DB). In this post, I’ll describe a quick experiment in using Yahoo Pipes to combine data from two different data sources and briefly consider the extent to which plug’n’play data can lower the barriers to entry for exploring the potential of Linked Data.

The datasets I’ll join are both data.gov.uk Linked Data datastores – the transport datastore and the Edubase/Education datastore. The task I’ve set myself is to look for traffic monitoring points in the vicinity of one or more schools and to produce a map that looks something like this:

So to get started, let’s grab a list of schools… The Talis blog post SPARQLing data.gov.uk: Edubase Data contains several example queries over the education datastore. The query I’ll use is derived trivially from one of those examples; in particular, it grabs the name and location of the two newest schools in the UK:
prefix sch-ont: <http://education.data.gov.uk/def/school/>
prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?school ?name ?date ?lat ?long WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:openDate ?date;
geo:lat ?lat;
geo:long ?long.
} ORDER BY DESC(?date) LIMIT 2

Pasting the query into the SPARYQL pipe map previewer shows a couple of points on a map, as expected.

So how can we look for traffic monitoring points located in the same area as a school? One of the big problems I have with Linked Data is finding out what the shared elements are between data sets (I don’t have a rule of thumb for doing this yet) so it’s time for some detective work – looking through example SPARQL queries on the two datasets, ploughing through the data.gov.uk Google group, and so on. Searching based on lat/long location data, e.g. within a bounding box, is one possibility, but it’d be neater, to start with at least, to try to use a shared “area”, such as the same parish, or other common administrative area.

After some digging, here’s what I came up with: this snippet from a post to the data.gov.uk Google group relating to the transport datastore:
#If you’re prepared to search by (local authority) area instead of by a bounding box,
….
geo:long ?long ;
<http://geo.data.gov.uk/0/ontology/geo#area> <http://geo.data.gov.uk/0/id/area/00DA>;
traffic:count ?count .

and this one from the aforementioned Talis Edubase post relating to the education datastore:
prefix sch-ont: <http://education.data.gov.uk/def/school/>
SELECT ?name ?lowage ?highage ?capacity ?ratio WHERE {
?school a sch-ont:School;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/00HA> .

The similar format of the area codes, and the similarity in language (“prepared to search by (local authority) area” and “id/local-authority-district/”), suggest to me that these two things actually refer to the same thing (I asked @jenit … it seems they do…)

So, here’s a recipe for searching for traffic monitoring locations in the same local authority district as a recently opened school. Firstly, modify the SPARQL query shown above so that it also returns the local authority area:

SELECT ?school ?name ?date ?district ?lat ?long WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:openDate ?date;
sch-ont:districtAdministrative ?district;
geo:lat ?lat;
geo:long ?long.
} ORDER BY DESC(?date) LIMIT 2

The result looks something like this:

Secondly, construct a test query on the transport datastore (http://services.data.gov.uk/transport/sparql) to pull out traffic monitoring points, along with their locations, using a local area URI as the search key:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX traffic: <http://transport.data.gov.uk/0/ontology/traffic#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX area: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE
{ ?point a traffic:CountPoint ;
geo:lat ?lat ;
geo:long ?long ;
<http://geo.data.gov.uk/0/ontology/geo#area> <http://geo.data.gov.uk/0/id/area/00CG>. }

We can create a pipe based around this query that takes an administrative area identifier, runs the query through a SPARYQL (SPARQL and YQL) pipe, and returns the traffic monitoring points in that area:

The regular expression block is a hack used to put the region identifier into the form that is required by the transport endpoint if it is passed in using the form required by the education datastore.
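
In Python terms, that rewrite amounts to something like the following sketch; the URI patterns are taken from the two queries above, though I haven’t checked that every local authority code can be mapped across this way:

# Rough Python equivalent of the pipe's regular expression hack
import re

def district_to_area(district_uri):
    # e.g. http://statistics.data.gov.uk/id/local-authority-district/00HA -> 00HA
    code=re.search(r'local-authority-district/(\w+)$',district_uri).group(1)
    return 'http://geo.data.gov.uk/0/id/area/'+code

print district_to_area('http://statistics.data.gov.uk/id/local-authority-district/00HA')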

Now we’re going to take results from the recent schools query and then look up the traffic monitoring points in that area via the pipe shown above:

The SPARYQL query at the top of the pipe runs the Edubase query and is then split – the same items are passed into each of the two parts of the pipe, but they are processed differently. In the left hand branch, we treat the lat and long elements from the Edubase query in order to create y:location elements that the pipe knows how to process as geo elements (e.g. in the creation of a KML output from the pipe).

The right hand branch does something different: the loop block works through the list of recently opened schools one school at a time, and for each one looks up the region identifier and passes it to the traffic monitoring points by region pipe. The school item is then replaced by the list of traffic monitoring points in that region.

You can try the pipe out here: traffic monitoring near most recently opened schools

So that’s one way of doing it. Another way is to take the lat/long of each school and pass that information to a pipe that looks up the traffic monitoring points within a bounding box centered on the original location co-ordinates. This gives us a little more control over the notion of ‘traffic monitoring points in the vicinity of a school’.
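
For reference, the bounding box step is just arithmetic on the school’s co-ordinates; a minimal sketch (working in raw degrees, and ignoring the fact that a degree of longitude shrinks as you move away from the equator) might look like this:

# Minimal sketch: a crude bounding box around a lat/long point, in degrees
def bounding_box(lat,lng,size=0.05):
    half=size/2.0
    return (lat-half,lng-half,lat+half,lng+half) # minLat, minLong, maxLat, maxLong

print bounding_box(52.5,-1.9) # made-up example co-ordinates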

Again we see a repeat of the fork and merge pattern used above, although this time the right hand branch is passed to a pipe that looks up points within a bounding box specified by the latitude and longitude of each school. A third parameter specifies the size of the bounding box:

Notice from the preview of the pipe output how we have details from the left hand branch – the recently opened schools – as well as the right hand branch – the neighbouring traffic monitoring points. Here’s the result again:

As with any map previewing pipe, a KML feed is available that allows the results to be displayed in a(n embeddable) Google map:

(Quick tip: if a Google map chokes on a Yahoo pipes KML URI, use a URL shortener like TinyURL or bit.ly to get a shortened version of the Yahoo Pipes KML URL, and then post that into the Google maps search box:-)

So there we have it – my take on using Yahoo Pipes to “join” two, err, Linked Data datasets on data.gov.uk :-) I call it pipelinked data :-)

PS some readers may remember how services like Google Fusion Tables can also be used to “join” tabular datasets sharing common columns (e.g. Data Supported Decision Making – What Prospects Does Your University Offer). Well, it seems as if the Google folks have just opened up an API to Google Fusion Tables. Now it may well be that Linked Data is the one true path to enlightenment, but don’t forget that there are many more mortals than there are astronauts…)

PPS for the promised bit on “lower[ing] the barriers to entry for exploring the potential of Linked Data”, that’ll have to wait for another post…

Hackable SPARQL Queries: Parameter Spotting Tutorial

Whenever I come across a new website or search tool, one of the first things I do is have a look at the URIs of resource pages and search results to see: a) whether I can make sense of them (that is, are they in any sense human readable), and b) whether they are “hackable”, to the extent that I can change certain parts of the URI in particular way and have a pretty good idea what the resulting page will look like.

If the URI is hackable, then it often means that it can be parameterised, in the sense that I can construct valid URIs from some sort of template within which part of the URI path, or one of the URI arguments, is replaced using a variable that can be assigned a particular value as required.

So for example, a search for the term ouseful in Google delivers the results page with URI that looks like:
http://www.google.com/search?client=safari&rls=en&q=ouseful&ie=UTF-8&oe=UTF-8

Comparing the search term that I entered (ouseful) with the URI, it’s easy to see how the search term is used in order to create the results page URI:
http://www.google.com/search?client=safari&rls=en&q=SEARCH_TERM_HERE&ie=UTF-8&oe=UTF-8

This technique applies equally to looking at SPARQL search queries, so here’s a worked through example that makes use of a query on the Talis n2 blog (I tend to use SparqlProxy for running SPARQL queries):
#List the uri, latitude and longitude for road traffic monitoring points on the M5
PREFIX road: <http://transport.data.gov.uk/0/ontology/roads#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE {
?x a road:Road.
?x road:number "M5"^^xsd:NCName.
?x geo:point ?point.
?point wgs84:lat ?lat.
?point wgs84:long ?long.
}

Looking carefully at the descriptive comment:

#List the uri, latitude and longitude for road traffic monitoring points on the M5

and the query:

...
?x road:number "M5"^^xsd:NCName.
...

we see how it is possible to parameterise the query such that we can replace the “M5” string with a variable and use it to pass in the details of (presumably) any UK road number.
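
By way of comparison, here’s what that parameterisation might look like in plain Python rather than in a pipe; the SPARQLProxy parameter names (query, output, service-uri) are borrowed from the SPARQLProxy usage shown earlier in this collection, so treat the details as an illustrative sketch:

# Hedged sketch: substitute a road number into the query and run it via SPARQLProxy
import urllib

QUERY_TEMPLATE='''PREFIX road: <http://transport.data.gov.uk/0/ontology/roads#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE {
?x a road:Road.
?x road:number "%(road)s"^^xsd:NCName.
?x geo:point ?point.
?point wgs84:lat ?lat.
?point wgs84:long ?long.
}'''

def road_points(road_number):
    params=urllib.urlencode({'query':QUERY_TEMPLATE % {'road':road_number},
                             'output':'csv',
                             'service-uri':'http://services.data.gov.uk/transport/sparql'})
    return urllib.urlopen('http://data-gov.tw.rpi.edu/ws/sparqlproxy.php',params).read()

print road_points('A1')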

In Yahoo Pipes, here’s what the parameterisation looks like – we construct the query string and pass in a value for the desired road number from a user text input (split the query string after ?x road:number “):

The rest of the pipe is built around the SPARYQL pattern that I have described before (e.g. Getting Started with data.gov.uk, Triplr SPARYQL and Yahoo Pipes):

By renaming the latitude and longitude value elements as y:location.lat and y:location.lon, the pipe infrastructure can do its thing and provide us with a map based preview of the pipe output, as well as a KML output that can be viewed in Google maps (simply paste the KML URI into the Google maps search box and use it as the search term) or Google Earth, for example:

Inspection of the pipe’s KML output URL:
http://pipes.yahoo.com/pipes/pipe.run?
_id=78f6547cc12ac3ebcb84144ec3e37205
&_render=kml&roadnum=M5

shows that it is also hackable. Can you see how to change it so that it will return the traffic monitoring points on the A1, bearing in mind it currently refers to the M5?

So there we have it – given an example SPARQL query for road traffic monitoring locations on the M5, we can parameterise the query by observation and construct a pipe that gives a map based preview, as well as a KML version of the output, all in less time than it takes to document how it was done… :-)

Here’s another example. This time the original query comes from @tommyh (geeky related stuff here;-); the query pulls a list of motorway service station locations from dbpedia:

PREFIX rdfs: <http://www.w3.org/2000/01/rdf-schema#>
PREFIX dbpprop: <http://dbpedia.org/property/>
PREFIX yago-class: <http://dbpedia.org/class/yago/>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?services ?label ?road ?lat ?long
WHERE {
?services dbpprop:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:infobox_motorway_services> .
?services rdfs:label ?label
OPTIONAL {
?services dbpprop:road ?road .
?services dbpprop:lat ?lat .
?services dbpprop:long ?long .
} .
FILTER (isIRI(?road)) .
}
ORDER BY ASC(?label)

The results look like:

So how can we tweak the original query to search for motorway services on the M1? By inspection of the query, we see the search is looking for services on any ?road (and more than that, on any isIRI(?road), whatever that means?!;-) Looking at the results, we see that the roads are identified in the form:
<http://dbpedia.org/resource/M40_motorway>

So we can tweak the query with an additional condition that requires a particular road. For example:

WHERE {
?services dbpprop:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:infobox_motorway_services> .
?services rdfs:label ?label .
?services dbpprop:road <http://dbpedia.org/resource/M1_motorway>
OPTIONAL {
?services dbpprop:lat ?lat .
?services dbpprop:long ?long .
}
}

(I think we can drop the original FILTER too?)

To parameterise this query, we just need to feed in the desired road number here:

<http://dbpedia.org/resource/ROADNUMBER_motorway>

Alternatively, we can hack in a regular expression to filter the results by road number – e.g. using the M1 again:

WHERE {
?services dbpprop:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:infobox_motorway_services> .
?services rdfs:label ?label .
?services dbpprop:road ?road
OPTIONAL {
?services dbpprop:lat ?lat .
?services dbpprop:long ?long .
} .
FILTER (isIRI(?road) && regex(?road,"M1_")) .
}

This time, the parametrisation would occur here:
FILTER (isIRI(?road) && regex(?road,"ROADNUMBER_"))

Note that if we just did the regular expression on “M1” rather than “M1_” we’d get back results for the M11 etc as well…

In the spirit of exploration, let’s see if we can guess at/pattern match towards a little bit more. (Note guessing may or may not work – but if it doesn’t, you won’t break anything!)

The line:
?services rdfs:label ?label
would seem to suggest that human readable labels corresponding to URI identifiers may be recorded using the rdfs:label relation. So let’s see:

Create a ?roadname variable in the query and see if ?road rdfs:label ?roadname manages to pull out a useful label:
SELECT ?services ?label ?roadname ?road ?lat ?long
WHERE {
?services dbpprop:wikiPageUsesTemplate <http://dbpedia.org/resource/Template:infobox_motorway_services> .
?services rdfs:label ?label .
?services dbpprop:road ?road .
?road rdfs:label ?roadname
OPTIONAL

Ooh… that seems to work (in this case, at least… maybe it’s a dbpedia convention, maybe it’s a general convention, who knows?!:-)

But it’s a little messy, with different language variants also listed. However, another trick in my toolbox is memory. I remember seeing a filter option in a query once before:
&& lang(?someLabel)=’en’

Let’s try it – change the filter terms to:
FILTER (isIRI(?road) && regex(?road,"M1_") && lang(?roadname)='en') .
and see what happens:

So now I have a query that I can use to find motorway service station locations on a particular UK motorway, and get the name of the motorway back as part of the results. And all with only a modicum of knowledge/understanding of SPARQL… Instead, I relied on pattern matching, a memory of a fragment of a previous query and a bit of trial and error…

PS If you want to try out hacking around with a few other SPARQL queries, I’ve started collecting some likely candidates: Bookmarking and Sharing Open Data Queries

Bookmarking and Sharing Open Data Queries

Over the last few months, I’ve had several aborted attempts at trying to get to grips with SPARQL’n’RDF, two key ingredients in the Linked Data initiative. So as the sort of self-directed learner who often relies on learning by example, I’ve put together a Google form to collect together example SPARQL and Google Spreadsheet (aka Guardian Datastore) queries that I can remix and reuse for my own purposes.

Here’s an example of part of the form:

The form collects a description of the query, its endpoint, and ontologies used in the query, the query itself, and optionally a link to an example output from the query, as well as other bits of info (e.g. there’s a place for a link to a blog post describing the query).

Here are some of the bookmarked queries:

At the moment, the saved queries can only be viewed in the spreadsheet, but with time allowing I hope to build a front end/explorer that will allow you to run the queries, see preview results of the queries etc etc. (Note this is intended as a tool for !astronauts to get started with/teach about/learn about/explore various datasets. Folk who would be put off by first being exposed to the RDF’n’SPARQL, rather than seeing the data in a table, plotted on a chart, etc etc. Remember, most people who see lat/long data in a table do not see in their mind’s eye a map with corresponding markers on it; they see a list of largely meaningless numbers…)

You can find the form here: QUERY Sharing Form and the results here: Example Queries [UPDATE: to run a query, you need to find the correct endpoint and then use the PREFIX column entries, along with the SPARQL/SELECT common stuff, to run the query

e.g. this should form the basis of a valid query in the form here:

PREFIX road: <http://transport.data.gov.uk/0/ontology/roads#>
PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX geo: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX wgs84: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE {
  ?x a road:Road.
  ?x road:number "M5"^^xsd:NCName.
  ?x geo:point ?point.
  ?point wgs84:lat ?lat.
  ?point wgs84:long ?long.
}

There aren’t many examples in there at the moment, but the form is an open one and you can use it if you like… I’m also taking suggestions for how to improve the form so that other folk might be tempted into using it… ;-) (Note that I intend to keep tweaking the spreadsheet as I use it in order to make it more useful for me, if no-one else…)

The spreadsheet that collects the results should also be open as a read only document (let me know if you try to use it and have any problems doing so) so feel free to browse through the examples. I intend to put a front end of sorts onto the spreadsheet at some point using the Google Visualisation API (cf. my original (and currently stalled) Guardian DataStore Explorer), but if you want to beat me to it, go for it :-)

Programming Pipes With Delicious and Sharing data.gov.uk SPARQL Queries As A Result

In the post Sharing Linked Data Queries With Mortals… I described a rather clunky pattern for sharing SPARQL queries onto a data.gov.uk sparql endpoint using delicious, along with a Yahoo pipe to generate a SPARQLProxy URI that identified a CSV formatted output from the query that could be consumed in something like Google spreadsheets (e.g. Viewing SPARQLed data.gov.uk Data in a Google Spreadsheet).

In this post, I’ll refine that pattern a little more and show how to use delicious to bookmark a “processed” form of the output of the query, along with all the ingredients needed to generate that output. In a later post (hopefully before Christmas) I’ll try to show how the pattern can be used to share queries into other datastores, such as Google visualization API queries into a Google spreadsheet.

[The post describes two independent operations – firstly, how to generate a sparqlproxy uri from a set of delicious tags; secondly, how to generate a map from sparqlproxy output.]

In what follows, I’m using delicious as a database, and delicious tags as machine tags that can be used as arguments within a Yahoo pipe. The intention is not to suggest that this is even a good way of sharing SPARQL queries and demo outputs, but it does show an improvised way of sharing them. It also provides just enough raw material to allow UI designers to think how we might capture, share and display sample use cases that go from SPARQL query to human meaningful output.

So here’s what a “finished” delicious bookmark looks like:

The three “machine tags”:
– endpoint:http://services.data.gov.uk/transport/sparql
– output:csv
– query:http://codepad.org/hPo2XIzx/raw.txt
are used as instructions in a Yahoo pipe that generates a query using SPARQLProxy:

A feed of bookmarked queries is pulled in from delicious, and checked to see that all the programming arguments that are required are there. (The sparqlproxy_demo tag requirement is just a convenience.)

If all the arguments are available, their values are captured and passed to a routine to construct the SPARQLProxy URIs:
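
For anyone who’d rather skip the pipework, here’s a rough Python rendering of the same idea; rather than guessing at any “query by reference” feature SPARQLProxy may or may not have, it fetches the bookmarked query text itself and reuses the query/output/service-uri parameters seen earlier, so treat the details as assumptions:

# Hedged sketch: turn a bookmark's machine tags into a SPARQLProxy request
import urllib

def machine_tags(tags):
    # tags is the list of tags on a bookmark, e.g.
    # ['sparqlproxy_demo','endpoint:http://...','output:csv','query:http://...']
    found={}
    for tag in tags:
        for prefix in ('endpoint:','output:','query:'):
            if tag.startswith(prefix):
                found[prefix.rstrip(':')]=tag[len(prefix):]
    return found

def sparqlproxy_url(tags):
    args=machine_tags(tags)
    queryText=urllib.urlopen(args['query']).read() # fetch the raw SPARQL text
    return 'http://data-gov.tw.rpi.edu/ws/sparqlproxy.php?'+urllib.urlencode(
        {'query':queryText,'output':args['output'],'service-uri':args['endpoint']})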

The query bookmarked by the tags is a query onto the transport database that pulls out the location of traffic monitoring points on the M5. The bookmarked URI is a demonstration of how to use the output of that query. In the current example, the bookmarked demo URI looks like this:

http://maps.google.com/maps?f=q&source=s_q&hl=en&geocode=
&q=http:%2F%2Fpipes.yahoo.com%2Fpipes%2Fpipe.run%3F_id%3D1c77faa95919df9dbea678a6bf4881c6%26_render%3Dkml
&sll=37.0625,-95.677068&sspn=44.25371,70.751953&ie=UTF8&z=8

That is, it is a bookmark to a Google map that is displaying a KML feed pulled in from a Yahoo pipe.

This is the Google Map:

This is the Yahoo pipe the KML is pulled from:

And this is a peek inside that pipe:

The URI for the fetched CSV file is one that was generated from the bookmarked query tags by the first pipe shown above.

So to recap – this bookmark:

links through to a Google map showing traffic monitoring locations on the M5. The map is an ‘end-user’ demo that shows how Government data may be displayed in a meaningful way. In addition, the tags in the bookmark carry enough information for the user to construct a SPARQL query that will generate the data that is displayed by the map. A utility Yahoo pipe (ID 22059ac6f967d7117b1262586c92b371) can take a delicious RSS feed containing the bookmark and generate a SPARQLProxy URI that calls the tagged endpoint with the tagged query and generates the tag specified output. There is then a break in the chain. A utility pipe capable of handling SPARQLProxy generated CSV from data.gov.uk transport data generates a KML version of the data, which is then passed to a Google map.

So what? So I don’t know what… but it has got me thinking that what might be useful is a quick way of sharing:
– info that is sufficient to generate and run a particular SPARQL query.
– along with a link to an ‘end user’ demo showing how that data might be used or displayed
– all in one handy package…

Which is what the above does… sort of.. though it’s still way too complicated for mortals…

PS It occurs to me that it might be possible to define a pipeline using delicious tags too? Maybe something like:
pipeline:22059ac6f967d7117b1262586c92b371?name=psychemedia&tag=transportDemo//
1c77faa95919df9dbea678a6bf4881c6?_render=kml//
gmap

where the first step in the pipeline (22059ac6f967d7117b1262586c92b371?name=psychemedia&tag=transportDemo) says “run the delicious RSS feed from psychemedia/transportDemo (assuming I had recursively bookmarked that record with transportDemo) through the pipe with ID 22059ac6f967d7117b1262586c92b371 to generate the SPARQLProxy URI, then pass that to the pipe with ID 1c77faa95919df9dbea678a6bf4881c6 which should generate a KML output, which in turn should be sent to a Google map”?!

Sharing Linked Data Queries With Mortals…

So I know, I know, I b****y well know how important the whole RDF thing is for Linked Data, and I know we’re not there yet with respect to actual people using data pulled from data.gov.* sources, and I’m starting to be persuaded that maybe data.gov.uk is only there to feed the growth of semantic web developer capacity but ultimately, ultimately, it will probably be folk who can’t cope with anything other than a spreadsheet who are going to have to use this data…

…so the spirit in which this is offered is one of just trying to protect the interests of potential future end users while the geeky tech developer astronauts (astronauts being @iand’s term;-) do the groundwork’n’spadework and make design decisions whose full impact may not otherwise be realised for a little while yet…

So what am I offering…? A quick’n’dirty way of sharing bookmarks into the sparqlproxy Web Service that I posted about in Viewing SPARQLed data.gov.uk Data in a Google Spreadsheet.

So how does it work? Like this…

The geeky tech SPARQL speaking astronaut writes their SPARQL query and posts it into codepad:

They grab the link to the raw text and bookmark it in delicious; the SPARQL endpoint for the query is pasted into the description, and a brief description of the query into the title; the required output is identified using an output: machine tag (e.g. output:sparql, output:html, output:csv, output:xml, output:exhibit or output:gvds):

(An alternative might be to have the endpoint as the title, and the description as the description, or a brief description as a title, a full description as a description, and an endpoint: “machine tag” for the endpoint, but this was just a proof of concept, right? ;-)

The following pipe constructs the SPARQLProxy query for each bookmark using the specified query, endpoint and output type (at the moment, the pipe also requires a sparqlproxy_demo tag to be present):

A link to the result of the query, suitably transformed, is then rewritten as the link in the output feed.

A bit of tidying up on the pipe lets you specify a delicious user and/or tag as the origin of the bookmarked links…

So there we have it, an easy way to share SPARQLed queries and get access to “human usable” outputs…

PS there’s no reason, in the recipe above, not to also use the sparql endpoint URI in a tag or machine tag too, to allow for queries run over the same bookmark to be collected together and pulled out of delicious by tag/endpoint…