First Dabblings With Pipelinked Linked Data

One of the promises of the Linked Data lobby is the ability to combine data from different datasets that share common elements, although this ability is not limited to Linked Data (see, for example, Mash/Combining Data from Three Separate Sources Using Dabble DB). In this post, I’ll describe a quick experiment in using Yahoo Pipes to combine data from two different data sources and briefly consider the extent to which plug’n’play data can lower the barriers to entry for exploring the potential of Linked Data.

The datasets I’ll join are both data.gov.uk Linked Data datastores – the transport datastore and the Edubase/Education datastore. The task I’ve set myself is to look for traffic monitoring points in the vicinity of one or more schools and to produce a map that looks something like this:

So to get started, let’s grab a list of schools… The Talis blog post SPARQLing data.gov.uk: Edubase Data contains several example queries over the education datastore. The query I’ll use is derived trivially from one of those examples; in particular, it grabs the name and location of the two newest schools in the UK:
prefix sch-ont: <http://education.data.gov.uk/def/school/>
prefix geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?school ?name ?date ?lat ?long WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:openDate ?date;
geo:lat ?lat;
geo:long ?long.
} ORDER BY DESC(?date) LIMIT 2

Pasting the query into the SPARYQL Pipe -map previewer shows a couple of points on a map, as expected.

So how can we look for traffic monitoring points located in the same area as a school? One of the big problems I have with Linked Data is finding out what the shared elements are between data sets (I don’t have a rule of thumb for doing this yet) so it’s time for some detective work – looking through example SPARQL queries on the two datasets, ploughing through the data.gov.uk Google group, and so on. Searching based on lat/long location data, e.g. within a bounding box, is one possibility, but it’d be neater, to start with at least, to try to use a shared “area”, such as the same parish, or other common administrative area.

After some digging, here’s what I came up with: this snippet from a post to the data.gov.uk Google group relating to the transport datastore:
#If you’re prepared to search by (local authority) area instead of by a bounding box,
….
geo:long ?long ;
<http://geo.data.gov.uk/0/ontology/geo#area> <http://geo.data.gov.uk/0/id/area/00DA>;
traffic:count ?count .

and this one from the aforementioned Talis Edubase post relating to the education datastore:
prefix sch-ont: <http://education.data.gov.uk/def/school/>
SELECT ?name ?lowage ?highage ?capacity ?ratio WHERE {
?school a sch-ont:School;
sch-ont:districtAdministrative <http://statistics.data.gov.uk/id/local-authority-district/00HA> .

The similar format of the area codes, and the similarity in language (“prepared to search by (local authority) area” and “id/local-authority-district/”) suggest to me that these two things actually refer to the same thing (I asked @jenit … it seems they do…)

So, here’s a recipe for searching for traffic monitoring locations in the same local authority district as a recently opened school. Firstly, modify the SPARQL query shown above so that it also returns the local authority area:

SELECT ?school ?name ?date ?district ?lat ?long WHERE {
?school a sch-ont:School;
sch-ont:establishmentName ?name;
sch-ont:openDate ?date;
sch-ont:districtAdministrative ?district;
geo:lat ?lat;
geo:long ?long.
} ORDER BY DESC(?date) LIMIT 2

The result looks something like this:

Secondly, construct a test query on the transport datastore (http://services.data.gov.uk/transport/sparql) to pull out traffic monitoring points, along with their locations, using a local area URI as the search key:

PREFIX rdf: <http://www.w3.org/1999/02/22-rdf-syntax-ns#>
PREFIX traffic: <http://transport.data.gov.uk/0/ontology/traffic#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
PREFIX area: <http://geo.data.gov.uk/0/ontology/geo#>
PREFIX xsd: <http://www.w3.org/2001/XMLSchema#>
SELECT ?point ?lat ?long WHERE
{ ?point a traffic:CountPoint ;
geo:lat ?lat ;
geo:long ?long ;
<http://geo.data.gov.uk/0/ontology/geo#area> <http://geo.data.gov.uk/0/id/area/00CG>. }

We can create a pipe based around this query that takes an administrative area identifier, runs the query through a SPARYQL (SPARQL and YQL) pipe, and returns the traffic monitoring points in that area:

The regular expression block is a hack used to put the region identifier into the form that is required by the transport endpoint if it is passed in using the form required by the education datastore.
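For anyone wanting to try the same trick outside Pipes, here is a rough Python sketch of what that pipe does: normalise the area identifier, drop it into the test query, and build a request URL. The regular expression actually used in the pipe isn't reproduced in this post, so the pattern below is my assumption, based only on the two URI forms quoted earlier; the function names are made up for illustration, and I'm assuming the endpoint accepts the query in the standard SPARQL protocol `query` parameter.

```python
import re
from urllib.parse import urlencode

# Endpoint as quoted in the post; it may no longer be live.
TRANSPORT_ENDPOINT = "http://services.data.gov.uk/transport/sparql"

# The two URI forms seen in the example queries (assumed to share the trailing code).
AREA_PREFIX = "http://geo.data.gov.uk/0/id/area/"
_DISTRICT_URI = re.compile(
    r"^http://statistics\.data\.gov\.uk/id/local-authority-district/([0-9A-Za-z]+)$"
)

QUERY_TEMPLATE = """PREFIX traffic: <http://transport.data.gov.uk/0/ontology/traffic#>
PREFIX geo: <http://www.w3.org/2003/01/geo/wgs84_pos#>
SELECT ?point ?lat ?long WHERE {
  ?point a traffic:CountPoint ;
    geo:lat ?lat ;
    geo:long ?long ;
    <http://geo.data.gov.uk/0/ontology/geo#area> <%s> .
}"""


def to_transport_area(uri):
    """Rewrite an Edubase-style district URI into the transport-style area URI.

    Anything that does not look like an Edubase district URI is passed
    through unchanged, so transport-style URIs can be used directly.
    """
    match = _DISTRICT_URI.match(uri)
    if match:
        return AREA_PREFIX + match.group(1)
    return uri


def traffic_points_query(area_uri):
    """Fill the (normalised) area URI into the test query."""
    return QUERY_TEMPLATE % to_transport_area(area_uri)


def request_url(area_uri):
    """Build a GET URL carrying the query in the 'query' parameter."""
    return TRANSPORT_ENDPOINT + "?" + urlencode({"query": traffic_points_query(area_uri)})
```

Calling `request_url("http://statistics.data.gov.uk/id/local-authority-district/00HA")` gives a URL whose query asks for count points in `http://geo.data.gov.uk/0/id/area/00HA`; fetching and parsing the response is left out of the sketch.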

Now we’re going to take results from the recent schools query and then look up the traffic monitoring points in that area via the pipe shown above:

The SPARYQL query at the top of the pipe runs the Edubase query and is then split – the same items are passed into each of the two parts of the pipe, but they are processed differently. In the left hand branch, we treat the lat and long elements from the Edubase query in order to create y:location elements that the pipe knows how to process as geo elements (e.g. in the creation of a KML output from the pipe).

The right hand branch does something different: the loop block works through the list of recently opened schools one school at a time, and for each one looks up the region identifier and passes it to the traffic monitoring points by region pipe. The school item is then replaced by the list of traffic monitoring points in that region.
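The fork, loop and merge can be sketched in a few lines of Python. This is only an illustration of the pattern, not the pipe itself: the dictionary keys (`district`, `kind`) and the `lookup_points` callable, which stands in for the traffic monitoring points by region pipe, are names I've invented for the example.

```python
def schools_and_nearby_points(schools, lookup_points):
    """Fork the list of schools into two branches, then merge the results.

    schools       -- list of dicts, each with at least a 'district' key
    lookup_points -- callable taking a district identifier and returning a
                     list of dicts describing traffic monitoring points
                     (a stand-in for the 'points by region' pipe)
    """
    # Left branch: keep the schools themselves, tagged so they can be
    # told apart from the traffic points after the merge.
    left = [dict(school, kind="school") for school in schools]

    # Right branch: replace each school with the traffic monitoring
    # points found in that school's district.
    right = []
    for school in schools:
        for point in lookup_points(school["district"]):
            right.append(dict(point, kind="traffic point"))

    # Merge: one combined list that could be turned into map markers.
    return left + right
```

One thing the sketch makes obvious: if two schools share a district, that district's points will turn up twice unless they are de-duplicated after the merge.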

You can try the pipe out here: traffic monitoring near most recently opened schools

So that’s one way of doing it. Another way is to take the lat/long of each school and pass that information to a pipe that looks up the traffic monitoring points within a bounding box centered on the original location co-ordinates. This gives us a little more control over the notion of ‘traffic monitoring points in the vicinity of a school’.
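As a rough sketch of the bounding box idea in Python: given a school's latitude and longitude and a box size, work out the box edges and turn them into a SPARQL FILTER clause that could be added to the traffic count point query. I'm assuming here that the size is the full side length in decimal degrees and that the lat/long values can be cast with `xsd:double`; the post doesn't say what units the actual pipe uses, and the function names are invented for the example.

```python
def bounding_box(lat, long, size):
    """Return (south, north, west, east) for a box centred on lat/long.

    size is taken to be the full side length in decimal degrees
    (an assumption; the original pipe's units are not stated).
    """
    half = size / 2.0
    return (lat - half, lat + half, long - half, long + half)


def bbox_filter(lat, long, size):
    """Build a SPARQL FILTER restricting ?lat and ?long to the box."""
    south, north, west, east = bounding_box(lat, long, size)
    return (
        "FILTER (xsd:double(?lat) >= %.6f && xsd:double(?lat) <= %.6f && "
        "xsd:double(?long) >= %.6f && xsd:double(?long) <= %.6f)"
        % (south, north, west, east)
    )
```

Bear in mind that a degree of longitude covers less ground than a degree of latitude as you move away from the equator, so a box defined this way is not square on the map.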

Again we see a repeat of the fork and merge pattern used above, although this time the right hand branch is passed to a pipe that looks up points within a bounding box specified by the latitude and longitude of each school. A third parameter specifies the size of the bounding box:

Notice from the preview of the pipe output how we have details from the left hand branch – the recently opened schools – as well as the right hand branch – the neighbouring traffic monitoring points. Here’s the result again:

As with any map previewing pipe, a KML feed is available that allows the results to be displayed in a(n embeddable) Google map:

(Quick tip: if a Google map chokes on a Yahoo Pipes KML URI, use a URL shortener like TinyURL or bit.ly to get a shortened version of the Yahoo Pipes KML URL, and then post that into the Google maps search box:-)

So there we have it – my take on using Yahoo Pipes to “join” two, err, Linked Data datasets on data.gov.uk :-) I call it pipelinked data :-)

PS some readers may remember how services like Google Fusion Tables can also be used to “join” tabular datasets sharing common columns (e.g. Data Supported Decision Making – What Prospects Does Your University Offer). Well, it seems as if the Google folks have just opened up an API to Google Fusion Tables. Now it may well be that Linked Data is the one true path to enlightenment, but don’t forget that there are many more mortals than there are astronauts…

PPS for the promised bit on “lower[ing] the barriers to entry for exploring the potential of Linked Data”, that’ll have to wait for another post…

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

9 thoughts on “First Dabblings With Pipelinked Linked Data”

  1. What really is the difference between Yahoo Pipes and DERI pipes? Yahoo Pipes can make SPARQL queries (as demonstrated in the blog), and thus can deal with the semantic web…

  2. Great article! I’m attempting to extract info for a government project, but i’m running into difficulties because im’ learning by example, but can’t find an example of what I need to do!

    I’m trying to get a list of schools from a given district (in the example below, 00BA) and only return those which are NOT nurseries or preschools (identified by the Type “TypeOfEstablishment_EY_Setting”).

    I’ve managed to hack together the query below, but its returning duplicate results because most schools are listed under 2 Types, only one of which will be flagged as “TypeOfEstablishment_EY_Setting”.

    Am i meant to somehow re-filter what i’ve got so far, or can I limit the results returned initially?

    prefix sch-ont:
    prefix geo:
    prefix sch-type:
    SELECT ?school ?name ?date ?lat ?long ?capacity ?type WHERE {
    ?school a sch-ont:School;
    sch-ont:establishmentName ?name;
    sch-ont:openDate ?date;
    geo:lat ?lat;
    geo:long ?long;
    sch-type:type ?type;
    sch-ont:districtAdministrative
    .

    OPTIONAL {
    ?school sch-ont:schoolCapacity ?capacity
    }
    }

    1. I think some of the stuff in angle brackets must have got chopped out by the WordPress commenting filters.

      I haven’t seen the sch-type: prefix before – what does it refer to?

      Also, I haven’t seen TypeOfEstablishment_EY_Setting before? The list of settings I have seen all relate to the sch-ont: ontology ( http://education.data.gov.uk/ontology/school.rdf )

      Sorry I can’t be of any more help – I still haven’t got to grips with this stuff at all yet!

    2. “… but i’m running into difficulties because im’ learning by example, but can’t find an example of what I need to do!”

      Don’t I know that feeling… How about trying the RESTful API? http://epimorph-pubx1.appspot.com/demo.html (If you scroll to the bottom of the page and click the “meta” link, you’ll see how to tweak the URL to get a preview of the SPARQL query that is generated from the RESTful URL.)

      “I’ve managed to hack together the query below, but its returning duplicate results”
      I think there is a SELECT DISTINCT construction, if you have separate ?school ?type combinations, they would each get returned as separate results? Do you need to return the ?type? Can’t you somehow just select items that are – or are not – a particular specified type? (My lack of confidence in/understanding of SPARQL means I can’t just write you the query!:-(
