Handling RDF on Your Own System – Quick Start

One of the things that I think tends to be a bit of an elephant in the Linked Data room is the practical difficulty of running a query that links together results from two different datastores, even if they share common identifiers. The solution – at the moment at least – seems to require grabbing a dump of both datastores, uploading them to a common datastore and then querying that…

…which means you need to run your own triple store…

This quick post links out to the work of two others, as much as a placeholder for myself as for anything, describing how to get started doing exactly that…

First up, John Goodwin, aka @gothwin, (a go-to person if you ever have dealings with the Ordnance Survey Linked Data) on How can I use the Ordnance Survey Linked Data: a python rdflib example. As John describes it:

[T]his post shows how you just need rdflib and Python to build a simple linked data mashup – no separate triplestore is required! RDF is loaded into a Graph. Triples in this Graph reference postcode URIs. These URIs are de-referenced and the RDF behind them is loaded into the Graph. We have now enhanced the data in the Graph with local authority area information. So as well as knowing the postcode of the organisations taking part in certain projects we now also know which local authority area they are in. Job done! We can now analyse funding data at the level of postcode, local authority area and (as an exercise for the reader) European region.
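To make that pattern a little more concrete, here’s a minimal sketch of the sort of thing John describes, using nothing but rdflib; the local filename, the test used to spot postcode URIs and the district predicate are my own illustrative assumptions rather than John’s actual code:

from rdflib import Graph, URIRef

g = Graph()
# Load some local RDF that references Ordnance Survey postcode URIs
# ("projects.rdf" is a hypothetical file, standing in for the funding data).
g.parse("projects.rdf")

# Find the postcode URIs mentioned as objects in the data...
postcode_uris = {o for o in g.objects(None, None)
                 if isinstance(o, URIRef) and "postcodeunit" in str(o)}

# ...then de-reference each one, loading the RDF behind it into the same graph.
for uri in postcode_uris:
    g.parse(str(uri))

# The graph is now enriched with local authority area information; for example,
# we can read off a district for each postcode (predicate URI assumed here).
district = URIRef("http://data.ordnancesurvey.co.uk/ontology/postcode/district")
for pc, d in g.subject_objects(district):
    print(pc, d)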

Secondly, if you want to run a full-blown triple store on your own localhost, check out this post from Jeni Tennison, aka @jenit, (a go-to person if you’re using the data.gov.uk Linked Datastores, or have an interest in the Linked Data JSON API): Getting Started with RDF and SPARQL Using 4store and RDF.rb, which documents how to get started on the following challenges (via Richard Pope’s Linked Data/RDF/SPARQL Documentation Challenge):

Install an RDF store from a package management system on a computer running either Apple’s OSX or Ubuntu Desktop.
Install a code library (again from a package management system) for talking to the RDF store in either PHP, Ruby or Python.
Programmatically load some real-world data into the RDF datastore using either PHP, Ruby or Python.
Programmatically retrieve data from the datastore with SPARQL using either PHP, Ruby or Python.
Convert retrieved data into an object or datatype that can be used by the chosen programming language (e.g. a Python dictionary) – see the sketch below.
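By way of a quick sketch of those last two steps in Python (rather than the Ruby Jeni uses), and assuming a local SPARQL endpoint is already up and loaded with data – the http://localhost:8000/sparql/ address is just the sort of thing a default 4store HTTP server exposes, so treat it as an assumption – SPARQLWrapper will run a query and the JSON results can then be reshaped into plain Python dictionaries:

from SPARQLWrapper import SPARQLWrapper, JSON

# Point at a local SPARQL endpoint (address assumed; adjust for your own install).
sparql = SPARQLWrapper("http://localhost:8000/sparql/")
sparql.setQuery("""
    SELECT ?s ?p ?o WHERE { ?s ?p ?o } LIMIT 10
""")
sparql.setReturnFormat(JSON)
results = sparql.query().convert()

# Reshape the SPARQL JSON results into a list of plain Python dictionaries.
rows = [{var: binding[var]["value"] for var in binding}
        for binding in results["results"]["bindings"]]
print(rows)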

PS it may also be worth checking out these posts from Kingsley Idehen:
SPARQL Guide for the PHP Developer
SPARQL Guide for Python Developer
SPARQL Guide for the Javascript Developer
SPARQL for the Ruby Developer

First Signs (For Me) of Linked Data Being Properly Linked…?!

As anyone who’s followed this blog for some time will know, my relationship with Linked Data has been an on-again, off-again one over the years. At the current time, it’s largely off – all my OpenRefine installs seem to have given up the ghost as far as reconciliation and linking services go, and I have no idea where the problem lies (whether with the plugins, the installs, with Java, with the endpoints, or with the reconciliations or linkages I’m trying to establish).

My dabblings with pulling data in from Wikipedia/DBpedia to Gephi (eg as described in Visualising Related Entries in Wikipedia Using Gephi and the various associated follow-on posts) continue to be hit and miss due to the vagaries of DBpedia and the huge gaps in infobox structured data across Wikipedia itself.

With OpenRefine not doing its thing for me, I haven’t been able to use that app as the glue to bind together queries made across different Linked Data services, albeit in piecemeal fashion. And from the occasional sideline view I have of the Linked Data world, I haven’t seen any obvious way of actually linking datasets other than by pulling identifiers into a new OpenRefine column (or wherever) from one service, then using those identifiers to pull in data from another endpoint into another column, and so on…

So all is generally not well.

However, a recent post by the Ordnance Survey’s John Goodwin (aka @gothwin) caught my eye the other day: Federating SPARQL Queries Across Government Linked Data. It seems that federated queries can now be made across several endpoints.

John gives an example using data from the Ordnance Survey SPARQL endpoint and an endpoint published by the Environment Agency:

The Environment Agency has published a number of its open data offerings as linked data … A relatively straightforward SPARQL query will get you a list of bathing waters, their name and the district they are in.

[S]uppose we just want a list of bathing water areas in South East England – how would we do that? This is where SPARQL federation comes in. The information about which European Regions districts are in is held in the Ordnance Survey linked data. If you hop over to the Ordnance Survey SPARQL endpoint explorer you can run [a] query to find all districts in South East England along with their names …

Using the SERVICE keyword we can bring these two queries together to find all bathing waters in South East England, and the districts they are in:

And here’s the query John shows, with the SERVICE clause calling out to the Ordnance Survey SPARQL endpoint:

SELECT ?x ?name ?districtname WHERE {
  ?x a <http://environment.data.gov.uk/def/bathing-water/BathingWater> .
  ?x <http://www.w3.org/2000/01/rdf-schema#label> ?name .
  ?x <http://statistics.data.gov.uk/def/administrative-geography/district> ?district .
  SERVICE <http://data.ordnancesurvey.co.uk/datasets/boundary-line/apis/sparql> {
    ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000041421> .
    ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  }
} ORDER BY ?districtname

In a follow-on post, John goes even further “by linking up data from Ordnance Survey, the Office of National Statistics, the Department of Communities and Local Government and Hampshire County Council”.

So that’s four endpoints – the original one against which the query is first fired, and three others…

SELECT ?districtname ?imdrank ?changeorder ?opdate ?councilwebsite ?siteaddress WHERE {
  ?district <http://data.ordnancesurvey.co.uk/ontology/spatialrelations/within> <http://data.ordnancesurvey.co.uk/id/7000000000017765> .
  ?district a <http://data.ordnancesurvey.co.uk/ontology/admingeo/District> .
  ?district <http://www.w3.org/2000/01/rdf-schema#label> ?districtname .
  SERVICE <http://opendatacommunities.org/sparql> {
    ?s <http://purl.org/linked-data/sdmx/2009/dimension#refArea> ?district .
    ?s <http://opendatacommunities.org/def/IMD#IMD-rank> ?imdrank . 
    ?authority <http://opendatacommunities.org/def/local-government/governs> ?district .
    ?authority <http://xmlns.com/foaf/0.1/page> ?councilwebsite .
  }
  ?district <http://www.w3.org/2002/07/owl#sameAs> ?onsdist .
  SERVICE <http://statistics.data.gov.uk/sparql> {
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/originatingChangeOrder> ?changeorder .
    ?onsdist <http://statistics.data.gov.uk/def/boundary-change/operativedate> ?opdate .
  }
  SERVICE <http://linkeddata.hants.gov.uk/sparql> {
    ?landsupsite <http://data.ordnancesurvey.co.uk/ontology/admingeo/district> ?district .
    ?landsupsite a <http://linkeddata.hants.gov.uk/def/land-supply/LandSupplySite> .
    ?landsupsite <http://www.ordnancesurvey.co.uk/ontology/BuildingsAndPlaces/v1.1/BuildingsAndPlaces.owl#hasAddress> ?siteaddress .
  }
}

Now we’re getting somewhere….

From Linked Data to Linked Applications?

Pondering how to put together some Docker IPython magic for running arbitrary command line functions in arbitrary docker containers (this is as far as I’ve got so far), I think the commands must include a couple of things:

  1. the name of the container (perhaps rooted in a particular repository): psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example;
  2. the actual command to be called: for example, one of the contentmine commands: getpapers -q {QUERY} -o {OUTPUTDIR} -x

We might also optionally specify mount directories shared between the calling environment and the called container, using a conventional default otherwise.
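As a rough sketch of the sort of thing I have in mind (the magic name, the /data mount point and the calling convention are all just working assumptions at this point), run from inside an IPython/Jupyter session:

import os
import shlex
import subprocess

from IPython.core.magic import register_line_magic

@register_line_magic
def dockercmd(line):
    """Usage: %dockercmd psychemedia/contentmine getpapers -q {QUERY} -o /data -x"""
    image, *command = shlex.split(line)
    # Mount the current working directory into the container at /data by default,
    # run the command in a throwaway container, then let the container be removed.
    output = subprocess.check_output(
        ["docker", "run", "--rm",
         "-v", "{}:/data".format(os.getcwd()),
         image] + command)
    print(output.decode())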

This got me thinking that the called functions might be viewed as operating in a namespace (psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example). And this in turn got me thinking about “big-L, big-A” Linked Applications.

According to Tim Berners-Lee’s four rules of Linked Data, the web of data should:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

So how about a web of containerised applications that would:

  1. Use URIs as names for container images
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information (in the minimal case, this corresponds to a Dockerhub page, for example; in a user-centric world, this could just return a help file identifying the commands available in the container, along with help for individual commands)
  4. Include a Dockerfile, so that they can discover what the application is built from (which may itself link to other Dockerfiles).

Compared with Linked Data, where the idea is about relating data items one to another, the identifying HTTP URI here actually represents the ability to make a call into a functional execution space. Linkage into the world of linked web resources might be provided through Linked Data relations that specify that a particular resource was generated from an instance of a Linked Application, or that the resource can be manipulated by an instance of a particular application.

So for example, files linked to on the web might have a relation that identifies the filetype, and the filetype is linked by another relation that says it can be opened in a particular linked application. Another file might link to a description of the workflow that created it, and the individual steps in the workflow might link to function/command identifiers that are linked to linked applications through relations that associate particular functions with a particular linked application.
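By way of illustration of that sort of linkage (all the URIs and relation names below are made up purely for the purposes of the sketch), we might assert triples along these lines:

from rdflib import Graph, Namespace, URIRef

# An entirely hypothetical vocabulary for talking about linked applications.
LAPP = Namespace("http://example.org/linkedapps/")

g = Graph()
datafile = URIRef("http://example.org/data/experiment1.csv")
filetype = URIRef("http://example.org/filetypes/CSV")
app = URIRef("http://example.org/containers/psychemedia/contentmine")

g.add((datafile, LAPP.fileType, filetype))   # the file has a filetype...
g.add((filetype, LAPP.openableBy, app))      # ...that a particular linked application can open
g.add((datafile, LAPP.generatedBy, app))     # ...and the file was generated by an instance of that app

print(g.serialize(format="turtle"))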

Workflows may be defined generically, and then instantiated within a particular experiment. So for example: “load file with particular properties, run FFT on particular columns, save output file” becomes instantiated within a particular run of an experiment as “load file with this URI, run the FFT command from this linked application on particular columns, save output file with this URI”.
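Something like the following, say, where a generic workflow template is bound to specific resource URIs and a specific linked application for a given run (the names and URIs are entirely illustrative):

# A generic workflow description, with placeholders...
workflow_template = [
    {"step": "load", "input": "{INPUT_URI}"},
    {"step": "fft",  "app": "{FFT_APP_URI}", "columns": "{COLS}"},
    {"step": "save", "output": "{OUTPUT_URI}"},
]

# ...and the bindings for one particular run of an experiment.
bindings = {
    "{INPUT_URI}": "http://example.org/data/run42/raw.csv",
    "{FFT_APP_URI}": "http://example.org/containers/fft-tools",
    "{COLS}": "2,3",
    "{OUTPUT_URI}": "http://example.org/data/run42/spectrum.csv",
}

# Instantiate the template by substituting the bound values.
workflow_run = [{k: bindings.get(v, v) for k, v in step.items()}
                for step in workflow_template]
print(workflow_run)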

Hmm… thinks.. there is a huge amount of work already done in the area of automated workflows and workflow execution frameworks/environments for scientific computing. So this is presumably already largely solved? For example, Integrating Containers into Workflows: A Case Study Using Makeflow, Work Queue, and Docker, C. Zheng & D. Thain, 2015 [PDF]?

A handful of other quick points:

  • the model I’m exploring in the Docker magic context is essentially a stateless/serverless computing approach, where a commandline container is created on demand and treated as disposable, just running a particular function before being destroyed (see also the OpenAPI approach).
  • The Linked Application notion extends to other containerised applications, such as ones that expose an HTML user interface over http that can be accessed via a browser. In such cases, things like WSDL (or WADL; remember WADL?) provided a machine-readable, formalised way of describing functional resource availability.
  • In the sense that commandline containerised Linked Applications are actually services, we can also think about web services publishing an http API in a similar way?
  • services such as Sandstorm, which have the notion of self-running containerised documents, have the potential to bind a specific document to an interactive execution environment for that document.

Hmmm… so how much nonsense is all of the above, then?