The Problem With Linked Data is That Things Don’t Quite Link Up

A v. quick post this one, because I have other stuff that really needs to be done, but it’s something I want to record as another couple of observations around the practical difficulties of engaging with Linked Data…

Firstly, identifiers for things most of us would probably call councils. The Guardian Datablog has just published data/details of the local council cuts. The associated Datastore Spreadsheet has a column containing council identifiers, as well as the council names:

Datastore spreadsheet - council cuts

Adding formal identifiers such as these is something I keep hassling Simon Rogers and the @datastore team about, so it’s great to see the use of a presumably standardised identifier there:-) Only – I can’t see how to join it up to any of the other myriad identifiers that seem to exist for council areas?

So for example, looking up Trafford on the National Statistics Linked Data endpoint identifies it as local-authority-district/00BU and Local education authority 358 – I can’t find R342 anywhere? Nor does R342 appear as an identifier on the OpenlyLocal page for Trafford Council, which is another default place I go to look at for bridging/linking information (but then, maybe a local authority is not a council?)
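
To make concrete what I mean by "joining it up", here's a minimal sketch of the kind of bridging-table lookup I was hoping to find. The R-code-to-ONS mapping below is entirely hypothetical (I couldn't find any published table linking the R*** codes to ONS district codes); the row format and the URI pattern are just my guesses at what such a bridge would look like.

```python
# Row as it might appear in the Datastore spreadsheet
guardian_rows = [
    {"council": "Trafford", "code": "R342"},
]

# HYPOTHETICAL bridge: Guardian R-code -> ONS local-authority-district code.
# No such published mapping seems to exist, which is the whole problem...
bridge = {"R342": "00BU"}

linked = []
for row in guardian_rows:
    ons_code = bridge.get(row["code"])
    if ons_code:
        # Mint a statistics.data.gov.uk style URI for the district
        uri = "http://statistics.data.gov.uk/id/local-authority-district/" + ons_code
        linked.append((row["council"], uri))
```

If a table like `bridge` existed somewhere as Linked Data, the join would be trivial; without it, the R-codes are stranded.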

(A use case for the data might be taking the codes and using them to colour areas on an Ordnance Survey OpenSpace map (ans. 1.17)… This requires a bridge into the namespaces the OS mapping tools recognise.)

I can google “Trafford R342” and find a couple of other references to this association, but I can’t find a way of linking to entities I know about in the Linked Data world?

But then, maybe the R*** areas don’t match any of the administrative areas that are recorded in any of the other data sources I found…?

So I have an identifier, but I don’t know what it actually refers to/links to, and I don’t know how to make use of it?

And then there’s a second related problem – a mismatch between popular understanding of a term/concept, and its formal use in a defined ontology, which can cause all sorts of problems when naively trying to make use of formally defined data…

Take for example, the case of counties. Following a brief Twitter exchange this morning with the ever helpful @gothwin, it turns out that if you live in somewhere like Southampton (or another unitary authority or metropolitan district), you don’t live in a county… (for example – compare the Ordnance Survey pages for postcode areas SO16 4GU and EX1 1HD). The notion of counties is apparently just a folk convention now, although the Association of British Counties is trying to “promote awareness of the continuing importance of the 86 historic (or traditional) Counties of Great Britain… contend[ing] that Britain needs a fixed popular geography, one divorced from the ever changing names and areas of local government but, instead, one rooted in history, public understanding and commonly held notions of cultural identity.” Which is why they “seek to fully re-establish the use of the Counties as the standard popular geographical reference frame of Britain and to further encourage their use as a basis for social, sporting and cultural activities”. (@gothwin did hint that OS might be “look[ing] at publishing a ‘people’s geography’ with traditional counties”.)

As it is, for a naive developer, (or random tinkerer, such as myself), struggling to get to grips with the mechanics of Linked Data, it seems that to make any use at all of government Linked Data, you also need a pretty good grasp of the data models before you randomly try hacking together queries or linking stuff together, as the nightmare exposure I had to COINS Linked Data suggests… ;-)

In other words, there are at least two major barriers to entry to using government Linked Data: on the one hand, there’s getting comfortable enough with things like SPARQL to be able to navigate Linked Data datasets and put together sensible queries (the technical problem); on the other hand, there’s understanding the data model and the things it models well enough to articulate even natural language questions that might be asked of a dataset (a domain expertise problem). (And as we try to link across datasets, the domain expertise problem just compounds?) Then all that remains is mapping the natural language query onto the formal query, given the definitions of ontologies being used…

(I know, I know – it’s always rash to query data you don’t understand… but I think a point I’m trying to make is that getting your head round Linked Data is made doubly difficult when things don’t work not because of the way you’ve written the query, but because you don’t understand the way the data has been modeled… (which ends up meaning it is a problem with the way you wrote the query, just not the way you thought…!))

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

12 thoughts on “The Problem With Linked Data is That Things Don’t Quite Link Up”

  1. You raise some important issues here. Although we have OWL ontologies to accompany our data I think they are probably just as impenetrable as the data. It’s a tricky balance between putting too much and too little information in there.

    One thing that is in the linked data improvement plan at OS is better documentation and possibly simple UMLesque diagrams. Would that be useful to you?

    What sort of things could we do to the OS site to make your linked data experience a more pleasant one?

    1. I think carefully worked demos that work as tutorials can help, particularly when they contain code that can be pinched to get folk up and running as quickly as possible (I can see what it does, but don’t know how it does it… i.e. good enough for reuse right now.)

      As far as the OS stuff goes, I was wondering whether a treemap or similar style visualisation might go some way towards helping folk see how the different areas relate to each other? Yes, there are exceptions, but sometimes there’s no point trying to explain the full complexity of a data model to someone who prefers to build up their understanding of it, exceptions and all, from simple beginnings…

      This is one reason I try to blog stuff as quickly as possible when I’m struggling with something… in retrospect, things may seem obvious, but by logging my own learning journey it can act as a useful reference for creating teaching/learning materials in the future, as well as leaving a learning trail for others who are similarly struggling at the current time!

      Sort of related to this is the SEO aspect (as hinted at in https://blog.ouseful.info/2010/12/13/teaching-answers/ ) and the way in which we make guidance materials discoverable. As an expert, it’s easy to write stuff intended as tutorial intros in expert, rather than naive, language. To illustrate this, I quote examples of course catalogue pages that describe a course using the language that you would expect to have learned coming out of the course, not the language you’re likely to use on the way in (SEO should in part market courses you want to take, not what you have taken…;-)

  2. I agree Tony. It completely discourages the open data hacker/tinkerer. When I want to match some data, I want to get onto exploring the data quickly. I don’t have the time to spend cleaning the data or finding the identifiers.

    I’d say that about 80% of times I see some open data, and think of a really exciting way of mixing it with other data, I don’t bother because the IDs are wrong/missing.

    [I also saw this lots with the Guardian’s PA ID for Constituencies during the election – I never did find a decent list of what was what.]

    When I did some stuff for the Open Data Hackathon the other week, I matched arts funding data with a latitude/longitude dataset – fortunately the names of the constituencies matched perfectly so I didn’t have to do any cleansing. Details here:
    http://www.thedatastudio.co.uk/blog/the-data-studio-blog/andy-cotgreave/open-data-hackathon-oxford
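
    The kind of name-based join described above can be sketched in a few lines – the data values here are made up, but the mechanics (and the fragility) are the point: an exact string match on names is the join key, so any spelling mismatch silently drops a row.

    ```python
    # Illustrative data only: funding amounts and coordinates are invented
    funding = {"Oxford East": 120000, "Oxford West and Abingdon": 95000}
    locations = {"Oxford East": (51.75, -1.21), "Oxford West and Abingdon": (51.78, -1.30)}

    # Join on the constituency name; only works because the names match exactly
    joined = {
        name: {"funding": amount, "latlon": locations[name]}
        for name, amount in funding.items()
        if name in locations
    }
    ```

    With shared, stable identifiers in both datasets, the `if name in locations` guard (and the luck it represents) wouldn’t be needed.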

    1. I think something that would really help with the uptake of data are micro-demos of how to link from one data set to another whenever data is published. It doesn’t have to be a full blown app, just a demonstration of how to take two technical Lego bricks and show the different ways they can be connected together…

    1. Yes – agreed – the two posts you used were useful (though I did get flummoxed trying to use bits of the location/location/location post to find a county for a Southampton postcode…;-)

  3. The challenge with demos is that one-size-doesn’t-fit-all.

    If all datasets had proper IDs, then we could do demos in any of our favoured tools, be they Tableau, MySQL, R, Yahoo Pipes, or whatever.

    However, how could you ever demo linking the council data that this post started off with? You would need to know how to search for potential ID lists. You’d need to know where to search for them. You’d need pretty advanced skills in Excel or a database technology. Most of the skills required to match this kind of data can only be learnt by experience. I can’t see how you could encapsulate that kind of thing in a step-by-step demo.

    [I’m not meaning to be a killjoy, mind!]

    1. In the current example, all I wanted to do was show I could get something out of a UK gov linked datastore using R342 as a component in a SPARQL query onto a gov endpoint… ;-)
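
    Something along these lines is what I had in mind – build a SPARQL query with R342 in it and fire it at an endpoint. The endpoint URL and the assumption that any triple store out there actually knows about R342 are both unverified (which is rather the point of the post); the request is constructed but not sent here.

    ```python
    from urllib.parse import urlencode

    # Assumed endpoint - the National Statistics Linked Data SPARQL service
    ENDPOINT = "http://statistics.data.gov.uk/sparql"

    # Look for any resource with "R342" as a literal value, whatever it might be
    query = """
    SELECT ?s ?p WHERE {
      ?s ?p "R342" .
    } LIMIT 10
    """

    # The GET request you'd issue (not actually executed here)
    request_url = ENDPOINT + "?" + urlencode({"query": query, "format": "json"})
    ```

    If that query came back empty, it would at least confirm that R342 isn’t an identifier the endpoint knows anything about.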

  4. All very chicken and egg isn’t it. The LD web feels like the web did before we had any tools or decent search engines. You need to get some data out there in order to build stuff, but it’s hard to build stuff without the tools.

    But then people like you experimenting inform developers as to what kinds of tools/services are needed to make it easy.

  5. I certainly agree with the need to make the models clear, choose simple models where possible, support them with documentation and examples. Also to use things like the Linked Data API and visualizations over it to make it possible to explore and query linked data and grok the model, without having to learn SPARQL.

    But …

    The post is mis-titled. Isn’t the first part of your story a great example of where linked data would help? You have a spreadsheet which uses internal codes which no one else seems to use. If instead it had been published as linked data, or if someone else out there understands R* codes and can republish as linked data, reusing council URIs, then it seems to me that would be a great help for what you are trying to do.

    Dave

  6. Hoping someone can help.

    I’ve got a password protected page where I want to display a list of people attending a meeting. The list is in the google spreadsheet and my intention is to export in CSV format and display it in a text widget.

    I want to show Name and Company and then the details below.

    Can anyone help?

    I saw the idea here

    http://cogdogblog.com/2009/08/31/google-spreadsheets/

    But just can’t seem to get it to work :/
