OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

The Guardian OpenPlatform DataStore – Just a Toy, or a Trusted Resource?

When the Guardian launched their OpenPlatform DataStore, a collection of public data, curated by Guardian folk, hosted on Google Spreadsheets, it raised the question as to whether this initiative would influence the attitude of the Office for National Statistics, and in particular the way they publish their results (e.g. Guardian Data Store: threat to ONS or its saviour?).

In the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it” with his belief that “the folks to whom Hal Varian i.e. [Google's Chief Economist] is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics [studying], data munging [suffering] and data visualization [storytelling]”

I’ve already suggested that what I’d quite like to see is plug’n’play public data that’s easy for people to play with in a variety of ways, and publishing it via Google Spreadsheets certainly lowers quite a few barriers to entry from a technical perspective which can make life easier for statisticians and the visualisers, and reduce the need for the data mungers, the poor folks who go through “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy” as well as providing access to data where it is difficult to access: “related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores”.

But if you don’t take care of the data you’re publishing, then even though there are friendly APIs to the data, it doesn’t necessarily follow that the data will be useful.

As Steph Gray says in Cui bono? The problem with opening up data:

Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. … Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets.

So with all that in mind, is the Guardian DataStore adding value to the data in the data store in an accessibility sense by reducing the need for data mungers to have to process the data so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?

As a way in to this question, let’s look at the various HE datasets. The Guardian has published several of these:

- Get the full university tables – as a spreadsheet
- University research department rankings
- Drop out rates for every university

Before we look at the data, though, let’s look at the URIs to see if the architecture of the site makes it easy to discover potentially related datasets. (Finding data is another of the skills related to the black arts of the data mungers, I think?!;-)

The URI for the metapage that hosts a link to the RAE/research data blog post is:
http://www.guardian.co.uk/news/datablog+education/research,
and the one for the page that links to the teaching related posts is:
http://www.guardian.co.uk/news/datablog+education/higher-education.
Going back up the common path to http://www.guardian.co.uk/news/datablog+education/ we get…. a 404 :-(

Hmmm… So how come the datablog+education page doesn’t link down to the HE collection pages, as well as the schools data blog pages (e.g. these are both valid:
http://www.guardian.co.uk/news/datablog+education/school-tables and
http://www.guardian.co.uk/news/datablog+education/primary-school-league-tables
and might naturally be expected to be linked to from:
http://www.guardian.co.uk/news/datablog+education/).
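Walking back up a URI path like this is easy enough to automate. Here's a minimal sketch, using only the Python standard library, that lists the parent URIs you might probe for a given page. The function name `parent_uris` is my own invention, and actually fetching each URI to look for a 404 is left out.

```python
from urllib.parse import urlsplit, urlunsplit


def parent_uris(uri):
    """Return the ancestor URIs of a page, nearest parent first."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    parents = []
    # Drop one trailing path segment at a time until we reach the site root.
    for depth in range(len(segments) - 1, -1, -1):
        path = "/" + "/".join(segments[:depth])
        if depth:
            path += "/"
        parents.append(urlunsplit((parts.scheme, parts.netloc, path, "", "")))
    return parents


page = "http://www.guardian.co.uk/news/datablog+education/higher-education"
for uri in parent_uris(page):
    print(uri)
```

Each URI printed is a candidate "collection" page that a well organised site might be expected to serve.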

Looking back to the HE teaching related datasets, we see they are both listed on the http://www.guardian.co.uk/news/datablog+education/higher-education page. So might we then expect them to be ‘compatible’ datasets in some sense?

That is, do the HE data sets share common values, for instance in the way the HEIs are named?

If we generate a couple of queries on to the university satisfaction tables and the dropout tables (maybe trying to look for correlations between drop out rate and student satisfaction) by pulling the results from different queries on those tables in to a data grid within a Google spreadsheet (cf. the approach taken in Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…)), what do we get?

Here’s a search for “Leeds”, for example:

One table contains items:

- Leeds Trinity & All Saints
- Leeds
- Leeds Met

and the other contains:

- Leeds College of Music
- The University of Leeds
- Leeds Metropolitan University
- Leeds Trinity and All Saints
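To make the mismatch concrete, here's a small Python sketch that compares the two sets of names by exact string match, which is roughly what a naive join on the name column would do. The names are the ones listed above; the variable and function names are just illustrative.

```python
# Names returned by the "Leeds" search against each of the two tables.
table_one = ["Leeds Trinity & All Saints", "Leeds", "Leeds Met"]
table_other = [
    "Leeds College of Music",
    "The University of Leeds",
    "Leeds Metropolitan University",
    "Leeds Trinity and All Saints",
]


def exact_matches(names_a, names_b):
    """Return the names that appear, character for character, in both lists."""
    return sorted(set(names_a) & set(names_b))


print(exact_matches(table_one, table_other))  # prints []
```

Not one institution matches, even though at least three of them plainly appear in both tables.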

So already, even with quite a young datastore, we have an issue with data quality. In Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman identifies “seven common data quality issues”, which include the related problems of too much data (i.e. multiple copies of the same data in different places – that is, redundancy), data inconsistency across sources (not a problem the datastore suffers from – yet?) and poor data definition (p. 41; preview available on Google Books?).

This latter issue, poor data definition, is evident in the naming of the HEIs above: I can’t simply import the overall tables and dropout tables into DabbleDB and let it magically create a combined table based on common (i.e. canonical) HEI names (using, for example, the approach described in Mash/Combining Data from Three Separate Sources Using Dabble DB), because the HEIs don’t have common names.

So what does Redman have to say about this (p. 55)?

- Find and fix errors
- Prevent them at their source [in this case, the error is inconsistency and could have been prevented by using a common HEI naming scheme, OR providing another unique identifier that could act as a key across multiple data tables; but name is easier – because name is what people are likely to search by…].
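As a rough illustration of the "unique identifier" option, here's a Python sketch in which every name variant is mapped to a canonical key, and the two tables are then paired up on that key rather than on the raw name. The keys (`leeds`, `leeds-met` and so on) and the alias table are entirely my own assumptions about which names refer to the same institution; nothing like this is published in the datastore.

```python
# Hypothetical alias table: each name variant maps to an invented canonical key.
ALIASES = {
    "Leeds": "leeds",
    "The University of Leeds": "leeds",
    "Leeds Met": "leeds-met",
    "Leeds Metropolitan University": "leeds-met",
    "Leeds Trinity & All Saints": "leeds-trinity",
    "Leeds Trinity and All Saints": "leeds-trinity",
    "Leeds College of Music": "leeds-music",
}


def join_on_key(names_a, names_b, aliases=ALIASES):
    """Pair names from two tables that share the same canonical key."""
    by_key_b = {aliases[name]: name for name in names_b}
    pairs = {}
    for name in names_a:
        key = aliases[name]
        if key in by_key_b:
            pairs[key] = (name, by_key_b[key])
    return pairs


table_one = ["Leeds Trinity & All Saints", "Leeds", "Leeds Met"]
table_other = [
    "Leeds College of Music",
    "The University of Leeds",
    "Leeds Metropolitan University",
    "Leeds Trinity and All Saints",
]

for key, (a, b) in sorted(join_on_key(table_one, table_other).items()):
    print(key, "|", a, "|", b)
```

The catch, of course, is that someone has to build and maintain that alias table, which is exactly the data gardening job described above.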

(See also Redman’s “Hierarchy of Data and Information Needs” (p. 58), which identifies the need for consistency across sources.)

Note that we have a problem though – the datastore curators can’t change the names in the current spreadsheets, because people may already be using them and keying on the current name format. We shouldn’t create another spreadsheet containing the same data because that causes duplication/redundancy? So what would be the best approach? Answers on the back of a postcard to, err, the Guardian datastore, I guess?!;-)

So is it the Guardian’s job to be curating this data, or tending it as one of Steph’s data gardeners/groundsmen might? If they want it to be a serious resource, then I would say so. But if it’s just a toy? Well, who cares…?

PS Just in passing, what other value might the DataStore add to spreadsheets to make them more amenable to “mashups”? For data like the university data, providing geo-data might be interesting (even at the crude level of just providing a single geographical co-ordinate for the central location of the institution). If I could easily get geo-data for the HEIs, and combine it with the satisfaction tables or dropout rates, it would be trivial to generate map based views of the data.
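By way of a sketch of how trivial that could be: given a table of rows keyed by institution name and a lookup of co-ordinates, a few lines of Python will turn the pair into GeoJSON points that most mapping tools can plot. Everything in the sample data here is a placeholder I've made up (the institution names, the value and the co-ordinate), and the function name `to_geojson` is hypothetical too.

```python
import json


def to_geojson(rows, locations):
    """Build a GeoJSON FeatureCollection from table rows and a name -> (lat, lon) lookup."""
    features = []
    for row in rows:
        loc = locations.get(row["name"])
        if loc is None:
            continue  # no co-ordinate known for this institution, so skip it
        lat, lon = loc
        features.append({
            "type": "Feature",
            # GeoJSON puts longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": dict(row),
        })
    return {"type": "FeatureCollection", "features": features}


# Placeholder data only: invented names, an invented value and an invented location.
rows = [
    {"name": "Example University", "value": 1.0},
    {"name": "Unlocated College", "value": 2.0},
]
locations = {"Example University": (53.8, -1.55)}

print(json.dumps(to_geojson(rows, locations), indent=2))
```

Note that the lookup is by name, so it runs straight into the naming problem described in the main post.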

PPS one other gripe I have with the Guardian datablog, where many of the datastore data sets are announced, is that the links are misleading:

Now call me naive, but I’d expect those DATA links to point to spreadsheets, as indeed the first two do, but the third points to another blog post and so I’ve lost trust in being able to use those DATA links (e.g. in a screenscraper) as a direct pointer to a spreadsheet.
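A screenscraper can at least defend itself with a sanity check on where each DATA link points before treating it as a spreadsheet. Here's a minimal Python sketch; the two URLs are made-up examples, and the test (is the host a Google Spreadsheets host?) is my own assumption about what a "direct pointer" should look like.

```python
from urllib.parse import urlsplit


def looks_like_spreadsheet(url):
    """Guess whether a link points at a Google spreadsheet rather than a web page."""
    parts = urlsplit(url)
    host = (parts.hostname or "").lower()
    if host == "spreadsheets.google.com":
        return True
    return host == "docs.google.com" and parts.path.startswith("/spreadsheets")


# Hypothetical links of the two kinds found behind the DATA labels.
links = [
    "http://spreadsheets.google.com/ccc?key=EXAMPLE_KEY",
    "http://www.guardian.co.uk/news/datablog/example-post",
]
for link in links:
    print(link, looks_like_spreadsheet(link))
```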

Written by Tony Hirst

June 8, 2009 at 9:35 am

Posted in Data, Stirring, Thinkses


8 Responses


  1. I agree Tony. I have been a little disappointed with the data they are putting up. I’d hoped to spend some lunchtimes messing around with their datasets, but have abandoned any hope of that. Why? Because, as you say, the data is so randomly named or collated. It’s simply not possible to quickly pull 2 tables from their tables and compare them. One would have to do a lot of data cleansing just to be able to “play” with it. And do I have the time to do that cleansing? No.

    I also concur that their inconsistent link naming is infuriating. If you say “DATA:” then don’t link to a different blog post!

    Andy Cotgreave

    June 8, 2009 at 10:14 am

  2. Mulling over this post whilst walking the dog just now, I remembered seeing the Guardian Style Guide [ http://www.amazon.co.uk/Guardian-Style-David-Marsh/dp/0852650868?tag=ouseful-21 ] in a bookshop in Leeds a couple of weeks ago.

    Among other things, the Guide identifies preferred spellings, clarifies grammatical forms, and so on. (We use a similar approach writing OU courses.)

    So it occurred to me that here in the world of data, more than anywhere, it’s essential to conform to good practice and stick to at least a preferred house style when compiling data tables.

    So here’s a question to the DataStore maintainers: do you have a data style guide yet?

    Tony Hirst

    June 8, 2009 at 2:12 pm

  3. [...] Here are thoughts from Tony Hirst, one of the first adopters and success stories for the Guardian’s Open Platform, on what the OP’s DataStore is and is not doing, in terms of data curation (or gardening). He asks: “Is the Guardian DataStore adding value to the data in the data store in an accessibility sense: by reducing the need for data mungers to have to process the data, so that it can be used in a plug’n'play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?” [...]

  4. [...] Here are thoughts from Tony Hirst, one of the first adopters and success stories for the Guardian’s Open Platform, on what the OP’s DataStore is and is not doing, in terms of data curation (or gardening). He asks: “Is the Guardian DataStore adding value to the data in the data store in an accessibility sense: by reducing the need for data mungers to have to process the data, so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?” [...]

  5. [...] worth pointing out that the Guardian Datastore is relatively new, and there are some issues, however, I think it’s a great step in the right direction, and another example of why the [...]

  6. [...] Alan Rusbridger explains why data and the Guardian Datastore matters, and for the sake of balance a dissenting voice. [...]

  7. [...] on the Guardian Datastore ires me even more… (I’ve written about this before (e.g. The Guardian OpenPlatform DataStore – Just a Toy, or a Trusted Resource?) so I’m not causing any new offence by saying [...]

