When the Guardian launched their OpenPlatform DataStore, a collection of public data, curated by Guardian folk, hosted on Google Spreadsheets, it raised the question as to whether this initiative would influence the attitude of the Office of National Statistics, and in particular the way they publish their results (e.g. Guardian Data Store: threat to ONS or its saviour?).
In the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy job in the next ten years will be statisticians… The ability to take data—to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it” with his belief that “the folks to whom Hal Varian i.e. [Google’s Chief Economist] is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics [studying], data munging [suffering] and data visualization [storytelling]”
I’ve already suggested that what I’d quite like to see is plug’n’play public data that’s easy for people to play with in a variety of ways, and publishing it via Google Spreadsheets certainly lowers quite a few barriers to entry from a technical perspective which can make life easier for statisticians and the visualisers, and reduce the need for the data mungers, the poor folks who go through “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy” as well as providing access to data where it is difficult to access: “related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores”.
But if you don’t take care of the data you’re publishing, the even though there are friendly APIs to the data it doesn’t necessarily follow that the data will be useful.
As Steph Gray says in Cui bono? The problem with opening up data:
Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. … Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets.
So with all that in mind, is the Guardian DataStore adding value to the data in the data store in an accessibility sense by reducing the need for data mungers to have to process the data so that it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?
As a way in to this question, let’s look at the various HE datasets. The Guardian has published several of these:
Before we look at the data, though, let’s look at the URIs to see if the architecture of the site makes it easy to discover potentially related datasets. (Finding data is another of the skill related to the black arts of the data mungers, I think?!;-)
The URI for the metapage that hosts a link to the RAE/research data blog post is:
and links to the teaching related posts is:
Going back up the common path to http://www.guardian.co.uk/news/datablog+education/ we get…. a 404 :-(
Hmmm… So how come the datablog+education page doesn’t link down to the HE collection pages, as wll as the schools data blog pages (e.g. these are both valid:
– http://www.guardian.co.uk/news/datablog+education/school-tables and
and might naturally be expected to be linked to from:
Looking back to the HE teaching related datasets, we see they are both listed on the http://www.guardian.co.uk/news/datablog+education/higher-education page. So might we then expect them to be ‘compatible’ datasets in some sense?
That is, do the HE data sets share common values, for instance in the way the HEIs are named?
If we generate a couple of queries on to the university satisfaction tables and the dropout tables (maybe trying to look for correlations between drop out rate and student satisfaction) by pulling the results from different queries on those tables in to a data grid within a Google spreadsheet (cf. the approach taken in Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…)), what do we gt?
Here’s a search for “Leeds”, for example:
One table contains items:
– Leeds Trinity & All Saints
– Leeds Met
and the other contains:
– Leeds College of Music
– The University of Leeds
– Leeds Metropolitan University
– Leeds Trinity and All Saints
So already, even with quite a young datastore, we have an issue with data quality. In Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman identifies “seven common data quality issues) which include the related problems of too much data (i.e. multiple copies of the same data in different places – that is, redundancy) and data inconsistency across sources (not a problem the datastore suffers from – yet?) and poor data definition (p41 -preview available on Google books?).
This latter issue, poor data definition, is evident in the naming of the HEI institutions above: I can’t simply import the overall tables and dropout tables into DabbleDB and let it magically create a combined table based on common (i.e. canonical) HEI names (using the approach described in Mash/Combining Data from Three Separate Sources Using Dabble DB), for example) because the HEIs don’t have common names.
So what does Redmond have to say about this (p.55)?
– Find and fix errors
– Prevent them at their source [in this case, the error is inconsistency and could have been prevented by using a common HEI naming scheme, OR providing another unique identifier that could act as a key across multiple data tables; but name is easier – because name is what people are likely to search by…).
(See also Redmond’s “Hierarchy of Data and Information Needs”, (p. 58), which identifies the need for consistency across sources.)
Note that we have a problem though – the datastore curators can’t change the names in the current spreadsheets, because people may already be using them and keying on the current name format. We shouldn’t create another spreadsheet containing the same data because that causes duplication/redundancy? So what would be the best approach? Answers on the back of a postcard to, err, the Guardian datastore, I guess?!;-)
So is it the Guardian’s job to be curating this data, or tending it as one of Steph’s data gardeners/groundsmen might? If they want it to be a serious resource, then I would say so. But if it’s just a toy? Well, who cares…?
PS Just in passing, what other value might the DataStore add to spreadsheets to make them more amenable to “mashups”? For data like the university data, providing geo-data might be interesting (even at the crude level of just providing a single geographical co-ordinate for the central location of the institution). If I could easily get geo-data for the HEIs, and combine it with the satisfaction tables or dropout rates, it would be trivial to generate map based views of the data.
PPS one other gripe I have with the Guardian datablog, where many of the datastore data sets are announced, is that the links are misleading:
Now call me naive, but I’d expect those DATA links to point to spreadsheets, as indeed the first two do, but the third points to another blog post and so I’ve lost trust in being able to use those DATA links (e.g. in a screenscraper) as a direct pointer to a spreadsheet.