The Guardian OpenPlatform DataStore – Just a Toy, or a Trusted Resource?

When the Guardian launched their OpenPlatform DataStore, a collection of public data, curated by Guardian folk, hosted on Google Spreadsheets, it raised the question of whether this initiative would influence the attitude of the Office for National Statistics, and in particular the way they publish their results (e.g. Guardian Data Store: threat to ONS or its saviour?).

In the three sexy skills of data geeks, Michael Driscoll reinterprets Google’s Chief Economist’s prediction that “the sexy job in the next ten years will be statisticians… The ability to take data – to be able to understand it, to process it, to extract value from it, to visualize it, to communicate it” with his belief that “the folks to whom Hal Varian [i.e. Google’s Chief Economist] is referring are not statisticians in the narrow sense, but rather people who possess skills in three key, yet independent areas: statistics [studying], data munging [suffering] and data visualization [storytelling]”.

I’ve already suggested that what I’d quite like to see is plug’n’play public data that’s easy for people to play with in a variety of ways. Publishing it via Google Spreadsheets certainly lowers quite a few technical barriers to entry, which can make life easier for the statisticians and the visualisers, and reduce the need for the data mungers, the poor folks who go through “the painful process of cleaning, parsing, and proofing one’s data before it’s suitable for analysis. Real world data is messy”. It also provides access to data that is otherwise difficult to get at: “related to munging but certainly far less painful is the ability to retrieve, slice, and dice well-structured data from persistent data stores”.

But if you don’t take care of the data you’re publishing, then even though there are friendly APIs to the data, it doesn’t necessarily follow that the data will be useful.

As Steph Gray says in Cui bono? The problem with opening up data:

Here’s my thought: open data needs a new breed of data gardeners – not necessarily civil servants, but people who know data, what it means and how to use it, and have a role like the editors of Wikipedia or the mods of a busy forum in keeping it clean and useful for the rest of us. … Support them with some data groundsmen with heavy-lifting tools and technical skills to organise, format, publish and protect large datasets.

So with all that in mind, is the Guardian DataStore adding value to its data in an accessibility sense? That is, does it reduce the need for data mungers to process the data before it can be used in a plug’n’play way by the statisticians and the data visualisers, whether they’re professionals, amateurs or good old Jo Public?

As a way in to this question, let’s look at the various HE datasets. The Guardian has published several of these:

Get the full university tables – as a spreadsheet
University research department rankings
Drop out rates for every university

Before we look at the data, though, let’s look at the URIs to see if the architecture of the site makes it easy to discover potentially related datasets. (Finding data is another of the skills related to the black arts of the data mungers, I think?!;-)

The URI for the metapage that hosts a link to the RAE/research data blog post is:
http://www.guardian.co.uk/news/datablog+education/research,
and the one that links to the teaching related posts is:
http://www.guardian.co.uk/news/datablog+education/higher-education.
Going back up the common path to http://www.guardian.co.uk/news/datablog+education/ we get…. a 404 :-(

Hmmm… So how come the datablog+education page doesn’t link down to the HE collection pages, as well as the schools data blog pages (e.g. these are both valid:
http://www.guardian.co.uk/news/datablog+education/school-tables and
http://www.guardian.co.uk/news/datablog+education/primary-school-league-tables
and might naturally be expected to be linked to from:
http://www.guardian.co.uk/news/datablog+education/).
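The "go back up the path" check can be sketched in a few lines of Python. This is only an illustration of the idea: the function name parent_uris is my own, and it simply lists the ancestor URIs you might then try fetching to see which resolve and which 404.

```python
from urllib.parse import urlsplit, urlunsplit

def parent_uris(uri):
    """Yield each ancestor of a URI by stripping one path segment at a time."""
    parts = urlsplit(uri)
    segments = [s for s in parts.path.split("/") if s]
    # Drop trailing segments one by one; stop before the bare site root.
    for depth in range(len(segments) - 1, 0, -1):
        path = "/" + "/".join(segments[:depth]) + "/"
        yield urlunsplit((parts.scheme, parts.netloc, path, "", ""))

start = "http://www.guardian.co.uk/news/datablog+education/higher-education"
for candidate in parent_uris(start):
    print(candidate)
```

Each candidate could then be requested in turn; a site with a discoverable architecture would return an index page at every level.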

Looking back to the HE teaching related datasets, we see they are both listed on the http://www.guardian.co.uk/news/datablog+education/higher-education page. So might we then expect them to be ‘compatible’ datasets in some sense?

That is, do the HE data sets share common values, for instance in the way the HEIs are named?

If we generate a couple of queries on to the university satisfaction tables and the dropout tables (maybe trying to look for correlations between drop out rate and student satisfaction) by pulling the results from different queries on those tables in to a data grid within a Google spreadsheet (cf. the approach taken in Using Google Spreadsheets and Viz API Queries to Roll Your Own Data Rich Version of Google Squared on Steroids (Almost…)), what do we get?
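For anyone wanting to run this sort of query from outside a spreadsheet, here is a minimal sketch of building a Visualization API query URL in Python. The assumptions are mine: it uses the gviz/tq endpoint form that Google Sheets exposes for shared spreadsheets, SPREADSHEET_KEY is a placeholder rather than a real DataStore key, and column A is only guessed to hold the institution name.

```python
from urllib.parse import urlencode

def viz_query_url(spreadsheet_key, query, out="csv"):
    """Build a Visualization API query URL for a shared Google spreadsheet."""
    base = "https://docs.google.com/spreadsheets/d/%s/gviz/tq" % spreadsheet_key
    # tq carries the query text; tqx=out:csv asks for CSV rather than JSON.
    params = {"tq": query, "tqx": "out:" + out}
    return base + "?" + urlencode(params)

# Placeholder key and assumed column letter: adjust both for a real sheet.
url = viz_query_url("SPREADSHEET_KEY", "select * where A contains 'Leeds'")
print(url)
```

Running the same query text against two different spreadsheet keys is what lets the results be lined up side by side.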

Here’s a search for “Leeds”, for example:

One table contains items:

– Leeds Trinity & All Saints
– Leeds
– Leeds Met

and the other contains:

– Leeds College of Music
– The University of Leeds
– Leeds Metropolitan University
– Leeds Trinity and All Saints

So already, even with quite a young datastore, we have an issue with data quality. In Data Driven: Profiting from Your Most Important Business Asset, Thomas Redman identifies “seven common data quality issues”, which include the related problems of too much data (i.e. multiple copies of the same data in different places – that is, redundancy), data inconsistency across sources (not a problem the datastore suffers from – yet?) and poor data definition (p41 – preview available on Google books?).

This latter issue, poor data definition, is evident in the naming of the HEIs above: I can’t simply import the overall tables and dropout tables into DabbleDB and let it magically create a combined table based on common (i.e. canonical) HEI names (using the approach described in Mash/Combining Data from Three Separate Sources Using Dabble DB, for example) because the HEIs don’t have common names.
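To show the kind of munging this forces on anyone who wants to combine the tables, here is a rough Python sketch that reduces each name to a matching key. The rules (which abbreviations to expand, which words to ignore) are my own guesses based only on the Leeds examples above, and they would certainly need extending for the full list of institutions.

```python
import re

ABBREVIATIONS = {"met": "metropolitan"}    # assumed expansions
NOISE_WORDS = {"the", "university", "of"}  # assumed to carry no identity

def canonical(name):
    """Reduce an institution name to a rough matching key."""
    text = name.lower().replace("&", " and ")
    tokens = re.findall(r"[a-z]+", text)
    tokens = [ABBREVIATIONS.get(t, t) for t in tokens]
    return " ".join(t for t in tokens if t not in NOISE_WORDS)

table_a = ["Leeds Trinity & All Saints", "Leeds", "Leeds Met"]
table_b = ["Leeds College of Music", "The University of Leeds",
           "Leeds Metropolitan University", "Leeds Trinity and All Saints"]

# Index the second table by key, then look up each name from the first.
keys_b = {canonical(n): n for n in table_b}
for name in table_a:
    print(name, "->", keys_b.get(canonical(name), "NO MATCH"))

keys_a = {canonical(n) for n in table_a}
unmatched = [n for n in table_b if canonical(n) not in keys_a]
print("only in second table:", unmatched)
```

Hand-written rules like these are brittle, which is rather the point: every consumer of the data ends up reinventing them.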

So what does Redman have to say about this (p.55)?

Find and fix errors
Prevent them at their source [in this case, the error is inconsistency and could have been prevented by using a common HEI naming scheme, OR providing another unique identifier that could act as a key across multiple data tables; but name is easier – because name is what people are likely to search by…].

(See also Redman’s “Hierarchy of Data and Information Needs” (p. 58), which identifies the need for consistency across sources.)
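The unique identifier idea might look something like the following Python sketch: a separate lookup table maps every name variant to one key, and tables are joined on the key rather than on the name. The identifiers (hei-leeds and so on) are invented purely for illustration, and the rows carry nothing but the name each table uses.

```python
# Hypothetical identifiers: any stable, unique code per institution would do.
ALIASES = {
    "Leeds": "hei-leeds",
    "The University of Leeds": "hei-leeds",
    "Leeds Met": "hei-leeds-met",
    "Leeds Metropolitan University": "hei-leeds-met",
    "Leeds Trinity & All Saints": "hei-leeds-trinity",
    "Leeds Trinity and All Saints": "hei-leeds-trinity",
    "Leeds College of Music": "hei-leeds-music",
}

def join_on_id(rows_a, rows_b, aliases=ALIASES):
    """Pair up rows from two tables that resolve to the same identifier."""
    # An unknown name raises KeyError, which flags a gap in the alias table.
    by_id = {aliases[name]: row for name, row in rows_b.items()}
    joined = {}
    for name, row in rows_a.items():
        key = aliases[name]
        if key in by_id:
            joined[key] = (row, by_id[key])
    return joined

first = {n: {"name": n} for n in
         ["Leeds Trinity & All Saints", "Leeds", "Leeds Met"]}
second = {n: {"name": n} for n in
          ["Leeds College of Music", "The University of Leeds",
           "Leeds Metropolitan University", "Leeds Trinity and All Saints"]}

joined = join_on_id(first, second)
for key, (a, b) in sorted(joined.items()):
    print(key, ":", a["name"], "<->", b["name"])
```

A lookup table of this sort sits alongside the published spreadsheets, so the names already in them would not have to change.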

Note that we have a problem though – the datastore curators can’t change the names in the current spreadsheets, because people may already be using them and keying on the current name format. We shouldn’t create another spreadsheet containing the same data because that causes duplication/redundancy? So what would be the best approach? Answers on the back of a postcard to, err, the Guardian datastore, I guess?!;-)

So is it the Guardian’s job to be curating this data, or tending it as one of Steph’s data gardeners/groundsmen might? If they want it to be a serious resource, then I would say so. But if it’s just a toy? Well, who cares…?

PS Just in passing, what other value might the DataStore add to spreadsheets to make them more amenable to “mashups”? For data like the university data, providing geo-data might be interesting (even at the crude level of just providing a single geographical co-ordinate for the central location of the institution). If I could easily get geo-data for the HEIs, and combine it with the satisfaction tables or dropout rates, it would be trivial to generate map based views of the data.
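As a sketch of how little work the map step would be once the names line up, here is some Python that merges a name-to-coordinate table with a name-to-value table and emits GeoJSON points. Everything in the sample data is an assumption: the coordinates are rough and illustrative, and the figure is a placeholder, not a number from any Guardian table.

```python
import json

def to_geojson(locations, values, value_name):
    """Combine name->(lat, lon) and name->value into a GeoJSON FeatureCollection."""
    features = []
    for name, (lat, lon) in locations.items():
        if name not in values:
            continue  # no data row under this exact name
        features.append({
            "type": "Feature",
            # GeoJSON puts longitude first, then latitude.
            "geometry": {"type": "Point", "coordinates": [lon, lat]},
            "properties": {"name": name, value_name: values[name]},
        })
    return {"type": "FeatureCollection", "features": features}

# Rough, illustrative coordinates and a placeholder figure (NOT real data).
locations = {"Leeds": (53.81, -1.55), "Leeds Met": (53.80, -1.55)}
values = {"Leeds": 1.0}
collection = to_geojson(locations, values, "placeholder_value")
print(json.dumps(collection, indent=2))
```

Note that "Leeds Met" silently drops out because the value table has no row under that exact name, which is the naming problem all over again.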

PPS one other gripe I have with the Guardian datablog, where many of the datastore data sets are announced, is that the links are misleading:

Now call me naive, but I’d expect those DATA links to point to spreadsheets, as indeed the first two do, but the third points to another blog post and so I’ve lost trust in being able to use those DATA links (e.g. in a screenscraper) as a direct pointer to a spreadsheet.

Chasing Data – Are You Datablogging Yet?

It’s strange to think that the web search industry is only 15 years or so old, and in that time the race has been run on indexing and serving up results for web pages, images, videos, blogs, and so on. The current race is focused on chasing the mobile (local) searcher, making use of location awareness to serve up ads that are sensitive to spatial context, but maybe it’s data that is next?

(Maybe I need to write a “fear post” about how we’re walking into a world with browsers that know where we are, rather than “just” GPS enabled devices and mobile phone cell triangulation? ;-) [And, err, it seems Microsoft are getting in there too: Windows 7 knows where you are – “So just what is it that Microsoft is doing in Windows 7? Well, at a low level, Microsoft has a new application programming interface (API) for sensors and a second API for location. It uses any of a number of things to actually get the location, depending on what’s available. Obviously there’s GPS, but it also supports Wi-Fi and cellular triangulation. At a minimum.”]

So… data. Take for example this service on the Microsoft Research site: Data Depot. To me, this looks like a site that will store and visualise your telemetry data, or more informally collected data (you can tweet in data points, for example):

Want to ‘datablog’ your running miles or your commute times or your grocery spending? DataDepot provides a simple way to track any type of data over time. You can add data via the web or your phone, then annotate, view, analyze, and add related content to your data.

Services like Trendrr have also got the machinery in place to take daily “samples” and produce trend lines over time from automatically collected data. For example, here are some of the data sources they can already access:

  • Weather details – High and the low temperatures on weather.com for a specific zipcode.
  • Amazon Sales Rank – Sales rank on amazon.com
  • Monster Job Listings – Number of job results from Monster.com for the given query in a specific city.

Now call me paranoid, but I suddenly twigged why I thought the Google announcement about an extension to the Google Visualisation API that will “enabl[e] developers to display data from any data source connected to the web (any database, Excel spreadsheet, etc.), not just from Google Spreadsheets” could have some consequences.

At the moment, the API will let you pull datatable formatted data from your database into the Google namespace. But suppose the next step is for the API to make a call on your database using a query you have handcrafted; then add in some fear that Google has already sussed out how to Crawl through HTML forms by parsing a form and then automatically generating and posting queries using those forms to find more links from deep within a website, and you can see how giving the Google API a single query on your database would tell them some “useful info” (?!;-) about your database schema – info they could use to scrape and index a little more data out of your database…

Now of course the Viz API service may never extend that far, and I’m sure Google’s T&Cs would guarantee “good Internet citizenry practices”, but the potential for evil will be there…

And finally, it’s probably also worth mentioning that even if we don’t give the Goog the keys to our databases, plenty of us are in the habit of feeding public data stores anyway. For example, there are several sites built specifically around visualising user submitted data (if you make it public…): Many Eyes and Swivel, for example. And then of course, there’s also Google Spreadsheets, DabbleDB, Zoho sheet etc etc.

The race for data is on… what are the consequences?!;-)

PS see also Track’n’graph, iCharts and widgenie. Or how about Daytum and mycrocosm.

Also related: “Self-surveillance”.