Plug’n'Play Public Data
Whilst listening to Radio 4′s Today programme this morning, I was pleasantly surprised to hear and interview with Hans Rosling about making stats’n'data relevant to Joe Public (you can find the interview, along with a video overview of the Gapminder software, here: Can statistics be beautiful?).
The last few weeks have seen the US Government getting into the data transparency business with the launch of data.gov whose purpose is “to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. The site offers access to a wide range of US Government datasets in a range of formats – XML, CSV, KML etc. (The site also gives links to widgets and other (parent?) sites that expose data.)
Providing URIs directly to CSV fils, for example, means that is is trivial to pull the data into online spreadsheets/databases, such as Google spreadsheets, or Dabble DB, or visualisation tools such as Many Eyes Wikified; and for smaller files, Yahoo Pipes provides a way of converting CSV or XML files to JSON that can be easily pulled in to a web page.
Realising that there may be some business in public data, Microsoft, Amazon and Google have all been sniffing around this area too: for Microsoft, it’s the Open Government Data Initiative (OGDI), for Amazon, it’s big data via Public Datasets on AWS, and for Google… well, Google. They bought Rosling’s Trendalyser, of course, and recently made a brief announcement about Public Data on Google, as well as Google Squared, which is still yet to be seen in public. With the publication of a Java support library for the Google Visualisation API open wire protocol/query language, you can see them trying to get their hooks into other people’s data. (The thing is, the query language is just so darned useful;-) Wolfram Alpha recently opened up their computational search over a wide range of curated data sets, and Yahoo? They’re trying to encourage people to make glue, I think, with YQL, YQL Execute and YQL Open Data Tables.
In the UK, we have the National Statistics website (I’m not even going to link to it, it’s that horrible..) as well as a scattered collection of resources as listed on the Rewired State: APIs wiki page; and, of course, the first steps of a news media curated datastore from the Guardian.
But maybe things are set to change? In a post on the Cabinet Office Digital Engagement blog, Information and how to make it useful, Richard (Stirling?) picks up on Recommendation 14 of the POIT (Power of Information Taskforce) Review Final Report, which states:
The government should ensure that public information data sets are easy to find and use. The government should create a place or places online where public information can be stored and maintained (a ‘repository‘) or its location and characteristics listed (an online catalogue). Prototypes should be running in 2009.
and proposes starting a conversation about “a UK version of data.gov”:
What characteristics would be most useful to you – feeds (ATOM or RSS) or bulk download by e.g. FTP, etc?
Should this be an index or a repository?
Should this serve particular types of data e.g. XML, JSON or RDF?
What examples should we be looking at (beyond data.gov e.g. http://ideas.welcomebackstage.com/data)?
Does this need it’s own domain, or should it sit on an existing supersite (e.g. http://direct.gov.uk)?
I posted my starter for 10 thoughts as a comment to that post (currently either spamtrapped, or laughed out of court), but there’s already some interesting discussion started there, as well as thoughtful response on Steph Gray’s Helpful Technology blog (Cui bono? The problem with opening up data) which picks up on “some more fundamental problems than whether we publish the data in JSON or RSS” such as:
- Which data?
- Who decides whether to publish?
- Who benefits?
- Who pays?
- For how long?
My own stance is from a purely playful, and maybe even a little pragmatic, position: so what?
Someone like me… so a population of one, then… ;-)
So what do I know? I know how to cut and paste URLs in to things, and I know how to copy other peoples’ code and spot what bits I need to change so that it does “stuff with my stuff”.
I know that I can import CSV and Excel spreadsheets that are hosted online from their URL into Google spreadsheets, and from a URL as CSV into something like Dabble DB (which also lets me essentially merge data from two sources into a new data table). Yahoo Pipes also consumes CSV. I know that I can get CSV out of a Google spreadsheet or Dabble DB (or from a Yahoo pipe if CSV went in). I know that I can plot KML or geoRSS files on a Google map simply by pasting the URL into a Google map search box. I know I can get simple XML into a Google spreadsheet, and more general XML into a Yahoo Pipe. I know that YQL will also let me interrogate XML files and emit the results as XML or JSON. Pipes is good as emitting JSON too. (JSON is handy because you can pull it into a web page without requiring and help from script running on a server.) I’ve recently discovered that the Google Visualisation API query language and open wire protocol lets me run queries on datastores that support it, such as Google spreadsheets and Pachube. I know that Many Eyes Wikified will ingest CSV and then allow me to easily create a set of interactive visualisation
So what would I want from a UK version of data.gov, and why?
- CSV, XML and JSON output, with KML/GeoRSS where appropriate, keyed by a simple URI term;
- a sensible (i.e. a readable, hackable) URI pattern for extracting data: good examples are the BBC Programmes website and Google spreadsheets (e.g. where you can specify cell ranges);
- data available from a URI via an HTTP GET (not POST; GETable resources are easily pulled into other services, POST requested ones aren’t; don’t even think about SOAP;-);
- if possible, being able to query data or extract subsets of it: YQL and the Google Viz API query language show a possible way forward here. Supporting the Google open-wire protocol, or defining YQL open data tables for data sets brings the data into an environment where it can be interrogated or subsetted. (Pulling cell ranges from spreadsheets is only useful where the cells you want are contiguous.)
Although it pains me to suggest hooking into yet more of the Googleverse, a UK version of data.gov could do worse than support the Google visualization API open-wire protocol. Why? Well, for example, with only an hour or two’s coding, I was able to pull together a site that added a front end on to the Guardian datastore files on Google spreadsheets: First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground (Okay, okay, I know – it shows that I only spent a couple of hours on it… but it was enough to demonstrate a sort of working rapid prototype…;-)
As to whether the data is useful, or who’s going to use it, or why they’re going to use it, I don’t know: but I suspect that if it isn’t easy to use, then people won’t. If one of the aims of data.gov style approaches is to engage people in conversations with data, we need to make it easy for them. Essentially, we want people to engage in – not quite ‘enterprise mashups’, more civic mashups. I’m not sure who these people are likely to be – activitists, policy wonks, journalists, concrned citizens, academics, students – but they’re probably not qualified statisticians with a blackbelt in R or SPSS.
So for example, even the Guardian datastore data is quite hard to play with for most people (it’s just a set of spreadsheets, right? So what can I actually do with them?). In contrast, the New York Times Visualization Lab folks have started looking at making it easier for readers to intrrogate the data in a visual way with Many Eyes Wikified, which is one reason I started trying to think about what a query’n'visualisation API to the Guardian datastore might look like…
PS just in case the Linked Data folks feel left out, I still think RDF and semweb geekery is way too confusing for mortals. Things like SPARCool are starting to help, but IMHO it’s still way too quirky syntactic for a quick hit… SQL and SQL like languages are hard enough, especially when you bear in mind that most people don’t know (or care) that advanced search exists on web search engines, let alone what it does or how to use it.
PPS see also National Research Council Canada: Gateway to Scientific Data (via Lorcan Dempsey).