Plug’n’Play Public Data

Whilst listening to Radio 4’s Today programme this morning, I was pleasantly surprised to hear an interview with Hans Rosling about making stats’n’data relevant to Joe Public (you can find the interview, along with a video overview of the Gapminder software, here: Can statistics be beautiful?).

The last few weeks have seen the US Government getting into the data transparency business with the launch of data.gov, whose purpose is “to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. The site offers access to a wide range of US Government datasets in a range of formats – XML, CSV, KML etc. (The site also gives links to widgets and other (partner?) sites that expose data.)

Providing URIs directly to CSV files, for example, means that it is trivial to pull the data into online spreadsheets/databases, such as Google spreadsheets, or Dabble DB, or visualisation tools such as Many Eyes Wikified; and for smaller files, Yahoo Pipes provides a way of converting CSV or XML files to JSON that can be easily pulled into a web page.
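That CSV-to-JSON step is simple enough to sketch – this is roughly the transformation Yahoo Pipes performs behind the scenes (a minimal Python illustration, not how Pipes itself is built; the sample data is made up):

```python
import csv
import io
import json

def csv_to_json(csv_text):
    """Parse CSV text (e.g. fetched from a published data URL)
    and emit a JSON array of row objects, one object per data row."""
    rows = list(csv.DictReader(io.StringIO(csv_text)))
    return json.dumps(rows)

# A toy stand-in for a downloaded government dataset:
sample = "state,population\nCalifornia,36756666\nTexas,24326974"
print(csv_to_json(sample))
```

The resulting JSON can be dropped straight into a web page via a script tag, with no server-side help needed.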

Realising that there may be some business in public data, Microsoft, Amazon and Google have all been sniffing around this area too: for Microsoft, it’s the Open Government Data Initiative (OGDI); for Amazon, it’s big data via Public Datasets on AWS; and for Google… well, Google. They bought Rosling’s Trendalyzer, of course, and recently made a brief announcement about Public Data on Google, as well as Google Squared, which is yet to be seen in public. With the publication of a Java support library for the Google Visualization API open-wire protocol/query language, you can see them trying to get their hooks into other people’s data. (The thing is, the query language is just so darned useful;-) Wolfram Alpha recently opened up their computational search over a wide range of curated data sets, and Yahoo? They’re trying to encourage people to make glue, I think, with YQL, YQL Execute and YQL Open Data Tables.

In the UK, we have the National Statistics website (I’m not even going to link to it, it’s that horrible..) as well as a scattered collection of resources as listed on the Rewired State: APIs wiki page; and, of course, the first steps of a news media curated datastore from the Guardian.

But maybe things are set to change? In a post on the Cabinet Office Digital Engagement blog, Information and how to make it useful, Richard (Stirling?) picks up on Recommendation 14 of the POIT (Power of Information Taskforce) Review Final Report, which states:

Recommendation 14
The government should ensure that public information data sets are easy to find and use. The government should create a place or places online where public information can be stored and maintained (a ‘repository’) or its location and characteristics listed (an online catalogue). Prototypes should be running in 2009.

and proposes starting a conversation about “a UK version of data.gov”:

What characteristics would be most useful to you – feeds (ATOM or RSS) or bulk download by e.g. FTP, etc?
Should this be an index or a repository?
Should this serve particular types of data e.g. XML, JSON or RDF?
What examples should we be looking at (beyond, e.g., data.gov)?
Does this need its own domain, or should it sit on an existing supersite (e.g.

I posted my starter for 10 thoughts as a comment to that post (currently either spamtrapped, or laughed out of court), but there’s already some interesting discussion started there, as well as a thoughtful response on Steph Gray’s Helpful Technology blog (Cui bono? The problem with opening up data), which picks up on “some more fundamental problems than whether we publish the data in JSON or RSS”, such as:

– Which data?
– Who decides whether to publish?
– Who benefits?
– Who pays?
– For how long?

My own stance is from a purely playful, and maybe even a little pragmatic, position: so what?

There are quite a few ways of interpreting this question of course, but the direction I’ll come at it (in this post at least) is in terms of use by people whose job it isn’t…

Someone like me… so a population of one, then… ;-)

So what do I know? I know how to cut and paste URLs into things, and I know how to copy other people’s code and spot what bits I need to change so that it does “stuff with my stuff”.

I know that I can import CSV and Excel spreadsheets that are hosted online from their URL into Google spreadsheets, and from a URL as CSV into something like Dabble DB (which also lets me essentially merge data from two sources into a new data table). Yahoo Pipes also consumes CSV. I know that I can get CSV out of a Google spreadsheet or Dabble DB (or from a Yahoo pipe if CSV went in). I know that I can plot KML or GeoRSS files on a Google map simply by pasting the URL into a Google map search box. I know I can get simple XML into a Google spreadsheet, and more general XML into a Yahoo Pipe. I know that YQL will also let me interrogate XML files and emit the results as XML or JSON. Pipes is good at emitting JSON too. (JSON is handy because you can pull it into a web page without requiring any help from script running on a server.) I’ve recently discovered that the Google Visualization API query language and open-wire protocol let me run queries on datastores that support it, such as Google spreadsheets and Pachube. I know that Many Eyes Wikified will ingest CSV and then allow me to easily create a set of interactive visualisations.
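The Dabble DB trick of merging two sources into a new data table is worth unpacking, since it’s the one step above that isn’t a straight format conversion. A minimal Python sketch of that sort of key-based join (the tables and column names are invented for illustration):

```python
import csv
import io

def merge_on_key(csv_a, csv_b, key):
    """Join rows from two CSV tables that share a key column,
    producing one combined record per row of the first table."""
    # Index the second table by its key column for fast lookup...
    b_rows = {row[key]: row for row in csv.DictReader(io.StringIO(csv_b))}
    merged = []
    for row in csv.DictReader(io.StringIO(csv_a)):
        # ...then bolt matching columns onto each row of the first table.
        merged.append({**row, **b_rows.get(row[key], {})})
    return merged

population = "area,population\nEngland,51\nWales,3"
spend = "area,spend\nEngland,100\nWales,8"
print(merge_on_key(population, spend, "area"))
```

Do that merge once, and the combined table can then be republished as CSV or JSON for the next tool in the chain.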

So what would I want from a UK version of data.gov, and why?

– CSV, XML and JSON output, with KML/GeoRSS where appropriate, keyed by a simple URI term;
– a sensible (i.e. a readable, hackable) URI pattern for extracting data: good examples are the BBC Programmes website and Google spreadsheets (e.g. where you can specify cell ranges);
– data available from a URI via an HTTP GET (not POST; GETable resources are easily pulled into other services, POST requested ones aren’t; don’t even think about SOAP;-);
– if possible, being able to query data or extract subsets of it: YQL and the Google Viz API query language show a possible way forward here. Supporting the Google open-wire protocol, or defining YQL open data tables for data sets brings the data into an environment where it can be interrogated or subsetted. (Pulling cell ranges from spreadsheets is only useful where the cells you want are contiguous.)
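To make the “readable, hackable URI pattern” wish concrete, here’s one hypothetical shape such a catalogue’s URLs might take – the domain, dataset name and filter parameters are all invented for illustration, not a description of any real service:

```python
from urllib.parse import urlencode

def dataset_uri(base, dataset, fmt="csv", **filters):
    """Build a guessable dataset URI of the form /data/<dataset>.<format>,
    with optional query-string filters for extracting a subset."""
    uri = f"{base}/data/{dataset}.{fmt}"
    if filters:
        # Sort the filters so equivalent queries produce identical URIs.
        uri += "?" + urlencode(sorted(filters.items()))
    return uri

# e.g. a whole-dataset CSV feed, and a filtered JSON subset of the same
# (imaginary) dataset:
print(dataset_uri("http://data.example.gov.uk", "road-casualties"))
print(dataset_uri("http://data.example.gov.uk", "road-casualties", "json",
                  region="wales", year=2008))
```

The point is that once the pattern is learned from one example URL, a user can hand-edit it to fetch a different dataset, format or subset – which is exactly what makes BBC Programmes-style URIs so hackable.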

Although it pains me to suggest hooking into yet more of the Googleverse, a UK version of data.gov could do worse than support the Google Visualization API open-wire protocol. Why? Well, for example, with only an hour or two’s coding, I was able to pull together a site that added a front end on to the Guardian datastore files on Google spreadsheets: First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground. (Okay, okay, I know – it shows that I only spent a couple of hours on it… but it was enough to demonstrate a sort of working rapid prototype…;-)
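For the record, the open-wire protocol amounts to little more than an HTTP GET with an SQL-ish query carried in the URL, which is exactly what makes it so easy to build front ends against. A sketch of constructing such a query URL against a public spreadsheet (the endpoint pattern is as I understand it at the time of writing; the key and query here are placeholders):

```python
from urllib.parse import quote

def gviz_query_url(spreadsheet_key, query):
    """Build a Google Visualization API query URL for a public Google
    spreadsheet: the tq parameter carries the query language statement."""
    return ("http://spreadsheets.google.com/tq?"
            f"tq={quote(query)}&key={spreadsheet_key}")

print(gviz_query_url("SPREADSHEET_KEY",
                     "select A, B where C > 100 order by C desc"))
```

A GETable URL like that can be pasted into a browser, wired into a web page, or handed to a charting component – no server-side code required on the consumer’s side.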

As to whether the data is useful, or who’s going to use it, or why they’re going to use it, I don’t know: but I suspect that if it isn’t easy to use, then people won’t. If one of the aims of data.gov-style approaches is to engage people in conversations with data, we need to make it easy for them. Essentially, we want people to engage in – not quite ‘enterprise mashups’, more civic mashups. I’m not sure who these people are likely to be – activists, policy wonks, journalists, concerned citizens, academics, students – but they’re probably not qualified statisticians with a black belt in R or SPSS.

So for example, even the Guardian datastore data is quite hard to play with for most people (it’s just a set of spreadsheets, right? So what can I actually do with them?). In contrast, the New York Times Visualization Lab folks have started looking at making it easier for readers to interrogate the data in a visual way with Many Eyes Wikified, which is one reason I started trying to think about what a query’n’visualisation API to the Guardian datastore might look like…

PS just in case the Linked Data folks feel left out, I still think RDF and semweb geekery is way too confusing for mortals. Things like SPARCool are starting to help, but IMHO it’s still way too syntactically quirky for a quick hit… SQL and SQL-like languages are hard enough, especially when you bear in mind that most people don’t know (or care) that advanced search exists on web search engines, let alone what it does or how to use it.

PPS see also National Research Council Canada: Gateway to Scientific Data (via Lorcan Dempsey).

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

9 thoughts on “Plug’n’Play Public Data”

  1. Following on from your question, ‘what do I know’, I think it would be useful for readers of your blog (i.e. me, at least) if you did two things:

    1) Offer more bio info about yourself. Your experience that brought you to do and write about this stuff. When you say, ‘I know how to copy and paste URLs into things…’, I read that as ‘you, the reader, should know how to copy and paste URLs into things…’. Sometimes, I read your posts and think, ‘that’s all very well, but to make this work for me, I’m going to have to learn SQL, Regular Expressions and understand the principles of RESTful architecture’. Knowing a bit more about where you are coming from in terms of skills, would provide useful context, I think.

    2) So, with this in mind (and as I alluded to in a tweet over the weekend), alongside the ‘how to mashup’ posts, I think it would be both interesting and really useful to have an OUseful skills curriculum: a page or category of posts that lays out the basic set of skills that readers should be working on if we’re interested in contributing to the type of (good) work you’re doing. Links to quality resources and tutorials elsewhere would really complement your (educational) work, I think. Many of your posts are well-structured tutorials, but I feel like they’re written for people who have been with you from the start, and there’s nowhere for the new reader to get up to speed with both who is writing and the skills you’re assuming readers should work on in parallel to following your tutorials.

  2. Hi,

    The Talis Connected Commons scheme provides a free service for hosting public data:

    If the data is in the public domain (e.g. PDDL, CC0) then we’ll host it for free. While the core data is all managed as RDF, there are ways to access the data – e.g. using a configurable search engine that supports facets – so that non-semweb people can still get data out as RSS. The Platform also supports an XSLT service, so transforming a SPARQL query result into KML or CSV is easy to do with a bit of URL pipe-lining.

    Front that with a website for browsing datasets, and you’ve got the makings of exactly the kind of infrastructure you want to see.



  3. I forgot to mention that there’s also JSON output from nearly all of the services, so you can use whatever format is easier. Even the RDF output can be serialized to JSON.

  4. Hi Leigh

    Yes – sorry – forgot to mention Talis data hosting (partly because I haven’t had a chance to play with that envt yet..)

