Plug’n’Play Public Data

Whilst listening to Radio 4’s Today programme this morning, I was pleasantly surprised to hear an interview with Hans Rosling about making stats’n’data relevant to Joe Public (you can find the interview, along with a video overview of the Gapminder software, here: Can statistics be beautiful?).

The last few weeks have seen the US Government getting into the data transparency business with the launch of data.gov whose purpose is “to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. The site offers access to a wide range of US Government datasets in a range of formats – XML, CSV, KML etc. (The site also gives links to widgets and other (parent?) sites that expose data.)

Providing URIs directly to CSV files, for example, means that it is trivial to pull the data into online spreadsheets/databases, such as Google spreadsheets, or Dabble DB, or visualisation tools such as Many Eyes Wikified; and for smaller files, Yahoo Pipes provides a way of converting CSV or XML files to JSON that can be easily pulled into a web page.
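(If you’re curious what that sort of CSV-to-JSON plumbing looks like without Pipes, here’s a minimal sketch in Python; the URL is a made-up placeholder rather than a real data.gov resource.)

import csv, io, json, urllib.request

DATA_URL = "http://example.gov/dataset/some-data.csv"  # hypothetical placeholder URL

with urllib.request.urlopen(DATA_URL) as resp:
    # Treat the first row as column headings; each remaining row becomes a JSON object
    rows = list(csv.DictReader(io.TextIOWrapper(resp, encoding="utf-8")))

print(json.dumps(rows, indent=2))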

Realising that there may be some business in public data, Microsoft, Amazon and Google have all been sniffing around this area too: for Microsoft, it’s the Open Government Data Initiative (OGDI), for Amazon, it’s big data via Public Datasets on AWS, and for Google… well, Google. They bought Rosling’s Trendalyzer, of course, and recently made a brief announcement about Public Data on Google, as well as Google Squared, which has yet to be seen in public. With the publication of a Java support library for the Google Visualisation API open wire protocol/query language, you can see them trying to get their hooks into other people’s data. (The thing is, the query language is just so darned useful;-) Wolfram Alpha recently opened up their computational search over a wide range of curated data sets, and Yahoo? They’re trying to encourage people to make glue, I think, with YQL, YQL Execute and YQL Open Data Tables.

In the UK, we have the National Statistics website (I’m not even going to link to it, it’s that horrible..) as well as a scattered collection of resources as listed on the Rewired State: APIs wiki page; and, of course, the first steps of a news media curated datastore from the Guardian.

But maybe things are set to change? In a post on the Cabinet Office Digital Engagement blog, Information and how to make it useful, Richard (Stirling?) picks up on Recommendation 14 of the POIT (Power of Information Taskforce) Review Final Report, which states:

Recommendation 14
The government should ensure that public information data sets are easy to find and use. The government should create a place or places online where public information can be stored and maintained (a ‘repository’) or its location and characteristics listed (an online catalogue). Prototypes should be running in 2009.

and proposes starting a conversation about “a UK version of data.gov”:

What characteristics would be most useful to you – feeds (ATOM or RSS) or bulk download by e.g. FTP, etc?
Should this be an index or a repository?
Should this serve particular types of data e.g. XML, JSON or RDF?
What examples should we be looking at (beyond data.gov e.g. http://ideas.welcomebackstage.com/data)?
Does this need its own domain, or should it sit on an existing supersite (e.g. http://direct.gov.uk)?

I posted my starter-for-10 thoughts as a comment to that post (currently either spamtrapped, or laughed out of court), but there’s already some interesting discussion started there, as well as a thoughtful response on Steph Gray’s Helpful Technology blog (Cui bono? The problem with opening up data) which picks up on “some more fundamental problems than whether we publish the data in JSON or RSS” such as:

– Which data?
– Who decides whether to publish?
– Who benefits?
– Who pays?
– For how long?

My own stance is from a purely playful, and maybe even a little pragmatic, position: so what?

There are quite a few ways of interpreting this question of course, but the direction I’ll come at it (in this post at least) is in terms of use by people whose job it isn’t…

Someone like me… so a population of one, then… ;-)

So what do I know? I know how to cut and paste URLs into things, and I know how to copy other people’s code and spot what bits I need to change so that it does “stuff with my stuff”.

I know that I can import CSV and Excel spreadsheets that are hosted online from their URL into Google spreadsheets, and from a URL as CSV into something like Dabble DB (which also lets me essentially merge data from two sources into a new data table). Yahoo Pipes also consumes CSV. I know that I can get CSV out of a Google spreadsheet or Dabble DB (or from a Yahoo pipe if CSV went in). I know that I can plot KML or GeoRSS files on a Google map simply by pasting the URL into a Google map search box. I know I can get simple XML into a Google spreadsheet, and more general XML into a Yahoo Pipe. I know that YQL will also let me interrogate XML files and emit the results as XML or JSON. Pipes is good at emitting JSON too. (JSON is handy because you can pull it into a web page without requiring any help from a script running on a server.) I’ve recently discovered that the Google Visualisation API query language and open wire protocol lets me run queries on datastores that support it, such as Google spreadsheets and Pachube. I know that Many Eyes Wikified will ingest CSV and then allow me to easily create a set of interactive visualisations.
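To give a flavour of that Google Visualisation API point, here’s a rough sketch of how a query might be run against a published Google spreadsheet from a few lines of Python. (The spreadsheet key is made up, and the query endpoint pattern may well have changed since this was written, so treat it as illustrative rather than gospel.)

import urllib.parse, urllib.request

SPREADSHEET_KEY = "ABC123"  # hypothetical key of a published spreadsheet
QUERY = "select A, B where C > 1000 order by C desc"

# The "tq" endpoint takes a URL-encoded query plus an output format hint (csv, html, json)
url = ("http://spreadsheets.google.com/tq?key=" + SPREADSHEET_KEY
       + "&tqx=out:csv&tq=" + urllib.parse.quote(QUERY))

print(urllib.request.urlopen(url).read().decode("utf-8"))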

So what would I want from a UK version of data.gov, and why?

– CSV, XML and JSON output, with KML/GeoRSS where appropriate, keyed by a simple URI term;
– a sensible (i.e. a readable, hackable) URI pattern for extracting data: good examples are the BBC Programmes website and Google spreadsheets (e.g. where you can specify cell ranges) – a hypothetical sketch of what I mean follows this list;
– data available from a URI via an HTTP GET (not POST; GETable resources are easily pulled into other services, POST requested ones aren’t; don’t even think about SOAP;-);
– if possible, being able to query data or extract subsets of it: YQL and the Google Viz API query language show a possible way forward here. Supporting the Google open-wire protocol, or defining YQL open data tables for data sets brings the data into an environment where it can be interrogated or subsetted. (Pulling cell ranges from spreadsheets is only useful where the cells you want are contiguous.)
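By way of illustration (these URIs are entirely made up – they don’t point anywhere), the sort of pattern I have in mind is one a person can read, guess at, and tweak by hand:

http://data.example.gov.uk/dataset/school-spending/2009.csv
http://data.example.gov.uk/dataset/school-spending/2009.json
http://data.example.gov.uk/dataset/school-spending/2009/region/south-east.csv

Swap the year, the format suffix or the region slug and you get a different, predictable slice of the data back over a plain HTTP GET.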

Although it pains me to suggest hooking into yet more of the Googleverse, a UK version of data.gov could do worse than support the Google visualization API open-wire protocol. Why? Well, for example, with only an hour or two’s coding, I was able to pull together a site that added a front end on to the Guardian datastore files on Google spreadsheets: First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground (Okay, okay, I know – it shows that I only spent a couple of hours on it… but it was enough to demonstrate a sort of working rapid prototype…;-)

As to whether the data is useful, or who’s going to use it, or why they’re going to use it, I don’t know: but I suspect that if it isn’t easy to use, then people won’t. If one of the aims of data.gov style approaches is to engage people in conversations with data, we need to make it easy for them. Essentially, we want people to engage in – not quite ‘enterprise mashups’, more civic mashups. I’m not sure who these people are likely to be – activists, policy wonks, journalists, concerned citizens, academics, students – but they’re probably not qualified statisticians with a black belt in R or SPSS.

So for example, even the Guardian datastore data is quite hard to play with for most people (it’s just a set of spreadsheets, right? So what can I actually do with them?). In contrast, the New York Times Visualization Lab folks have started looking at making it easier for readers to interrogate the data in a visual way with Many Eyes Wikified, which is one reason I started trying to think about what a query’n’visualisation API to the Guardian datastore might look like…

PS just in case the Linked Data folks feel left out, I still think RDF and semweb geekery is way too confusing for mortals. Things like SPARCool are starting to help, but IMHO the syntax is still way too quirky for a quick hit… SQL and SQL-like languages are hard enough, especially when you bear in mind that most people don’t know (or care) that advanced search exists on web search engines, let alone what it does or how to use it.

PPS see also National Research Council Canada: Gateway to Scientific Data (via Lorcan Dempsey).

Manchester Digital Presentation – Open Council Data

I spent much of yesterday trying to come up with some sort of storyline for a presentation I’m giving at a Manchester Digital Open Data: Concept & Practice event tomorrow evening on Open Civic Data, quite thankful that I’d bought myself a bit of slack with the original title: “Open Data Surprise”…

Anyway, a draft of the slides is here (Manchester opendata presentation):

though as ever, they don’t necessarily give the full picture without me talking over them…

The gist of the presentation is basically as follows: there is an increasing number of local council websites out there making data available. By data, I mean stuff that developers, or tinkerers such as myself, can wrangle with, or council officers actually use as part of their job. The data that’s provided can be thought of along a spectrum ranging from fixed archival data (i.e. reports, things that aren’t going to change) to timely data, such as dates of forthcoming council elections, or details of current or planned roadworks. Somewhere in-between is data that changes on a regular cycle, such as the list of councillors, for example. The most timely of all data is live data, such as bus locations, or their estimated arrival time at a particular bus stop.

Lots of councils are starting to offer “data” via maps. What they are actually doing is providing information in a natural way, which is not a Bad Thing, although it’s not a Data Thing – if I wanted to create my own map view of the data, it’s generally not easy for me to do so. Providing data files that contain latitude and longitude is one way that councils can make the data available to users, but there is then a barrier to entry in terms of who can realistically make use of that data. Publishing geo-content in the KML format is one way we can improve this, because tools such as Google Earth provide a ready way of rendering KML feeds that is accessible to many more users.
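(For anyone who hasn’t peeked inside one, a KML file is just a small XML document; here’s a minimal, made-up example – the name and coordinates are purely illustrative – that Google Earth or a Google Map will happily render as a single pin.)

<?xml version="1.0" encoding="UTF-8"?>
<kml xmlns="http://www.opengis.net/kml/2.2">
  <Placemark>
    <name>Town Hall</name>
    <description>A hypothetical council point of interest</description>
    <Point>
      <!-- KML coordinates are longitude,latitude,altitude -->
      <coordinates>-2.2446,53.4793,0</coordinates>
    </Point>
  </Placemark>
</kml>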

As a pragmatist, I believe that most people who use data do so in the context of a spreadsheet. This suggests that we need to make data available in that format, notwithstanding the problems that might arise from the difficulties of keeping an audit trail of the origins of that data in that format once files become merged. As a realist, I appreciate that most people don’t know that it’s possible to visualise their data, or what sorts of insight might be afforded by visualising the data. Nor do they know how it is becoming increasingly easy to create visualisations on top of data presented in an appropriate way. Just as using KML formats allows almost anyone to create their own Google Map by pasting a KML file URL into a Google maps search box, so the use of simple data formats such as CSV allows users to pull data into visualisation environments such as IBM’s Many Eyes. (For more on this line of thinking, see Programming, Not Coding: Infoskills for Journalists (and Librarians..?!;-)). As a data junkie, I think that the data should be “linkable” across data sets, and also queryable. As a contrarian, I think that Linked Data is maybe not the way forward at this time, at least in terms of consumer end-user evangelism/advocacy… (see also: So What Is It About Linked Data that Makes it Linked Data™?)

The data that is starting to be published by many councils typically corresponds to the various functions of the council – finance, education, transport, planning, cultural and leisure services, for example. Through publishing this data in an open way, third party vertical providers can both aggregate data as information across councils, as well as adding contextual value. Some councils are entering into partnership with other councils to develop vertical services to which the council can publish its data, before pulling it back into the council’s own website via a data feed. And as to whose data it is anyway, it might be ours, but it’s also theirs: data as the business of government. Which makes me think: the most effective council data stores will be the ones that are used by councils as data consumers in their own right, rather than just as data publishers*.

(* There is a parallel here with open educational resources, I think? Institutions that make effective use of OERs are institutions that use at least their own OERs, as well as publishing them…?)

Recent communications from Downing Street suggest the new coalition government is serious in its aim to open up public data (though as @lesteph points out, this move towards radical transparency is not without its own attendant risks), so data releases are going to happen. The question is, are we going to help that data flow so that it can get to where it needs to go?

Using CSV Docs As a Database

Earlier today, my twinterwebs were full of the story about the COI announcing a reduction in the number of government websites:

Multiple sources of news... not...

(They told the story differently, at least…)

A little while later(?), a press release appeared on the COI website: Clamp down on Government websites to save millions, although by that time, via @lesteph, I’d found a copy of the interim report and a CSV dataset Reporting on progress: Central Government websites 2009-10.

So what can we do with that CSV file? How about turning it into a database, simply by uploading it to Google docs? Or importing it live (essentially, synching a spreadsheet with it) from the source URL on the COI website?
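(For the live import route, the importData spreadsheet formula does the job: point it at the CSV’s URL from a cell of a new Google spreadsheet and the sheet will populate itself from the remote file and keep itself more or less in sync with it. The URL below is a placeholder rather than the actual COI address.)

=ImportData("http://example.gov.uk/path/to/website-report.csv")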

Import CSV into Google docs

I had a quick play pruning the code on my Guardian Datastore explorer, and put together this clunky query tool that lets you explore the COI website data as if it was a database.

COI Website review CSV data explorer

The explorer allows you to bookmark views over the data, to a limited extent (the ascending/descending views aren’t implemented:-(, so for example, we can see:

websites with a non-zero “Very Poor Editorial” score
Age profile of visitors (where available)
Costs

(Feel free to share bookmarks to other views over the data in the comments to this post.)

Note that the displayed results table is an active one, so you can click on column headings to order the results by column values.

Sorting a table by column

Note that there seem to be issues with columns not being recognised as containing numerical data (maybe something to do in part with empty cells in a column?), which means the chart views don’t work, but this page is not trying to do anything clever – it’s just a minimal query interface over the visualisation API from a spreadsheet. (To build a proper explorer for this dataset, we’d check the column data types were correct, and so on.)

Looking at the app, I think it’d probably be useful to display a “human readable” version of the query too, one that translates column identifiers to column headings for example, but that’s for another day…

GOTCHAS: use single quote (‘) rather than double quote (“) in the WHERE statements.
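So a filtered view over the data gets written along these lines (the column letters and the value being matched here are made up for illustration, not taken from the actual COI spreadsheet):

select A,B where C > 0 and D = 'central government'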

Related: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language
Using Google Spreadsheets Like a Database – The QUERY Formula

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI standards on guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
COI web report 2009-10 table 1

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
COI web report 2009-10 table 2

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings

and:

COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went into making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived (“According to the Guardian/Times/Telegraph/FT…”, etc.). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how, by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

[For more background to using Google spreadsheets as a database, see: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language (via an API) and Using Google Spreadsheets Like a Database – The QUERY Formula (within Google spreadsheets itself)]

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET
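(Behind the scenes, a query like that is just URL-encoded and passed to the spreadsheet’s query endpoint along with an output-format hint – something of this shape, where the key is a placeholder and the exact endpoint pattern may have changed since:)

http://spreadsheets.google.com/tq?key=SPREADSHEET_KEY&tqx=out:html&tq=select%20A%2CB%2CEL%2BEN%2BEP%2BER%2BET

Swap out:html for out:csv and the same query comes back as a CSV file instead.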

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries, in a bookmarkable form (e.g. using the COI website data explorer – see also this for more hints), please feel free to share the links in the comments below :-)

Principles for, and Practicalities of, Open Public Data

Following the first meeting of the Public Sector Transparency Board last week, which is tasked with “driv[ing] forward the Government’s transparency agenda, making it a core part of all government business”, a set of 11 draft public data principles has been posted for comment on the data.gov.uk wiki: Draft Public Data Principles [wiki version]

Following the finest Linked Data principles, each draft principle has its own unique URI… err, only it doesn’t… ;-) [here’s how they might look on WriteToReply – WTR: Draft Public Data Principles – with unique URLs for each principle;-)]

The principles broadly state that users have a right to open public data, and that data should be published in ways that make it usable and useful (so machine readable, not restrictively licensed, easily discoverable and standards based, timely and fine grained). In addition, data underlying government websites will be made available (as for example in the case of the DirectGov Syndication API?). Other public bodies will be encouraged to publish inventories of their data holdings and make them available for reuse.

A separate blog post on the data.gov.uk blog describes some of the practical difficulties that local government offices might face when opening up their data: Publishing Local Open Data – Important Lessons from the Open Election Data project (Again, unique URLs for individual lessons are unavailable, but here’s an example of how we might automatically generate identifiers for them;-) WTR: Lessons Learned from the Open Election Data Project). The lessons learned include a lack of corporate awareness about open data issues (presumably there is a little more awareness since the PM’s letter to councils on opening up data), a lack of web skills and web publishing resources, not to mention a very limited understanding of, and tools available for the handling of, Linked Data.

As to what data might be opened up, particularly by local councils, Paul Clarke identifies several different classes of data (There’s data, and there’s data):

historical data;
planning data;
infrastructural data;
operational data.

My own take on it can be seen here:


(An updated version of this presentation, with full annotations, should be available in a week or two!)

Looking around elsewhere, Local government data: lessons from London suggests:

– “don’t start hiring big, expensive consultancy firms for advice”;
– “do draw on the expertise and learning already there”;
– “do remember that putting the data out there of itself is not enough – it must be predicated on a model of engagement.”

Picking up on that last point, the debate regarding the “usefulness” of data has a long way to run, I think? Whilst I would advocate lowering barriers to entry (which means making data discoverable, queryable (so particular views over it can be expressed), and available in “everyday” CSV and Excel spreadsheet formats), there is also the danger that if we require publishers to demonstrate that their data is useful before releasing it, this will deter them from opening up data they don’t perceive to be useful. In this respect, the Draft Public Data Principle that states:

Public data policy and practice will be clearly driven by the public and businesses who want and use the data, including what data is released when and in what form – and in addition to the legal Right To Data itself this overriding principle should apply to the implementation of all the other principles.

should help ensure that an element of “pull” can be used to ensure the release of data that others know how to make useful, or need to make something else useful…

On the “usefulness” front, it’s also worth checking out Ingrid Koehler’s post Sometimes you have to make useful yourself, which mentions existence value and accountability value as well as value arising from “meta” operations such as the ability to compare data across organisations operating in similar areas (such as local councils, or wards within a particular council).

For my own part, I’ve recently started looking at ways in which we can generate transparency in reporting and policy development by linking summary statistics back to the original data (e.g. So Where Do the Numbers in Government Reports Come From?), a point also raised in Open data requires responsible reporting… and the comments that follow it. Providing infrastructure that supports this linkage between summary reported data and the formulas used to generate those data summaries is something that I think would help make open data more useful for transparency purposes, although it sits at a higher level than the principles governing the straightforward publication and release of open public data.

See also:
Ben Goldacre on when the data just isn’t enough without the working…
Publishing itemised local authority expenditure – advice for comment

UK Open Data Guidance Resources

This is a live post where I will try to collect together advice relating to the release and use of open public data in the UK, as much for my own reference as anything… (I guess this really should be a wiki page somewhere…?)

Government Spending Data Explorer

So… the UK Gov started publishing spending data for at least those transactions over £25,000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer

The app is a minor reworking of my Guardian datastore explorer, which put a simple query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and used this feed to populate a spreadsheet selector:

Guardian datastore selector - gov spending data

When you select a spreadsheet, you can preview the column headings:

Datastore explorer - preview

Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:

Education spend - over 100 million
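(Under the hood, what the form builds up is just another Visualisation API query; for a view like the one above it would be something of this shape – the column letter and the threshold are illustrative, since they depend on the layout of the particular departmental spreadsheet:)

select * where D > 100000000 order by D desc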

The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:

Datastore explorer - build a query

You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.

Several other example queries are given at the bottom of the data explorer page.

For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:

Education spend - over 250 million

Here’s how we construct the query:

Education - query spend over 250 million

If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why it’s interesting. I’ll try to add interesting examples to the app’s list of example queries.

Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)