So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

COI web report 2009-10 table 1

COI web report 2009-10 table 2

COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report publishes fragments of the released dataset as information, in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings


COI website report costs

There are also columns headed SEO/SIO and HEO, for example, which may or may not relate to costs. (To see all the headings, see the CSV doc on Google spreadsheets.)

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, have little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context-free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed cost £11.78 per visit, while cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went into making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived (“According to the Guardian/Times/Telegraph/FT…”, etc.). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how, by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

[For more background to using Google spreadsheets as a database, see: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language (via an API) and Using Google Spreadsheets Like a Database – The QUERY Formula (within Google spreadsheets itself)]

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET
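As a sketch of what that column-summing query amounts to, here is the same calculation done locally in Python. The column headings and figures below are invented stand-ins for the real COI cost columns (EL, EN, EP and so on); the point is just that the summary figure is reproducible from the raw rows, empty cells and all:

```python
import csv
import io

# Invented sample mirroring the shape of the COI dataset; the real
# column letters map on to named cost headings in the released CSV.
sample = io.StringIO(
    "Department,Website,Strategy costs,Design costs,Hosting costs\n"
    "Dept A,example.gov.uk,1000,2500,400\n"
    "Dept B,another.gov.uk,,1200,300\n"  # note the empty cell
)

cost_columns = ["Strategy costs", "Design costs", "Hosting costs"]

totals = {}
for row in csv.DictReader(sample):
    # Treat empty cells as zero rather than letting the sum fail
    totals[row["Website"]] = sum(float(row[c]) for c in cost_columns if row[c])

print(totals)
```

Anyone with the CSV can re-run this and check the summed figure against the one quoted in a report.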

In short, it seems to me that releasing the data as data is a good start, but the promise of transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (a journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports, to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer; see also this for more hints), please feel free to share the links in the comments below :-)

Using CSV Docs As a Database

Earlier today, my twinterwebs were full of the story about the COI announcing a reduction in the number of government websites:

Multiple sources of news... not...

(The told the story differently, at least…)

A little while later(?), a press release appeared on the COI website: Clamp down on Government websites to save millions, although by that time, via @lesteph, I’d found a copy of the interim report and a CSV dataset Reporting on progress: Central Government websites 2009-10.

So what can we do with that CSV file? How about turning it into a database, simply by uploading it to Google docs? Or importing it live (essentially, synching a spreadsheet with it) from the source URL on the COI website?
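For a sense of what “turning a CSV file into a database” means in practice, here is a hedged sketch using Python’s built-in sqlite3 module rather than Google docs (the departments, sites and visit figures are invented):

```python
import csv
import io
import sqlite3

# Invented stand-in for the downloaded COI CSV file
raw = io.StringIO(
    "dept,site,visits\n"
    "Dept A,example.gov.uk,120000\n"
    "Dept B,another.gov.uk,4500\n"
)

rows = list(csv.reader(raw))
header, data = rows[0], rows[1:]

# Load the rows into an in-memory SQLite table...
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE coi (dept TEXT, site TEXT, visits INTEGER)")
conn.executemany("INSERT INTO coi VALUES (?, ?, ?)", data)

# ...and the CSV now behaves like a database table we can query
busy = conn.execute(
    "SELECT site, visits FROM coi WHERE visits > 10000"
).fetchall()
print(busy)
```

Uploading to Google docs gives you much the same thing, with the query language served over the web instead of locally.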

Import CSV into Google docs

I had a quick play pruning the code on my Guardian Datastore explorer, and put together this clunky query tool that lets you explore the COI website data as if it was a database.

COI Website review CSV data explorer

The explorer allows you to bookmark views over the data, to a limited extent (the ascending/descending views aren’t implemented :-(), so for example, we can see:

websites with a non-zero “Very Poor Editorial” score
Age profile of visitors (where available)

(Feel free to share bookmarks to other views over the data in the comments to this post.)

Note that the displayed results table is an active one, so you can click on column headings to order the results by column values.

Sorting a table by column

Note that there seem to be issues with columns not being recognised as containing numerical data (maybe something to do in part with empty cells in a column?), which means the chart views don’t work, but this page is not trying to do anything clever – it’s just a minimal query interface over the visualisation API from a spreadsheet. (To build a proper explorer for this dataset, we’d check the column data types were correct, and so on.)

Looking at the app, I think it’d probably be useful to display a “human readable” version of the query too, one that translates column identifiers to column headings for example, but that’s for another day…

GOTCHAS: use single quote (‘) rather than double quote (“) in the WHERE statements.
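By way of illustration of that gotcha (the spreadsheet key below is a made-up placeholder, and the old-style /tq endpoint is assumed), here is how a single-quoted WHERE clause survives URL-encoding in a visualisation API query URL:

```python
from urllib.parse import urlencode

# The gviz query language wants single-quoted string literals;
# double quotes inside the tq parameter break the parser.
query = "select A,B where C = 'Yes'"
params = urlencode({"tq": query})

# SPREADSHEET_KEY is a placeholder, not a real document key
url = "https://spreadsheets.google.com/tq?key=SPREADSHEET_KEY&" + params
print(url)
```

The single quotes come out as %27 in the encoded URL, which the query parser is happy with.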

Related: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language
Using Google Spreadsheets Like a Database – The QUERY Formula

Manchester Digital Presentation – Open Council Data

I spent much of yesterday trying to come up with some sort of storyline for a presentation I’m giving at a Manchester Digital Open Data: Concept & Practice event tomorrow evening on Open Civic Data, quite thankful that I’d bought myself a bit of slack with the original title: “Open Data Surprise”…

Anyway, a draft of the slides is here (Manchester opendata presentation):

though as ever, they don’t necessarily give the full picture without me talking over them…

The gist of the presentation is basically as follows: there is an increasing number of local council websites out there making data available. By data, I mean stuff that developers, or tinkerers such as myself, can wrangle with, or council officers actually use as part of their job. The data that’s provided can be thought of along a spectrum ranging from fixed archival data (i.e. reports, things that aren’t going to change) to timely data, such as dates of forthcoming council elections, or details of current or planned roadworks. Somewhere in-between is data that changes on a regular cycle, such as the list of councillors, for example. The most timely of all data is live data, such as bus locations, or their estimated arrival time at a particular bus stop.

Lots of councils are starting to offer “data” via maps. What they are actually doing is providing information in a natural way, which is not a Bad Thing, although it’s not a Data Thing – if I wanted to create my own map view of the data, it’s generally not easy for me to do so. Providing data files that contain latitude and longitude is one way that councils can make the data available to users, but there is then a barrier to entry in terms of who can realistically make use of that data. Publishing geo-content in the KML format is one way we can improve this, because tools such as Google Earth provide a ready way of rendering KML feeds that is accessible to many more users.
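To show how low the barrier is, here is a minimal sketch (with invented place names and coordinates) of generating a KML file from latitude/longitude rows; note that KML’s coordinates element wants longitude first:

```python
# Invented (name, lat, lng) rows standing in for a council data file
placemarks = [
    ("Town Hall", 53.4794, -2.2453),
    ("Library", 53.4780, -2.2446),
]

# KML coordinates are lon,lat order, the reverse of the usual lat,lng
kml_items = "".join(
    "<Placemark><name>{}</name><Point><coordinates>{},{}</coordinates>"
    "</Point></Placemark>".format(name, lng, lat)
    for name, lat, lng in placemarks
)

kml = (
    '<?xml version="1.0" encoding="UTF-8"?>'
    '<kml xmlns="http://www.opengis.net/kml/2.2"><Document>'
    + kml_items
    + "</Document></kml>"
)
print(kml)
```

Host the resulting file at a URL and pasting that URL into the Google Maps search box renders it as a map.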

As a pragmatist, I believe that most people who use data do so in the context of a spreadsheet. This suggests that we need to make data available in that format, notwithstanding the problems that might arise from the difficulties of keeping an audit trail of the origins of that data in that format once files become merged. As a realist, I appreciate that most people don’t know that it’s possible to visualise their data, or what sorts of insight might be afforded by visualising the data. Nor do they know how it is becoming increasingly easy to create visualisations on top of data presented in an appropriate way. Just as using KML formats allows almost anyone to create their own Google Map by pasting a KML file URL into a Google Maps search box, so the use of simple data formats such as CSV allows users to pull data into visualisation environments such as IBM’s Many Eyes. (For more on this line of thinking, see Programming, Not Coding: Infoskills for Journalists (and Librarians..?!;-)). As a data junkie, I think that the data should be “linkable” across data sets, and also queryable. As a contrarian, I think that Linked Data is maybe not the way forward at this time, at least in terms of consumer end-user evangelism/advocacy… (see also: So What Is It About Linked Data that Makes it Linked Data™?)

The data that is starting to be published by many councils typically corresponds to the various functions of the council – finance, education, transport, planning, cultural and leisure services, for example. Through publishing this data in an open way, third party vertical providers can both aggregate data as information across councils, as well as adding contextual value. Some councils are entering into partnership with other councils to develop vertical services to which the council can publish its data, before pulling it back into the council’s own website via a data feed. And as to whose data it is anyway, it might be ours, but it’s also theirs: data as the business of government. Which makes me think: the most effective council data stores will be the ones that are used by councils as data consumers in their own right, rather than just as data publishers*.

(* There is a corollary here with open educational resources, I think? Institutions that make effective use of OERs are institutions who use at least their own OERs, as well as publishing them…?)

Recent communications from Downing Street suggest the new coalition government is serious in its aim to open up public data (though as @lesteph points out, this move towards radical transparency is not without its own attendant risks), so data releases are going to happen. The question is, are we going to help that data flow so that it can get to where it needs to go?

Plug’n’Play Public Data

Whilst listening to Radio 4’s Today programme this morning, I was pleasantly surprised to hear an interview with Hans Rosling about making stats’n’data relevant to Joe Public (you can find the interview, along with a video overview of the Gapminder software, here: Can statistics be beautiful?).

The last few weeks have seen the US Government getting into the data transparency business with the launch of whose purpose is “to increase public access to high value, machine readable datasets generated by the Executive Branch of the Federal Government”. The site offers access to a wide range of US Government datasets in a range of formats – XML, CSV, KML etc. (The site also gives links to widgets and other (parent?) sites that expose data.)

Providing URIs directly to CSV files, for example, means that it is trivial to pull the data into online spreadsheets/databases, such as Google spreadsheets, or Dabble DB, or visualisation tools such as Many Eyes Wikified; and for smaller files, Yahoo Pipes provides a way of converting CSV or XML files to JSON that can be easily pulled in to a web page.
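The CSV-to-JSON conversion that Pipes performs is itself trivial: each row becomes an object keyed by the header row. A sketch in Python, with invented sample data:

```python
import csv
import io
import json

# Invented sample standing in for a government CSV file fetched by URL
raw = io.StringIO(
    "site,visits\n"
    "example.gov.uk,120000\n"
)

# Each CSV row becomes a {header: value} object...
records = list(csv.DictReader(raw))

# ...and the list of objects serialises straight to JSON
payload = json.dumps(records)
print(payload)
```

That JSON payload is what makes the data easy to pull into a web page without any server-side help.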

Realising that there may be some business in public data, Microsoft, Amazon and Google have all been sniffing around this area too: for Microsoft, it’s the Open Government Data Initiative (OGDI), for Amazon, it’s big data via Public Datasets on AWS, and for Google… well, Google. They bought Rosling’s Trendalyzer, of course, and recently made a brief announcement about Public Data on Google, as well as Google Squared, which has yet to be seen in public. With the publication of a Java support library for the Google Visualisation API open wire protocol/query language, you can see them trying to get their hooks into other people’s data. (The thing is, the query language is just so darned useful;-) Wolfram Alpha recently opened up their computational search over a wide range of curated data sets, and Yahoo? They’re trying to encourage people to make glue, I think, with YQL, YQL Execute and YQL Open Data Tables.

In the UK, we have the National Statistics website (I’m not even going to link to it, it’s that horrible..) as well as a scattered collection of resources as listed on the Rewired State: APIs wiki page; and, of course, the first steps of a news media curated datastore from the Guardian.

But maybe things are set to change? In a post on the Cabinet Office Digital Engagement blog, Information and how to make it useful, Richard (Stirling?) picks up on Recommendation 14 of the POIT (Power of Information Taskforce) Review Final Report, which states:

Recommendation 14
The government should ensure that public information data sets are easy to find and use. The government should create a place or places online where public information can be stored and maintained (a ‘repository‘) or its location and characteristics listed (an online catalogue). Prototypes should be running in 2009.

and proposes starting a conversation about “a UK version of”:

What characteristics would be most useful to you – feeds (ATOM or RSS) or bulk download by e.g. FTP, etc?
Should this be an index or a repository?
Should this serve particular types of data e.g. XML, JSON or RDF?
What examples should we be looking at (beyond e.g.
Does this need its own domain, or should it sit on an existing supersite (e.g.

I posted my starter for 10 thoughts as a comment to that post (currently either spamtrapped, or laughed out of court), but there’s already some interesting discussion started there, as well as a thoughtful response on Steph Gray’s Helpful Technology blog (Cui bono? The problem with opening up data) which picks up on “some more fundamental problems than whether we publish the data in JSON or RSS” such as:

– Which data?
– Who decides whether to publish?
– Who benefits?
– Who pays?
– For how long?

My own stance is from a purely playful, and maybe even a little pragmatic, position: so what?

There are quite a few ways of interpreting this question of course, but the direction I’ll come at it (in this post at least) is in terms of use by people whose job it isn’t…

Someone like me… so a population of one, then… ;-)

So what do I know? I know how to cut and paste URLs in to things, and I know how to copy other peoples’ code and spot what bits I need to change so that it does “stuff with my stuff”.

I know that I can import CSV and Excel spreadsheets that are hosted online from their URL into Google spreadsheets, and from a URL as CSV into something like Dabble DB (which also lets me essentially merge data from two sources into a new data table). Yahoo Pipes also consumes CSV. I know that I can get CSV out of a Google spreadsheet or Dabble DB (or from a Yahoo pipe if CSV went in). I know that I can plot KML or geoRSS files on a Google map simply by pasting the URL into a Google map search box. I know I can get simple XML into a Google spreadsheet, and more general XML into a Yahoo Pipe. I know that YQL will also let me interrogate XML files and emit the results as XML or JSON. Pipes is good at emitting JSON too. (JSON is handy because you can pull it into a web page without requiring any help from a script running on a server.) I’ve recently discovered that the Google Visualisation API query language and open wire protocol lets me run queries on datastores that support it, such as Google spreadsheets and Pachube. I know that Many Eyes Wikified will ingest CSV and then allow me to easily create a set of interactive visualisations.

So what would I want from a UK version of, and why?

– CSV, XML and JSON output, with KML/GeoRSS where appropriate, keyed by a simple URI term;
– a sensible (i.e. a readable, hackable) URI pattern for extracting data: good examples are the BBC Programmes website and Google spreadsheets (e.g. where you can specify cell ranges);
– data available from a URI via an HTTP GET (not POST; GETable resources are easily pulled into other services, POST requested ones aren’t; don’t even think about SOAP;-);
– if possible, being able to query data or extract subsets of it: YQL and the Google Viz API query language show a possible way forward here. Supporting the Google open-wire protocol, or defining YQL open data tables for data sets brings the data into an environment where it can be interrogated or subsetted. (Pulling cell ranges from spreadsheets is only useful where the cells you want are contiguous.)
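To make that last point concrete, here is a sketch of the kind of query URL the Google Viz open-wire protocol supports (the dataset key and column letters below are placeholders, not real identifiers): a caller selects columns, filters rows, aggregates, and asks for the output format it wants, all in the URL.

```python
from urllib.parse import urlencode

# Placeholder endpoint shape for a gviz-style data service
base = "https://spreadsheets.google.com/tq"

# Select, filter and aggregate in one hackable, bookmarkable URL;
# column letters and the 'Local' filter value are invented examples
tq = "select A, sum(E) where B = 'Local' group by A"

url = base + "?" + urlencode({
    "key": "DATASET_KEY",  # placeholder dataset identifier
    "tq": tq,
    "tqx": "out:csv",      # ask for CSV back rather than the JSON wire format
})
print(url)
```

A URL like this is exactly the sort of thing a report could cite to show how a summary figure was derived.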

Although it pains me to suggest hooking into yet more of the Googleverse, a UK version of could do worse than support the Google visualization API open-wire protocol. Why? Well, for example, with only an hour or two’s coding, I was able to pull together a site that added a front end on to the Guardian datastore files on Google spreadsheets: First Steps Towards a Generic Google Spreadsheets Query Tool, or At Least, A Guardian Datastore Interactive Playground (Okay, okay, I know – it shows that I only spent a couple of hours on it… but it was enough to demonstrate a sort of working rapid prototype…;-)

As to whether the data is useful, or who’s going to use it, or why they’re going to use it, I don’t know: but I suspect that if it isn’t easy to use, then people won’t. If one of the aims of style approaches is to engage people in conversations with data, we need to make it easy for them. Essentially, we want people to engage in – not quite ‘enterprise mashups’, more civic mashups. I’m not sure who these people are likely to be – activists, policy wonks, journalists, concerned citizens, academics, students – but they’re probably not qualified statisticians with a blackbelt in R or SPSS.

So for example, even the Guardian datastore data is quite hard to play with for most people (it’s just a set of spreadsheets, right? So what can I actually do with them?). In contrast, the New York Times Visualization Lab folks have started looking at making it easier for readers to interrogate the data in a visual way with Many Eyes Wikified, which is one reason I started trying to think about what a query’n’visualisation API to the Guardian datastore might look like…

PS just in case the Linked Data folks feel left out, I still think RDF and semweb geekery is way too confusing for mortals. Things like SPARCool are starting to help, but IMHO the syntax is still way too quirky for a quick hit… SQL and SQL-like languages are hard enough, especially when you bear in mind that most people don’t know (or care) that advanced search exists on web search engines, let alone what it does or how to use it.

PPS see also National Research Council Canada: Gateway to Scientific Data (via Lorcan Dempsey).