OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI standards on guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]) , a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
COI web report 2009-10 table 1

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
COI web report 2009-10 table 2

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is a still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings

and:

COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context free way, (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbiests or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went in to making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)

[Just an aside on summary claims made by, or "discovered" by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived. ("According to the Guardian/Times/Telegraph/FT, etc etc etc". To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

[For more background to using Google spreadsheets as a database, see: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language (via an API) and Using Google Spreadsheets Like a Database – The QUERY Formula (within Google spreadsheets itself)]

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lays in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries, in a bookmarkable form (e.g. using the COI website data explorer (see also this for more hints), please feel free to share the links in the comments below :-)

Written by Tony Hirst

June 28, 2010 at 9:22 am

10 Responses

Subscribe to comments with RSS.

  1. SIO, SEO and HEO are references to (permanent) staff grades.

    Simon Dickson

    June 28, 2010 at 10:29 am

  2. [...] So Where Do the Numbers in Government Reports Come From? « OUseful.Info, the blog… [Just an aside on summary claims made by, or "discovered" by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived. ("According to the Guardian/Times/Telegraph/FT, etc etc etc". To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.] [...]

  3. Hey Tony

    I’ve been wondering about the need for a ‘Show Your Working’ convention for open data use – building on the need to at very least note the source (http://www.timdavies.org.uk/2010/06/04/open-data-requires-responsible-reporting/).

    I wonder if some small ‘ShowYourWorking’ widget might be handy – something that people using data could easily embed / link to in any case of open data use and which would note:

    * Which datasets were used;
    * What transformations have been applied to them;

    Given most people are loathe to spend too much time documenting the process they’ve just been through using data – something that made that easy – and easy to interpret for the end user, could be a useful tool…

    Will try and write this up as a more developed idea if time allows…

    Tim

    Tim Davies

    June 28, 2010 at 4:37 pm

  4. Thx for the info, @simon…

    @tim Your post hits the spot for me… One thing I’d love to see for the data.gov.uk site would be an area to bookmark working SPARQL queries over the Linked Data datastores so that folk can: a) start sharing queries and bootstrapping the writing of new queires by example/modding previous queries; b) start archiving and sharing queries that pull on the data stores to generate data that appears in reports and policy docs (is SPARQL expressive enough to allow things like different numerical items to be combined arithmetically, as the Google QL is (shown above)?)

    I started doodling some ideas around this some time ago – http://blog.ouseful.info/2009/12/09/programming-pipes-with-delicious-and-sharing-data-gov-uk-sparql-queries-as-a-result/ – but that approach was very clunky. What would be better would be something like a combination of posterous, stackoverflow, gist.github and codepad/scriptlets, that is a site that allows you to:
    a) share a query
    b) with an endpoint
    c) preview the result
    d) write a description about the query and the sort of results it obtains
    e) get a URL for the query page
    f) get a URL for the results of the query as RDF, json, xml, csv etc
    g) clone and edit the query, linking it back to the parent

    In addition, users should also be able to ask a question about how to acquire a certain dataset, or run a particular query, and get a worked example, as described above, offered to them as a response/answer to their question.

    I did wonder whether a Stack Overflow site (e.g. http://area51.stackexchange.com/proposals/10408/linked-data-queries ) might be minimally useful in that regard, but I think the ability to run, preview and edit (or clone and edit) queries is where the power would come in?

    Tony Hirst

    June 29, 2010 at 12:31 pm

  5. [...] in reporting and policy development by linking summary statistics back to the original data (e.g. So Where Do the Numbers in Government Reports Come From?), a point also raised in Open data requires responsible reporting… and the comments that follow [...]

  6. [...] there should be some transparency in declaring how the derived work was created (see for example: So Where Do the Numbers in Government Reports Come From? and Data is not [...]

  7. [...] traditional role of intermediating between government and public (on this subject, these posts by Tony Hirst and Tim Davies are worth [...]

  8. [...] For an example use case, see So Where Do the Numbers in Government Reports Come From?. [...]

  9. [...] 2) you can show your working. Reports often contain summary data tables or charts generated from raw data sets. The tables and charts appear in PDF documents, and the raw data is dumped as one or more spreadsheets or database tables. The query that is used to generate the summary data table, or chart, or as in the case above, the profit and loss, is typically not released. And that’s the bit that needs to be transparent at least as much as, if not more so, than the data. (I referred to this as a query path in Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping; see also So Where Do the Numbers in Government Reports Come From?). [...]

  10. [...] UKGov has posted a certain amount of costings related data around website provision (for example, So Where Do the Numbers in Government Reports Come From?), so if there are any armchair web analysts/auditors out there (unlikely, I know;-), it seems as if [...]


Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 784 other followers

%d bloggers like this: