So What’s Open Government Data Good For? Government and “Independent Advisers”, maybe?

Although I got an invite to today’s “Government Transparency: Opening Up Public Services” briefing, I didn’t manage to attend (though I’m rather wishing I had); I did, however, manage to keep up with what was happening through the #openuk hashtag commentary.

#openuk tweeps

It all kicked off with the Prime Minister’s Letter to Cabinet Ministers on transparency and open data, which sets out the roadmap for government data releases over the coming months in the areas of health, education, criminal justice, transport and public spending; it also sets the scene for the forthcoming Open Public Services White Paper (see also the public complement to that letter: David Cameron’s article in The Telegraph on transparency).

The Telegraph article suggests there will be a “profound impact” in four areas:

– First, it will enable choice, particularly for patients and parents. …
– Second, it will raise standards. All the evidence shows that when you give professionals information about other people’s performance, there’s a race to the top as they learn from the best. …
– Third, this information is going to help us mend our economy. To begin with, it’s going to save money. Already, the information we have published on public spending has rooted out waste, stopped unnecessary duplication and put the brakes on ever-expanding executive salaries. Combine that with this new information on the performance of our public services, and there will be even more pressure to get real value for taxpayers’ money.
– But transparency can help with the other side of the economic equation too – boosting enterprise. Estimates suggest the economic value of government data could be as much as £6 billion a year. Why? Because the possibilities for new business opportunities are endless. Imagine the innovations that could be created – the apps that provide up-to-date travel information; the websites that compare local school performance. But releasing all this data won’t just support new start-ups – it will benefit established industries too.

David Cameron’s article in The Telegraph on transparency

All good stuff… all good rhetoric. But what does that actually mean? What are people actually going to be able to do differently, Melody?

As far as I can tell, the main business models for making money on the web are:

sell the audience: the most obvious example of this is to sell adverts to the visitors of your site. The rate advertisers pay depends on the number of people who see the ads, and on how specific the audience is (different media attract different, possibly niche, audiences; if an audience is the one you’re particularly trying to target, you’re likely to pay more than you would for a general audience, in part because it means you don’t have to go out and find that focussed audience yourself). Another example is to sell information about the users of your site (for example, banks selling shopping data).

take a cut: so for example, take an affiliate fee, referral fee or booking fee for each transaction brokered through your site, or levy some other transaction cost.

Where data is involved, there is also the opportunity to analyse other people’s data and then sell analysis of that data back to the publishing organisations as consultancy. Or maybe use that data to commercial advantage in putting together tenders and approaches to public bodies?

When all’s said and done, though, the biggest potential is surely within government itself? When data from one department or agency is made openly available, other departments and agencies have easier access to it. Within departments and agencies too, open data has the potential to reduce friction and barriers to access, as well as revealing the very existence of data sets that may be being created in duplicate across different areas of government.

By consuming their own and each others’ open data, departments will also start to develop processes that improve the cleanliness and quality of data sets (for example, see Putting Public Open Data to Work…? and Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping; Library Location Data on data.gov.uk gives examples of how the same data can be released in several different (i.e. not immediately consistent) ways).

I’m more than familiar with the saying that “the most useful thing that can be done with your data will probably be done by someone else”, but if an organisation can’t find a way to make use of its own data, why should anyone else even try?! Especially if it means they have to go through the difficulty of cleaning the published data and preparing it for first use. By making use of open data as part of everyday government processes: a) we know the data’s good (hopefully!); b) cleanliness and inconsistency issues will be detected by the immediate publisher/user of the data; c) we know the data will have at least one user.

Finally, one other thing that concerns me is the extent to which “the public” want access to data in order to provide choice. As far as I can tell, choice is often the enemy of contentment; choice can sow the seeds of doubt and inner turmoil when to all intents and purposes there is no choice. I live on an island with a single hospital and not the most effective of rural transport systems. I’d guess the demographics of the island skew old and poor. So being able to “choose” a hospital with performance figures better than the local one for a given procedure is quite possibly no choice at all if I want visitors, or to be able to attend the hospital as an outpatient.

But that’s by the by: because the real issues are that the data that will be made available will in all likelihood be summary statistic data, which actually masks much of the information you’d need to make an informed decision; and if there is any meaningful intelligence in the data, or its summary statistics, you’ll need to know how to interpret the statistics, or even just read the pretty graphs, in order to take anything meaningful from them. And therein lies a public education issue…

Maybe then, there is a route to commercialisation of public facing public data? By telling people the data’s there for them to make the informed choice, the lack of knowledge about how to use that information effectively will open up (?!) a whole new sector of “independent advisers”: want to know how to choose a good school? Ask your local independent education adviser; they can pay for training on how to use the monolithic, more-stats-than-you-can-throw-a-distribution-at one-stop education data portal and charge you to help you decide which school is best for your child. Want comforting when you have to opt for treatment in a hospital that the league tables say is failing? Set up an appointment with your statistical counsellor, who can explain to you that actually things may not be so bad as you fear. And so on…

Government Spending Data Explorer

So… the UK Gov started publishing spending data for at least those transactions over £25,000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:

Guardian/gov datastore explorer

The app is a minor reworking of my Guardian datastore explorer, which put a query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and used that feed to populate a spreadsheet selector:

Guardian datastore selector - gov spending data
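(For the record, grabbing the bookmark list from delicious as JSON only takes a few lines; here’s a minimal sketch, assuming the delicious v2 JSON feed endpoint and using a placeholder username and tag:)

import json
import urllib

# Placeholder username and tag: the real app uses my own delicious bookmarks
feed_url = "http://feeds.delicious.com/v2/json/someuser/sometag"

# Each bookmark in the feed carries a URL ("u") and a description ("d")
bookmarks = json.load(urllib.urlopen(feed_url))
for b in bookmarks:
    print b["d"], b["u"]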

When you select a spreadsheet, you can preview the column headings:

Datastore explorer - preview

Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:

Education spend - over 100 million

The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:

Datastore explorer - build a query

You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.
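(Behind the scenes, those bookmarkable views are just Google Visualization API query URLs; here’s a rough sketch of how one might be put together, with a placeholder spreadsheet key and column letters that will vary from sheet to sheet:)

import urllib

key = "SPREADSHEET_KEY"  # placeholder for a departmental spreadsheet key

# Column letters are spreadsheet-specific; this assumes the supplier name
#   is in column B and the amount in column D
query = "select B,D where D > 100000000"

url = "http://spreadsheets.google.com/tq?" + urllib.urlencode(
    {"tqx": "out:csv", "tq": query, "key": key})
print url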

Several other example queries are given at the bottom of the data explorer page.

For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:

Education spend - over 250 million

Here’s how we construct the query:

Education - query spend over 250 million

If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why it’s interesting. I’ll try to add interesting examples to the app’s list of example queries.

Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)

Practical Data Scraping – UK Government Transparency Data (Minister’s Meetings)

Earlier this week, I came across the Number 10 website’s transparency data area, which among other things has a section on who Ministers are meeting.

Needless to say, the Who’s Lobbying website has started collating this data and making it searchable, but I thought I’d have a look at the original data to see what it would take to aggregate the data myself using Scraperwiki.

The Number 10 transparency site provides a directory to Ministers’ meetings by government department on a single web page:

Number 10 transparency - ministers meetings

The links in the Ministers’ meetings, Ministers’ hospitality, Ministers’ gifts and Ministers’ overseas travel columns all point directly to CSV files. From inspecting a couple of the Ministers’ meetings CSV files, it looks as if they may be being published in a standardised way, using common column headings presented in the same order:

Ministers' meetings transparency data - csv format

Except that: some of the CSV files appeared to have a blank row between the header and the data rows, and at least one table had a blank row immediately after the data rows, followed by some notes in cells that did not map onto the semantics of the corresponding column headers. Inspecting the data, we also see that once a minister is identified, there is a blank in the first (Minister) column, so we must presumably assume that the following rows relate to meetings that minister had. When the data moves on to another minister, that Minister’s name/position is identified in the first column, once again then followed by blank “same as above” cells.

To get the data into Scraperwiki means we need to do two things: extract meeting data from a CSV document and get it into a form whereby we can put it into the Scraperwiki database; and scrape the Number 10 Ministers’ meetings webpage to get a list of the URLs that point to the CSV files for each department. (It might also be worth scraping the name of the department, and adding that as additional metadata to each record pulled out from the CSV docs.)
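(The second of those steps is described in a later post, but in essence it is just a matter of pulling out every link on the page that ends in .csv; a minimal sketch, assuming BeautifulSoup and with the actual page URL left as a placeholder, might look like:)

import urllib
from BeautifulSoup import BeautifulSoup

# Placeholder: the Number 10 Ministers' meetings page URL goes here
PAGE_URL = "http://www.number10.gov.uk/..."

soup = BeautifulSoup(urllib.urlopen(PAGE_URL).read())

# Grab the href of every anchor that points at a CSV file
csv_urls = [a['href'] for a in soup.findAll('a', href=True)
            if a['href'].endswith('.csv')]
print csv_urls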

Here’s the Scraperwiki code I used to scrape the data. I tried to comment it, so it’s worth reading through even if you don’t speak Python, because I’m not going to provide any more description here…;-)

import urllib
import csv
import md5
import scraperwiki


url = "http://download.cabinetoffice.gov.uk/transparency/co-ministers-meetings.csv"
# I have started just looking at data from one source.
# I am assuming, (dangerously), that the column headings are:
#   a) the same, and 
#   b) in the same order
# for different departments

data = csv.DictReader(urllib.urlopen(url))

# Fudge to cope with possibility of blank row between header and first data row
started=False

# Inspection of the data file suggests that when we start considering a Minister's appointments,
#   we leave the Minister cell blank to mean "same as above".
# If we want to put the Minister's name into each row, we need to watch for that. 
minister=''

for d in data:
    if not started and d['Minister']=='':
        # Skip blank lines between header and data rows
        continue
    elif d['Minister']!='':
        # A new Minister is identified, so this becomes the current Minister of interest
        if not started:
            started=True
        minister=d['Minister']
    elif d['Date']=='' and d['Purpose of meeting']=='' and d['Name of External Organisation']=='':
        # Inspection of the original data file suggests that there may be notes at the end of the CSV file...
        # One convention appears to be that notes are separated from data rows by at least one blank row
        # If we detect a blank row within the dataset, then we assume we're at data's end
        # Of course, if there are legitimate blank rows within the table, we won't scrape any of the following data
        # We probably shouldn't discount the notes, but how would we handle them?!
        break
    print minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']
    id='::'.join([minister,d['Date'],d['Purpose of meeting'],d['Name of External Organisation']])
    # The md5 function creates a unique ID for the meeting
    id=md5.new(id).hexdigest()
    # Some of the original files contain some Latin-1 characters (such as right single quote, rather than apostrophe)
    #   that make things fall over unless we handle them...
    purpose=d['Purpose of meeting'].decode('latin1').encode('utf-8')
    record={'id':id,'Minister':minister,'date':d['Date'],'purpose':purpose,'lobbyist':d['Name of External Organisation'].decode('latin1').encode('utf-8')}
    # Note that in some cases there may be multiple lobbyists, separated by a comma, in the same record.
    # It might make sense to generate a meeting MD5 id using the original record data, but actually store
    #   a separate record for each lobbyist in the meeting (i.e. have lobbyists and lobbyist columns) by separating on ','
    # That said, there are also records where a comma separates part of the title or affiliation of an individual lobbyist.
    # A robust convention for separating different lobbyists in the same meeting (e.g. ';' rather than ',') would help

    scraperwiki.datastore.save(["id"], record) 

for d in data:
    #use up the generator, close the file, allow garbage collection?
    continue

Here’s a preview of what the scraped data looks like:

Ministers' meetings datascrape - scraperwiki

Here’s the scraper itself, on Scraperwiki: UK Government Transparency Data – Minister’s Meetings Scratchpad

Assuming that the other CSV files are all structured the same way as the one I tested the above scraper on, we should be able to scrape meeting data from other departmental spreadsheets using the same script. (Note that I did try to be defensive in the handling of arbitrary blank lines between the first header row and the data.)

One problem arises in the context of meetings with more than one person. Ideally, I think there should be a separate row for each person attending, so for example, the roundtable in June 2010 between the Parliamentary Secretary (Minister for Civil Society), Nick Hurd MP, and National Voices, MENCAP, National Council of Voluntary Organisations, St Christopher’s Hospice, Diabetes UK, Place 2 Be, Terrence Higgins Trust, British Heart Foundation, Princess Royal Trust for Carers and Clic Sargent might be mapped to separate data rows for each organisation present. If we take this approach, it might also make sense to ensure that each row carries with it a meeting ID, so that we can group all the rows relating to a particular meeting (one row for each group in the meeting) on meeting ID.

However, there is an issue in identifying multiple attendee meetings. In the above example, we can simply separate the groups by splitting the attendees lists at each comma; but using this approach would then mean that the meeting with Secretary General, Organisation of the Islamic Conference, Ekmelledin Ihsanoglu would be mapped onto three rows for that meeting: one with Secretary General as an attendee, one with Organisation of the Islamic Conference as an attendee, and finally one with Ekmelledin Ihsanoglu identified as an attendee…

What this suggests to me is that it would be really handy (in data terms) if a convention were used in the attendees column that separated representation from different organisations with a semi-colon, “;”. We can then worry about how to identify numerous individuals from the same organisation (e.g. J Smith, P Brown, Widget Lobbying group), or how to pull out roles from organisations (Chief Lobbyist, Evil Empire Allegiance), names and roles from organisations (J Smith, Chief Lobbyist, UN Owen, Head Wrangler, Evil Empire Allegiance) and so on…
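(To make that concrete, here’s a fragment – not part of the scraper above – sketching how a semicolon convention would simplify the attendee splitting, with a comma fallback for the data as currently published, and a meeting ID carried on each generated row:)

def split_attendees(names):
    # If ';' were adopted as the organisation separator, splitting would be safe;
    #   falling back on ',' risks breaking up single attendees such as
    #   "Secretary General, Organisation of the Islamic Conference"
    sep = ';' if ';' in names else ','
    return [n.strip() for n in names.split(sep) if n.strip()]

def attendee_records(meeting_id, record):
    # Emit one row per organisation, sharing the meeting id so that all the
    #   rows for a given meeting can be regrouped later
    for org in split_attendees(record['lobbyist']):
        row = dict(record)
        row['lobbyist'] = org
        row['meeting_id'] = meeting_id
        yield row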

And I know, I know… the Linked Data folk would be able to model that easily.. but I’m talking about quick and dirty typographical conventions that can be easily used in simple CSV docs that more folk are comfortable with than are comfortable with complex, explicitly structured data…;-)

PS I’ll describe how to scrape the CSV urls from the Number 10 web page, and then loop through all of this to generate a comprehensive “Ministers’ meetings” database in a later post…

PPS a really informative post on the Who’s Lobbying blog goes into further detail about some of the “pragmatic reuse” problems associated with the “Ministers’ meetings” data released to date: Is this transparency? No consistent format for 500 more UK ministerial meetings.

Show Me the Data – But Not All of It (Just a Little Bit…)

Over the weekend, I managed to catch up with open data advocate Rufus Pollock for a bit of a chat on all manner of things. One of the things that came up in conversation related to a practical issue around the ability to preview data quickly and easily without having to download and open large data files that might turn out not to contain the data you were looking for.

When building data handling applications, it can also be useful to get your code working on a small number of sampled data rows, rather than a complete data file.

Anyway, here’s one possible solution if your data is in a Google Spreadsheet – a URL pattern that will provide you with an HTML preview of the first ten lines of a sheet:

http://spreadsheets.google.com/tq?tqx=out:html&tq=select%20*%20limit%2010&key=SPREADSHEET_KEY&gid=SHEET_NUMBER

What it does is look up a particular sheet in a public/published Google spreadsheet, select every column (select *) and then limit the display to the first 10 rows (limit 10 – just change the number to preview more or fewer rows). If there is only one sheet in the spreadsheet, or to display the first sheet, you can remove the &gid=SHEET_NUMBER part of the URL.

And if you’d rather have CSV data than an HTML preview, just change the out:html switch to out:csv.
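(By way of a quick demo, a few lines of Python are enough to pull down and print such a preview; the spreadsheet key is a placeholder:)

import csv
import urllib

key = "SPREADSHEET_KEY"  # any public/published Google spreadsheet key
url = ("http://spreadsheets.google.com/tq?tqx=out:csv"
       "&tq=select%20*%20limit%2010&key=" + key)

# Print the header row plus the first ten data rows
for row in csv.reader(urllib.urlopen(url)):
    print row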

So for example, here’s a preview of the “GDP budgetiser” spreadsheet maintained by the WhereDoesMyMoneyGo folk (and described here):

Google spreadsheet preview

It seems to me that if data is being published in a Google doc, then in some situations it might also make sense to either link to, or display, a sample of the data so that folk can check that it meets their expectations before they download it. (I’ve noticed, for example, that even with CSV, many browsers insist on downloading the doc – depending on the MIME type or server settings – so that you then have to open it up in another application, rather than just letting you preview it in the browser. Which is to say, if you have to keep opening up docs elsewhere, it makes browsing hard and can be a huge time waster. It can be particularly galling when a downloaded file – especially a large one – turns out to contain data you’re not interested in.)

Just by the by, I had thought that Google did a spreadsheet previewer that could take a link to an online spreadsheet or CSV document and preview it in Google Spreadsheets, but I must have misremembered. The Google Docs Viewer only seems to preview “PDF documents, PowerPoint presentations, and TIFF files”. For reference, the URL pattern is of the form http://docs.google.com/viewer?url=ESCAPED_PDF_URL

However, the Zoho Excel Viewer does preview public online Excel docs, along with CSV and OpenOffice Calc docs, using the URL pattern: http://sheet.zoho.com/view.do?url=SPREADSHEET_URL. (Apparently, you can also import docs into your Zoho account using this construction: http://sheet.zoho.com/import.do?url=SPREADSHEET_URL). So for example, here’s a preview of the meetings that Cabinet Office ministers have had recently (via the new Number 10 Transparency website):

Previewing CSV in Zoho
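(The only gotcha is remembering to escape the target URL before passing it as a parameter; for example:)

import urllib

csv_url = "http://download.cabinetoffice.gov.uk/transparency/co-ministers-meetings.csv"

# Escape the target URL so it can be passed safely as a query parameter
zoho_preview = "http://sheet.zoho.com/view.do?url=" + urllib.quote(csv_url, safe="")
print zoho_preview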

Finally, one of the issues we’ve been having on WriteToReply is how to handle data nicely. The theme we’re using was conflicting (somehow) with Google spreadsheet embed codes, but when I came across this shortcode plugin earlier today, it struck me that it might help…? It simply takes a Google spreadsheet key (using the pattern [gdoc key=”ABCDEFG”]) and then inserts the necessary HTML embedding tags: WordPress plugin: inline Google Spreadsheet Viewer

However, it also struck me that a few tweaks to that plugin could probably also provide a preview view of the data, showing for example the first 10 lines or so of data in a data file. Providing such a preview view over a sample of data in a data file, maybe in a by default collapsed section of a page, might be something worth exploring in CKAN and data.gov.uk data catalogue pages?

PS it just occurred to me: Scraperwiki offers just this sort of data preview:

Scraperwiki data preview

Maybe the UI patterns are starting to form and will then start to fall into place…?;-)

UPDATE 28/11/10: I just noticed that the data.gov.uk site is now offering a link to preview xls docs, at least, on Zoho:

Spreadsheet preview on data.gov.uk

Useful…:-)

UK Open Data Guidance Resources

This is a live post where I will try to collect together advice relating to the release and use of open public data in the UK, as much for my own reference as anything… (I guess this really should be a wiki page somewhere…?)

“Top Level” URL Conventions in Local Council Open Data Websites

A few days ago, I had reason to start pondering URI schemes for open data released by educational institutions. The OU, like a couple of other HEIs, is looking at structuring – and opening up – various sorts of data, and there are also mutterings around what a data.ac.uk styled site might have to offer.

Being a lazy sort, it seems to me that in figuring out how we might collate data from across the ac.uk environment, we could look to the gov.uk environment. So for example, data.gov.uk acts as a central index over data from both central and local government, which each have their own concerns and, within a type, are likely to share some common features: all local councils will have some of the same sort of data to share; government departments might share some requirements for consistent, centralised reporting (such as website costs and usage) as well as their own peculiar data releases; and so on. In the ac.uk context, we have the HEIs (and FE colleges) in one set, and research councils and other project related funding bodies in another.

If we look to local council data, we can also spot intermediate layers appearing that apply a canonical structure to a range of variously published data from the local councils. For example, Openly Local is making a play to act as the canonical source for a whole range of local council data across all the UK’s councils; the Police API “allows you to retrieve information about neighbourhood areas in all 43 English & Welsh police forces”; RateMyPlace is a “one stop shop for information on Food Safety Inspections in Staffordshire”, aggregating information from several councils and representing it via a single API; and so on. (For an example of how different councils can publish ostensibly the same data in a wide variety of formats, see Library Location Data on data.gov.uk.)

Looking at the list of local councils with open data sites as collected on the OpenlyLocal open data scoreboard (and as extracted from the OpenlyLocal API via a Yahoo Pipe), do any conventions appear to be emerging in the location of local council open data homepages?

http://www.aberdeencity.gov.uk/open_data/open_data_home.asp (Aberdeen City Council)
http://www.bournemouth.gov.uk/Data/ (Bournemouth Borough Council)
http://www.bristol.gov.uk/opendata (Bristol City Council)
http://www.darlington.gov.uk/Generic/Info/opendata.htm (Darlington Borough Council)
http://www.eaststaffsbc.gov.uk/opendata/Pages/default.aspx (East Staffordshire Borough Council)
http://eastsussex.gov.uk/about/standards/opendata.htm (East Sussex County Council)
http://www.eden.gov.uk/about-this-site/open-data/ (Eden District Council)
http://data.london.gov.uk/ (Greater London Authority)
http://picandmix.org.uk/ (Kent County Council)
http://www2.lichfielddc.gov.uk/data/ (Lichfield District Council)
http://data.lincoln.gov.uk/ (Lincoln City Council)
http://www.brent.gov.uk/xml (London Borough of Brent)
http://www.hillingdon.gov.uk/data (London Borough of Hillingdon)
http://www.sutton.gov.uk/index.aspx?articleid=10077 (London Borough of Sutton)
http://www.rbwm.gov.uk/web/transparency.htm (Royal Borough of Windsor and Maidenhead)
http://www.salford.gov.uk/opendata.htm (Salford City Council)
http://www.stratford.gov.uk/opendata (Stratford-on-Avon)
http://www.sunderland.gov.uk/localpublicdata (Sunderland City Council)
http://www.trafford.gov.uk/opendata/ (Trafford Council)
http://opendata.walsall.org.uk/ (Walsall Metropolitan Borough Council)
http://opendata.warwickshire.gov.uk/ (Warwickshire County Council)
http://www.westberks.gov.uk/index.aspx?articleid=20365 (West Berkshire Council)

With only a small number of councils fully engaged, as yet, with open data, no dominant top level naming scheme has yet appeared, although there are a couple of early runners: an opendata path on the main council website (e.g. http://www.bristol.gov.uk/opendata, http://www.stratford.gov.uk/opendata, http://www.trafford.gov.uk/opendata/), and a data. or opendata. subdomain (e.g. http://data.london.gov.uk/, http://data.lincoln.gov.uk/, http://opendata.walsall.org.uk/, http://opendata.warwickshire.gov.uk/).

As yet, there is no agreement between these naming approaches, or on variants within them (opendata vs. open_data vs. data, subdomain vs. path, and so on).

Several other councils appear to be offering a specific page to handle (at the moment) open data issues (e.g. http://www.salford.gov.uk/opendata.htm or http://www.westberks.gov.uk/index.aspx?articleid=20365), or even separate domains for their data site (e.g. http://picandmix.org.uk/)
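(As a quick sanity check, we can tally the conventions in the list above programmatically; a throwaway sketch over a sample of the URLs:)

from urlparse import urlparse

urls = [
    "http://www.bristol.gov.uk/opendata",
    "http://data.london.gov.uk/",
    "http://data.lincoln.gov.uk/",
    "http://opendata.walsall.org.uk/",
    "http://www2.lichfielddc.gov.uk/data/",
    # ... and so on, for the rest of the list above
]

for url in urls:
    parts = urlparse(url)
    host, path = parts.netloc, parts.path.lower()
    if host.startswith("data.") or host.startswith("opendata."):
        print url, "-> data/opendata subdomain"
    elif "opendata" in path or "open_data" in path:
        print url, "-> opendata path"
    elif "/data" in path:
        print url, "-> data path"
    else:
        print url, "-> no obvious convention"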

Does any of this matter? At the top level, I’m not sure it does, except in setting expectations and providing a sound footing for a scaleable URI scheme. The Cabinet Office Guidance on designing URI sets, which outlines many considerations that need to be taken into account when defining URI schemes particularly for use as identifiers in RDF inspired Linked Data, suggests that domains should “[e]xpect to be maintained in perpetuity” and that “the choice of domain should provide the confidence to the consumer, …, the domain itself … convey[ing] an assurance of quality and longevity.”

In the foreseeable future, I suspect that (pragmatically) it is likely that the majority of data that will be released in the short term will be published as Excel spreadsheets or informally formatted CSV/TSV data, with some sites publishing raw XML. (As Library Location Data on data.gov.uk describes, even when councils ostensibly release the same sort of data, there is no guarantee that they will do it in similar ways: of the 5 councils publishing the locations of local libraries, 5 different data formats were used… ) It is unlikely that councils will be early adopters of Linked Data across the board. (If they were, it might be seen as excluding users in the short term, because while many people are familiar with working with spreadsheets (a widely adopted “end user” technology for people who work with data in their day job), familiar routes into and out of Linked Data stores are not there yet…) That said, if local councils do end up wanting to publish data with well formed URIs into the Linked Data space, it would be handy if their current URI scheme was designed with that in mind, and in such a way that the minting of future Linked Data URIs isn’t likely to conflict or clash.

Principles for, and Practicalities of, Open Public Data

Following the first meeting of the Public Sector Transparency Board last week, which is tasked with “driv[ing] forward the Government’s transparency agenda, making it a core part of all government business”, a set of 11 draft public data principles has been posted for comment on the data.gov.uk wiki: Draft Public Data Principles [wiki version]

Following the finest Linked Data principles, each draft principle has its own unique URI… err, only it doesn’t… ;-) [here’s how they might look on WriteToReply – WTR: Draft Public Data Principles – with unique URLs for each principle;-)]

The principles broadly state that users have a right to open public data, and that data should be published in ways that make it useable and useful (so machine readable, not restrictively licensed, easily discoverable and standards based, timely and fine grained). In addition, data underlying government websites will be made available (as, for example, in the case of the DirectGov Syndication API?). Other public bodies will be encouraged to publish inventories of their data holdings and make it available for reuse.

A separate blog post on the data.gov.uk blog describes some of the practical difficulties that local government offices might face when opening up their data: Publishing Local Open Data – Important Lessons from the Open Election Data project (Again, unique URLs for individual lessons are unavailable, but here’s an example of how we might automatically generate identifiers for them;-) WTR: Lessons Learned from the Open Election Data Project). The lessons learned include a lack of corporate awareness about open data issues (presumably there is a little more awareness since the PM’s letter to councils on opening up data), a lack of web skills and web publishing resources, not to mention a very limited understanding of, and tools available for the handling of, Linked Data.

As to what data might be opened up, particularly by local councils, Paul Clarke identifies several different classes of data (There’s data, and there’s data):

historical data;
planning data;
infrastructural data;
operational data.

My own take on it can be seen here:


(An updated version of this presentation, with full annotations, should be available in a week or two!)

Looking around elsewhere, Local government data: lessons from London suggests:

– “don’t start hiring big, expensive consultancy firms for advice”;
– “do draw on the expertise and learning already there”;
– “do remember that putting the data out there of itself is not enough – it must be predicated on a model of engagement.”

Picking up on that last point, the debate regarding the “usefulness” of data has a long way to run, I think? Whilst I would advocate lowering barriers to entry (which means making data discoverable, queryable (so particular views over it can be expressed), and available in “everyday” CSV and Excel spreadsheet formats), there is also the danger that if we require publishers to demonstrate that data is useful before it is released, they will be deterred from opening up data they don’t perceive to be useful. In this respect, the Draft Public Data Principle that states:

Public data policy and practice will be clearly driven by the public and businesses who want and use the data, including what data is released when and in what form – and in addition to the legal Right To Data itself this overriding principle should apply to the implementation of all the other principles.

should help ensure that an element of “pull” can be used to drive the release of data that others know how to make useful, or need to make something else useful…

On the “usefulness” front, it’s also worth checking out Ingrid Koehler’s post Sometimes you have to make useful yourself, which mentions existence value and accountability value, as well as value arising from “meta” operations such as the ability to compare data across organisations operating in similar areas (such as local councils, or wards within a particular council).

For my own part, I’ve recently started looking at ways in which we can generate transparency in reporting and policy development by linking summary statistics back to the original data (e.g. So Where Do the Numbers in Government Reports Come From?), a point also raised in Open data requires responsible reporting… and the comments that follow it. Providing infrastructure that supports this linkage between summary reported data and the formulas used to generate those data summaries is something that I think would help make open data more useful for transparency purposes, although it sits at a higher level than the principles governing the straightforward publication and release of open public data.

See also:
Ben Goldacre on when the data just isn’t enough without the working…
Publishing itemised local authority expenditure – advice for comment

So Where Do the Numbers in Government Reports Come From?

Last week, the COI (Central Office of Information) released a report on the “websites run by ministerial and non-ministerial government departments”, detailing visitor numbers, costs, satisfaction levels and so on, in accordance with COI guidance on website reporting (Reporting on progress: Central Government websites 2009-10).

As well as the print/PDF summary report (Reporting on progress: Central Government websites 2009-10 (Summary) [PDF, 33 pages, 942KB]), a dataset was also released as a CSV document (Reporting on progress: Central Government websites 2009-10 (Data) [CSV, 66KB]).

The summary report is full of summary tables on particular topics, for example:

TABLE 1: REPORTED TOTAL COSTS OF DEPARTMENT-RUN WEBSITES
COI web report 2009-10 table 1

TABLE 2: REPORTED WEBSITE COSTS BY AREA OF SPENDING
COI web report 2009-10 table 2

TABLE 3: USAGE OF DEPARTMENT-RUN WEBSITES
COI website report 2009-10 table 3

Whilst I firmly believe it is a Good Thing that the COI published the data alongside the report, there is still a disconnect between the two. The report is publishing fragments of the released dataset as information in the form of tables relating to particular reporting categories – reported website costs, or usage, for example – but there is no direct link back to the CSV data table.

Looking at the CSV data, we see a range of columns relating to costs, such as:

COI website report - costs column headings

and:

COI website report costs

There are also columns headed SEO/SIO, and HEO, for example, that may or may not relate to costs? (To see all the headings, see the CSV doc on Google spreadsheets).

But how does the released data relate to the summary reported data? It seems to me that there is a huge “hence” between the released CSV data and the summary report. Relating the two appears to be left as an exercise for the reader (or maybe for the data journalist looking to hold the report writers to account?).

The recently published New Public Sector Transparency Board and Public Data Transparency Principles, albeit in draft form, has little to say on this matter either. The principles appear to be focussed on the way in which the data is released, in a context free way (where by “context” I mean any of the uses to which government may be putting the data).

For data to be useful as an exercise in transparency, it seems to me that when government releases reports, or when government, NGOs, lobbyists or the media make claims using summary figures based on, or derived from, government data, the transparency arises from an audit trail that allows us to see where those numbers came from.

So for example, around the COI website report, the Guardian reported that “[t]he report showed uktradeinvest.gov.uk cost £11.78 per visit, while businesslink.gov.uk cost £2.15.” (Up to 75% of government websites face closure). But how was that number arrived at?

The publication of data means that report writers should be able to link to views over original government data sets that show their working. The publication of data allows summary claims to be justified, and contributes to transparency by allowing others to see the means by which those claims were arrived at and the assumptions that went into making the summary claim in the first place. (By summary claim, I mean things like “non-staff costs were X”, or the “cost per visit was Y”.)
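(For instance, a figure like “cost per visit” presumably reduces to a simple division over two released columns; even that trivial working is worth making explicit. The numbers below are invented for illustration:)

# Invented illustrative figures: the real working would reference the
#   actual cost and usage columns in the released CSV
total_cost = 11200000.0   # hypothetical reported annual cost of a site
visits = 951000           # hypothetical reported annual visits

cost_per_visit = total_cost / visits
print "Cost per visit (pounds): %.2f" % cost_per_visit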

[Just an aside on summary claims made by, or “discovered” by, the media. Transparency in terms of being able to justify the calculation from raw data is important because people often use the fact that a number was reported in the media as evidence that the number is in some sense meaningful and legitimately derived (“According to the Guardian/Times/Telegraph/FT”, etc etc etc). To a certain extent, data journalists need to behave like academic researchers in being able to justify their claims to others.]

In Using CSV Docs As a Database, I show how by putting the CSV data into a Google spreadsheet, we can generate several different views over the data using the Google Query language. For example, here’s a summary of the satisfaction levels, and here’s one over some of the costs:

COI website report - costs
select A,B,EL,EN,EP,ER,ET

[For more background to using Google spreadsheets as a database, see: Using Google Spreadsheets as a Database with the Google Visualisation API Query Language (via an API) and Using Google Spreadsheets Like a Database – The QUERY Formula (within Google spreadsheets itself)]

We can even have a go at summing the costs:

COI summed website costs
select A,B,EL+EN+EP+ER+ET
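(And because the query is carried in the URL itself, the “working” is shareable; a sketch of how the full query URL for that summed-costs view might be put together, with the spreadsheet key left as a placeholder:)

import urllib

key = "SPREADSHEET_KEY"  # placeholder for the COI data spreadsheet key

# The arithmetic is carried in the query itself, so anyone following the
#   link can see exactly which columns were summed
query = "select A,B,EL+EN+EP+ER+ET"
url = "http://spreadsheets.google.com/tq?" + urllib.urlencode(
    {"tqx": "out:html", "tq": query, "key": key})
print url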

In short, it seems to me that releasing the data as data is a good start, but the promise for transparency lies in being able to share queries over data sets that make clear the origins of data-derived information that we are provided with, such as the total non-staff costs of website development, or the average cost per visit to the blah, blah website.

So what would I like to see? Well, for each of the tables in the COI website report, a link to a query over the co-released CSV dataset that generates the summary table “live” from the original dataset would be a start… ;-)

PS In the meantime, to the extent that journalists and the media hold government to account, is there maybe a need for data journalysts (journalist+analyst portmanteau) to recreate the queries used to generate summary tables in government reports to find out exactly how they were derived from released data sets? Finding queries over the COI dataset that generate the tables published in the summary report is left as an exercise for the reader… ;-) If you manage to generate queries in a bookmarkable form (e.g. using the COI website data explorer; see also this for more hints), please feel free to share the links in the comments below :-)