Why Open Data Dumps On Their Own Add Little to Transparency…
“2012 will be the year where folk realise there’s more to transparent data release than dropping huge data tables”
Along with the “it’ll bootstrap innovation” chant, one of the oft-made claims about the release of open public data is that it’ll be a great boon to the “cause” of transparency. Publishing data in electronic form under an open license is a start. But actually trying to make use of public data releases can often be a long, hard slog: coping with non-obvious character encodings and data layouts that are all over the place; reconciling column headings and sheet numbers with explanatory keys provided in a separate document; trying to make sense of spreadsheet cell formats that mask the form the data is actually in; [ADD YOUR FAVOURITE BUGBEAR HERE]…
As an example of how crazy things can get, take this tweet from @objectgroup yesterday:
"Now I'm FOI requesting better guidance on getting useful data out of COINS http://www.whatdotheyknow.com/request/cash_flow_and_balance_sheet_for#outgoing-173108"
Here are the highlights (for me) of that request:
Thank you for your reply to my Freedom of Information Request.
You replied to say that the information I requested would be available in the next COINS release on the data.gov.uk website.
However the guidance provided to the COINS data …is insufficient to extract the cash flow, balance sheet and profit and loss statement for each of the 1,500 public bodies included in
Could you either:
1) update the guidance so it includes a section on how to extract the cash flow, balance sheet and profit and loss statement for each of the 1,500 public bodies included in the WGA
2) add the cash flow, balance sheet and profit and loss statement for each of the 1,500 public bodies included in the WGA to the COINS page of data.gov.uk
So what’s the solution? Many software developers are familiar with the notion of an API, an online web service that computer programs can talk to. Services like Facebook publish comprehensive APIs that let third-party developers build services on top of the Facebook platform, pulling data from it and writing data to it. To make life easier for those developers, API publishers often provide software libraries that make the API easy to use, or code examples that show how to achieve a few common tasks and so at least get you started (see, for example, The Six Pillars of Complete Developer Documentation or Web API Documentation Best Practices).
So in the context of open data, how can we make life easier? Publishing example use cases, even really, really simple ones (especially really, really simple ones!) along with the data is one way, and it achieves at least the following:
1) you actually have to use the data you’re releasing, even if just in a toy way. So if you find a problem with accessing the data, or how the data is represented, chances are your users will find a problem with it too. And it might be that there’s a really quick fix to the problem. Like fixing a broken link, or checking the filetype… (As a practical step, try this: if you publish a spreadsheet via a link, click on the link, download the file, and just see if you can open it… Or see if you can view it using something like Zoho viewer.)
2) you can show your working. Reports often contain summary data tables or charts generated from raw data sets. The tables and charts appear in PDF documents, and the raw data is dumped as one or more spreadsheets or database tables. The query that is used to generate the summary data table, or chart, or, as in the case above, the profit and loss statement, is typically not released. And that’s the bit that needs to be transparent at least as much as the data, if not more so. (I referred to this as a query path in Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping; see also So Where Do the Numbers in Government Reports Come From?).
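To give a flavour of what a really, really simple published use case might look like, here is a minimal Python sketch covering both points: it checks that the file can actually be read and has the columns the guidance says it has, and it spells out the query behind a summary figure. Everything in it is made up for illustration: the sample data, the column names (body, category, amount) and the function names load_rows and net_by_body are assumptions, not the layout of COINS or any other real release.

```python
import csv
import io
from collections import defaultdict

# Hypothetical stand-in for a published data file; a real release would be
# downloaded from the publisher's link instead of being embedded here.
SAMPLE_CSV = """body,category,amount
Dept A,income,120.0
Dept A,expenditure,-80.0
Dept B,income,50.0
Dept B,expenditure,-65.5
"""

# Column names the (hypothetical) guidance document promises.
EXPECTED_COLUMNS = ["body", "category", "amount"]


def load_rows(text):
    """Parse the CSV text and fail loudly if the documented columns are missing."""
    reader = csv.DictReader(io.StringIO(text))
    if reader.fieldnames != EXPECTED_COLUMNS:
        raise ValueError(f"unexpected columns: {reader.fieldnames}")
    return list(reader)


def net_by_body(rows):
    """The 'query path': net amount per body, summed over all categories."""
    totals = defaultdict(float)
    for row in rows:
        totals[row["body"]] += float(row["amount"])
    return dict(totals)


if __name__ == "__main__":
    rows = load_rows(SAMPLE_CSV)
    for body, net in sorted(net_by_body(rows).items()):
        print(f"{body}: {net:.2f}")
```

A script like this, published next to the data, doubles as a smoke test for the publisher (if load_rows falls over, so will your users) and as an audit trail for the reader, who can see exactly how each summary number was derived from the raw rows.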
PS with a bit of luck, the new UK Gov Open Standards Hub will play some sort of role in identifying and improving best practice in data release, and maybe also in raising awareness of good practice conventions that can make life easier for users…