Posts Tagged ‘opendata’
Accessing and Visualising Sentencing Data for Local Courts
A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]
In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries using Google Fusion Tables and R…
…but first, a couple of observations:
- the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
- the URL of the sentencing data is http://www.justice.gov.uk/downloads/publications/statistics-and-data/criminal-justice-stats/recordlevel.zip, which does not contain a time reference, although the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?
The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.
The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting into the data using tools I have to hand…
Unix Command Line Tools
I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).
In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running the following command:
head recordlevel.csv
Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:
grep -i wight recordlevel.csv > recordsContainingWight.csv
(The -i reads: “ignoring case”; grep is a command that identifies rows that contain the search term (wight in this case). The > recordsContainingWight.csv says “send the result to the file recordsContainingWight.csv”.)
Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or as a Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.
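If the Unix tools aren't to hand, much the same filter can be sketched in R without loading the whole file as a data frame. This is only a rough equivalent: filter_lines() and the toy file are made up for illustration, and unlike the grep command it also copies the header row across, so the result can be read straight back in with column names.

```r
# Sketch: filter a large CSV in chunks of lines, keeping the header row.
# filter_lines() is a hypothetical helper, not part of any package.
filter_lines <- function(infile, outfile, pattern, chunk = 10000) {
  con_in <- file(infile, open = "r")
  con_out <- file(outfile, open = "w")
  on.exit({ close(con_in); close(con_out) })
  # Copy the header row across unchanged
  writeLines(readLines(con_in, n = 1), con_out)
  kept <- 0
  repeat {
    lines <- readLines(con_in, n = chunk)
    if (length(lines) == 0) break
    # Case-insensitive match, like grep -i
    hits <- lines[grepl(pattern, lines, ignore.case = TRUE)]
    writeLines(hits, con_out)
    kept <- kept + length(hits)
  }
  kept
}

# Toy data standing in for recordlevel.csv (values are invented)
src <- tempfile(fileext = ".csv")
writeLines(c("court,AGE,sex",
             "Isle of Wight Magistrates,18-20,Male",
             "Leeds Crown Court,35+,Female",
             "ISLE OF WIGHT Crown Court,25-34,Female"), src)
dst <- tempfile(fileext = ".csv")
n_kept <- filter_lines(src, dst, "wight")
iw_toy <- read.csv(dst)
```

Reading in chunks keeps memory use low, which is the main reason to prefer this over read.csv() on a file of this size.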
Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):
We can also introduce filters to allow us to explore subsets of the data. For example, here are the offences committed by females aged 35+:
Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm, unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers maybe? As far as making comparisons go, we do have Court and Force columns, so it would be possible to compare Force against force and within a Force area, Court with Court?
R/RStudio
If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the recordlevel.csv dataset.
Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (I apply the regular expression to the contents of the court column):
recordlevel <- read.csv("~/data/recordlevel.csv")
iw=subset(recordlevel,grepl("wight",court,ignore.case=TRUE))
We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:
age=table(iw$AGE)
barplot(age, main="IW: Sentencing by Age", xlab="Age Range")
We can also start to look at combinations of factors. For example, how do offence types vary with age?
ageOffence=table(iw$AGE, iw$Offence_type)
barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))
If we remove the beside=T argument, we can produce a stacked bar chart:
barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))
If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:
# You may need to install ggplot2 first, e.g. install.packages("ggplot2")
library(ggplot2)
# Older ggplot2 releases used opts()/theme_text(); current ones use theme()/element_text()
ggplot(iw, aes(factor(Offence_type))) + geom_bar() + theme(axis.text.x=element_text(angle=-90)) + xlab('Offence Type')
Alternatively, we can break down offence types by age:
ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)
We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:
ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)
One thing I’m not sure how to do is rip the data apart in a ggplot context so that we can display percentage breakdowns, so we could compare the percentage breakdown by offence type on sentences awarded to males vs. females, for example? If you do know how to do that, please post a comment below ;-)
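One possible route to the percentage breakdown, sketched here on a toy data frame rather than the real data: cross-tabulate sex against offence type, convert the counts to within-sex proportions with prop.table(), and flatten the result to a long data frame that ggplot can use. The iw_toy values are invented; only the column names sex and Offence_type follow the ones used above.

```r
# Toy stand-in for the iw data frame (invented values)
iw_toy <- data.frame(
  sex = c("Male", "Male", "Male", "Male", "Female", "Female"),
  Offence_type = c("Theft", "Theft", "Drugs", "Motoring", "Theft", "Drugs")
)

# Cross-tabulate, then convert counts to within-sex proportions (each row sums to 1)
counts <- table(iw_toy$sex, iw_toy$Offence_type, dnn = c("sex", "Offence_type"))
props <- prop.table(counts, margin = 1)

# Long format: one row per sex/offence pair
pct <- as.data.frame(props, responseName = "proportion")
pct$percent <- 100 * pct$proportion

# Optional plot, built only if ggplot2 is installed; print(p) to draw it
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pct, aes(x = Offence_type, y = percent, fill = sex)) +
    geom_bar(stat = "identity", position = "dodge")
}
```

If all that's needed is the picture rather than the numbers, geom_bar(position="fill") on the raw rows, for example ggplot(iw, aes(sex, fill=Offence_type)) + geom_bar(position="fill"), should give stacked bars scaled to 100% for each sex.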
PS Here’s an easy way of getting started with ggplot… use the online hosted version at http://www.yeroon.net/ggplot2/ using this data set: wightCrimRecords.csv; download the file to your computer then upload it as shown below:
PPS I got a little way towards identifying percentage breakdowns using a crib from here. The following command:
iwp=tapply(iw$Offence_type,iw$sex,function(x){prop.table(table(x))})
generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this, but we can create separate ones for each sex as follows:
iwpMale=data.frame(iwp['Male'])
iwpFemale=data.frame(iwp['Female'])
We can then plot these percentages using constructions of the form:
ggplot(iwpMale)+geom_bar(aes(x=Male.x,y=Male.Freq),stat="identity")
What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame? If you know how, please add a comment below…(I also posted a question on Cross Validated, the stats bit of Stack Exchange…)
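A minimal sketch of one way to do that mapping, again on invented toy data: tapply() here returns a list with one proportion table per sex, so each element can be turned into a data frame, tagged with the sex it came from, and the pieces stacked with rbind(). The names iw_toy and iwp_df are illustrative.

```r
# Toy stand-in for the iw data frame (invented values)
iw_toy <- data.frame(
  sex = factor(c("Male", "Male", "Male", "Female", "Female")),
  Offence_type = factor(c("Theft", "Drugs", "Theft", "Theft", "Drugs"))
)

# Same construction as in the post: one proportion table per sex
iwp <- tapply(iw_toy$Offence_type, iw_toy$sex, function(x) prop.table(table(x)))

# Convert each per-sex table to a data frame, label it, and stack the results
iwp_df <- do.call(rbind, lapply(names(iwp), function(s) {
  d <- as.data.frame(iwp[[s]], responseName = "proportion")
  names(d)[1] <- "Offence_type"
  d$sex <- s
  d
}))
```

The combined frame has one row per sex/offence pair, so a single call along the lines of ggplot(iwp_df, aes(Offence_type, proportion)) + geom_bar(stat="identity") + facet_wrap(~sex) can replace the separate per-sex plots.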
Quick Core Dump of Idle Thoughts on the “Making Open Data Real” Consultation
Quick core dump of thoughts, largely culled from things I’ve doodled before… Needs much more work, but time is running out on me… So any/all comments appreciated….
Notes
***1. Do the definitions of the key terms go far enough or too far?
“Public services are either provided by public bodies, or providers who have been funded, commissioned or established by statute to provide a service”
I assume the definitions of open data and public services are to be taken together, with the consultation focussing on ‘open (public) data produced by public services’? For such bodies, I assume there is also a formal “data burden” that defines the public data reporting requirements to the centre, as well as devolved data burdens eg into local government from schools? Would it make sense to clarify the notion and extent of data burdens, and the extent to which elements of these (and the organisations they apply to) should be subject to open data requirements? I guess there is also a data burden placed on individual citizens in respect of filing tax forms, for example, that are not subject to openness requirements?
A clear statement at least of data burdens/formal reporting requirements between public bodies that are in scope for mandatory release as open public data should be made available, eg along the lines of http://www.communities.gov.uk/localgovernment/decentralisation/tacklingburdens/singledatalist/ http://getthedata.org/questions/500/data-burden-on-uk-higher-education (I know some work has already been done on this that I used as the basis for a simple data burden visualisation exercise ( http://www.flickr.com/photos/psychemedia/5536836259/ ).)
“Dataset”
It may be useful to distinguish between data collected for operational, administrative or statistical use, as well as the extent to which data produced in the normal course of events is being legitimately requested as is, or whether it must be processed before release (eg http://www.adls.ac.uk/what-is-administrative-data-and-why-use-it-for-research/ http://www.unsiap.or.jp/ms/ms7/DennisP1_OppoChalle.pdf ).
It may also be worth distinguishing between the release of complete data sets, views over the data that represent a query on a complete dataset, and queries, sampling procedures or any other means that are used to generate those data views. For example, providing data relating to performance indicators for a particular school in response to an FOI request from a citizen equates to the provision of a particular view over the database containing performance indicators for UK schools as a whole; providing a copy of the database as a whole to a developer of a school comparison website represents the provision of a complete dataset.
Datasets may provide value to others in a variety of ways, for example:
- using complete datasets as the basis of comparison or recommendation services;
- using complete datasets to support statistical analyses, segmentation/clustering of data;
- generating very particular or specific views over the dataset by constructing meaningful and appropriate queries on the datasets. Queries are also reusable, and whilst some cost may be incurred in creating them, making them open, and suitably parameterising them, the marginal cost of reusing the queries is then minimal. It is possible that queries that take a long time to create/optimise become valuable in their own right, and that the dataset and the view can be given away freely. The query unlocks value in the dataset and delivers it to the requester. When it comes to government reporting, where reports include summary views over open datasets, the openness/transparency requirement should not be deemed to be met unless the query that generates the view from the dataset is also openly published.
Datasets may also include recordsets relating to an individual; where personal access to personal data/mydata is possible, we need to distinguish between the private/personal right for an individual to access their data, or an agent acting on their behalf and with their permission, as opposed to general public access.
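To make the dataset/query/view distinction concrete, here is a rough sketch in R. Everything in it is hypothetical: the schools table, the mean_indicator() query and the publish_view() wrapper are invented for illustration. The point is simply that the published view carries the parameterised query that generated it, so anyone holding the dataset can rerun it.

```r
# Hypothetical complete dataset of school performance indicators
schools <- data.frame(
  school = c("A", "B", "C", "D"),
  region = c("North", "North", "South", "South"),
  indicator = c(62, 71, 58, 80)
)

# A reusable, parameterised query: mean indicator for one region
mean_indicator <- function(data, region_name) {
  mean(data$indicator[data$region == region_name])
}

# Publish a view together with the query and parameters that produced it
publish_view <- function(data, query, ...) {
  list(
    result = query(data, ...),
    query_source = deparse(query),
    parameters = list(...)
  )
}

view_north <- publish_view(schools, mean_indicator, region_name = "North")
```

Rerunning mean_indicator() with region_name = "South" costs next to nothing, which is the marginal-cost argument in miniature.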
****2. Where a decision is being taken about whether to make a dataset open, what tests should be applied?
If data is part of a formally defined data burden, should that data burden be tiered in terms of openness requirements, for example along lines of:
- open on submission to the centre;
- open following embargo period and subject to checking by the centre, but with the presumption that it will be opened;
- not open;
Where data is FOIable, that may be taken as evidence in favour of presumed openness. If data is regularly requested via FOI, it could be made available in open form as a matter of course in order to reduce FOI overheads in the future. When data is released via FOI, it could be made available via an open data site in partial fulfilment of handling the FOI request. When responding to FOI requests for data, the process required to obtain and release that data could be captured and compared with the actual processes relating to operational and administrative use of that data in order to identify whether an open data tap can be introduced into the current data process to open it as a matter of course, or release it efficiently in response to an FOI request.
As the major producer and consumer of public data, public bodies are well placed to benefit from more open public data. “Publicness” and “openness” both help make data accessible for use within and between public bodies, as well as reuse by third parties; accessibility is also improved by timely release of data, and the publication of data using open standards and formats.
Consequences of making data open should also be considered; for example, once released, will there be continued access to regular updates of the data using the same format? (If the data is released sporadically and with inconsistent formats, services that automate the regular collection of the data are not really viable).
****3. If the costs to publish or release data are not judged to represent value for money, to what extent should the requestor be required to pay for public services data, and under what circumstances?
Where work must be done that does not represent value for money (what would an example of this be? Having to get data into a form the public body would never use?), it may be appropriate to consider the amount of value that is added in processing the data that the requester might otherwise be expected to add, for example as just reward for the cost of processing that data. If the raw data is open, and the requester asks for processed data, it may be appropriate to give the raw data away freely but charge for the value add of processing it that the requester seeks to exploit in the course of their business? However, there will also be a tension between people who want to gain access to a small amount of data, either for personal use, research/innovation purposes, and companies who make use of that data in volume as part of a business. In the latter case, we might expect some payment for use of the data once the business is operating, although it could be argued that if the business is profitable, there is a return built in through taxation.
A balance may need to be struck based on the number of independent requests that are likely to be received for a particular data set and the use they wish to put it to. If N requests are made for the data, and all N parties need to do the same work cleaning or processing the data in the same way, that is obviously inefficient. It may be that third parties process and repackage data, for a fee. But the question arises – if data as published is not fit for use by third parties, is it fit for use by the first (producer) or second (‘official consumer’) party, or has the data been produced solely in response to some openness criteria, and not because the data is actually used for anything?
The ability to save cost elsewhere in government may also be an issue. For example, local authorities who make disbursements to care homes need to mitigate against fraud by regularly checking death reports, often through the purchase of commercial death registers or by checking the local newspaper’s death notices. Whilst a cost may be associated with signatories of death certificates ensuring this data enters the public body data chain in an accessible and open way, it may well save costs in multiple other areas of government.
Where data is processed and released in exchange for a payment, would it also be possible for the raw underlying data to also be made available for free so that third parties can, at their own expense, carry out the required processing if they can do so for less overall cost than piecewise purchase of data from the public body?
****4. How do we get the right balance in relation to the range of organisations (providers of public services) our policy proposals apply to? What threshold would be appropriate to determine the range of public services in scope and what key criteria should inform this?
If an organisation is subject to FOI requests, or data it produces and returns as part of an official data burden may be requested through FOI requests, it should be in scope?
Analysis of data processes associated with fulfilling data burden requirements might provide a basis for identifying where in a data process data might reasonably be made public and open.
*****5. What would be appropriate mechanisms to encourage or ensure publication of data by public service providers?
If data related FOI responses are published via open data sites, the open data site can become a repository of commonly requested data and help identify which processes might benefit from releasing open data as a matter of course.
Where public data is reported as a matter of course by the local press and in the local interest, (for example, court reports, planning notices, traffic notices), public bodies might be encouraged to publish the corresponding data in an open way in order to facilitate the local dissemination of that information. Note that much of this data is transitory/may only be relevant for a limited period. In this case, we need to consider: whether there is a public interest in making the data publicly available and open on an archival basis, or not providing archives per se, but responding to requests for archival copies of data; the extent to which third parties can archive/aggregate such data and continue to make it available; whether there are privacy reasons for not supporting archival access (for example, court reports in local newspapers have a “short memory”).
Are there guidelines available that cover the interactions between things like:
- data eligible for release under FOI;
- data that may be redacted on grounds of Data Protection Act
- data covered by Database Right or data that is covered by copyright
- data released through National Statistics ( http://www.legislation.gov.uk/ukpga/2007/18/contents )
- reusable public sector information ( http://www.legislation.gov.uk/uksi/2005/1515/contents/made )
Analysis of data burden reporting process might identify appropriate points at which data can be made open as part of the process. For example, reported data may be posted to an open data site from where it is collected (“pull reporting”). See also: http://blog.ouseful.info/2011/03/18/open-data-processes-taps-query-pathsaudit-trails-and-round-tripping/
And as I responded to the PDC Engagement Exercise, [o]ne particular class of data that interests me is data that is:
1) reported by a local organisation to a central body;
2) using a standardised, templated reporting format,
3) and that is FOIable either from the local organisation, and/or from the central body.
For example, in Higher Education, this might include data on library usage as reported to SCONUL, or marketing information about courses submitted to UCAS.
It can often be hard to find out how to phrase an FOI request to obtain this data as submitted, unless you know the type of reporting form used to submit it.
What I would like to see is the Public Data Corporation acting in part as a Public Data Exchange Directory, showing how different classes of public organisation make standard (public data containing) reports to other public organisations, detailing the standard report formats, with names/identifiers for those forms if appropriate, and describing which sections of the report are FOIable. This could also link in to the list of local council data burdens, for example ( http://www.communities.gov.uk/… and/or the code of practice for local authority transparency ( http://www.communities.gov.uk/… )
The next step would be to introduce a pubsub (publish-subscribe) model in the reporting chain for reporting documents* that are wholly FOIable. This could happen in several ways:
A) /open report publication/ – the publishing organisation could post their report to their opendata reporting store, and the consuming organisation (the one to which the report was being made) would subscribe to that store, collecting the data from there as it was published; third parties could also subscribe to the local publishing store and be alerted to reports as they are published. If co-publication to the central organisation and the public is not appropriate, the report could be withheld from public/press consumption for a specified period of days, or published to the press but not the public under embargo.
B) /open deposit/ – the publishing organisation publishes the report/data to an open deposit box owned by the central organisation which is receiving the report. After a specified period of time, the report is made public (ie published) via that central deposit box.
C) /data corp in the middle/ – a centralised architecture in which local organisations submit public reports to a Public Data Exchange, which then passes them on to the central body to which reports are made, and publishes them to the public, maybe after a fixed period of time.
The intention of all three approaches described above is to provide an open window onto the reporting chain. At the current time, open public data tends to be data that is published via a separate branch “to the public”. In contrast, the above approach suggests that public data publication acts as a view onto all or part of the data as it goes about its daily business being published from one organisation to another. That is, public data publication becomes a “tap” onto a dataflow/workflow process.
If one of the desires for data exploitation is to help introduce efficiencies as well as reuse in data related activities, third parties need to be able to work with data as it is currently used.
***How will we ensure that Open Data standards are embedded in new ICT contracts?
By providing a test suite as part of the contract that includes tests such as running data import/export/query operations against centralised validation services.
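As a very rough sketch of what such tests might look like, the R fragment below checks that a table survives a CSV export/import round trip and that a known query gives the expected answer. It runs locally against a temporary file, standing in for the centralised validation service; the function names, the spend table and the expected total are all invented for illustration.

```r
# Hypothetical contract test: data must survive a CSV export/import round trip
round_trip_ok <- function(df) {
  path <- tempfile(fileext = ".csv")
  on.exit(unlink(path))
  write.csv(df, path, row.names = FALSE, fileEncoding = "UTF-8")
  back <- read.csv(path, stringsAsFactors = FALSE, fileEncoding = "UTF-8")
  isTRUE(all.equal(df, back))
}

# Hypothetical query test: the data must answer a known question correctly
query_ok <- function(df, dept_name, expected_total) {
  isTRUE(all.equal(sum(df$amount[df$dept == dept_name]), expected_total))
}

# Invented sample data
spend <- data.frame(
  dept = c("Health", "Health", "Transport"),
  amount = c(100.5, 200.25, 50.75),
  stringsAsFactors = FALSE
)

results <- c(
  round_trip = round_trip_ok(spend),
  query = query_ok(spend, "Health", 300.75)
)
```

A supplier's system would pass only if every entry in results is TRUE; a real suite would need many more cases, including awkward character sets and missing values.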
***What is the best way to achieve compliance on high and common standards to allow usability and interoperability?
Require data reporting to proceed through open interfaces or interfaces where public data taps can be applied. Released data should be authentic, and representative of data used as part of a public body’s activities or reporting duties rather than data that is produced purely for release on an open data site.
***How would we ensure that public service providers in their day to day decisionmaking honour a commitment to Open Data, while respecting privacy and security considerations.
Take a lead from open source software projects and publish requests via an issue tracker that can show when an ‘issue’ was raised, what its current status is, and how it was resolved. Related approaches include services like WhatDoTheyKnow or GetTheData.
***How should public services make use of data inventories? What is the optimal way to develop and operate this?
If we distinguish between datasets, queries on datasets, and reports/data views generated by queries on datasets on the one hand, and data burdens on the other, we can start to map out how queries are used on datasets to generate reports that fulfil data burden requirements. That has the benefit of making the data burden fulfilment process more transparent, as well as contextualising both the way those reports are generated (through exposing the queries) and the original data sets used as a basis for creating reports.
***Should the data that government releases always be of high quality? How do we define quality? To what extent should public service providers “polish” the data they publish, if at all?
One rule of thumb is that the data should be “good enough”. The question then arises, ‘good enough for whom?’. If the data is released and never referred to, its quality is irrelevant as regards the non-existent on-users, although it may signal problems elsewhere. If data is used by a third party and found to contain errors or omissions, the question arises: does the publisher also suffer from those same quality issues (and if so, how are they handling them?); or are they using a different data set as part of the process that the released dataset relates to (and if so, why isn’t that data being released?)
There are different levels of cleanliness we may associate with data: a major issue in many datasets relates to the use of inconsistent labels to refer to the same entity (something that can be addressed by using universal persistent identifiers). Character set encodings can also cause problems, especially where it is hard to identify what character sets are used within a file.
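Both problems can be illustrated with a small R sketch. The label variants, the lookup table and the identifier "ORG-001" are invented; a real lookup would map onto whatever persistent identifier scheme the publisher uses. The second half assumes the incoming bytes are known to be Latin-1, which in practice is often exactly what is hard to establish.

```r
# Hypothetical lookup from normalised label variants to one persistent identifier
lookup <- c(
  "isle of wight council" = "ORG-001",
  "isle of wight" = "ORG-001",
  "iow council" = "ORG-001"
)

# Lower-case, trim, and collapse runs of whitespace before looking up
normalise_label <- function(x) {
  x <- tolower(trimws(x))
  gsub("[[:space:]]+", " ", x)
}

raw_labels <- c("Isle of Wight Council", "  isle of wight ", "IoW  Council",
                "Leeds City Council")
ids <- unname(lookup[normalise_label(raw_labels)])  # unmatched labels come back as NA

# Encoding: the bytes for "café" in Latin-1 are not valid UTF-8 until converted
latin1_bytes <- rawToChar(as.raw(c(0x63, 0x61, 0x66, 0xe9)))
utf8_text <- iconv(latin1_bytes, from = "latin1", to = "UTF-8")
```

Labels that fail to match are left as NA rather than guessed, so they can be reviewed by hand and added to the lookup.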
***How should government approach the release of existing data for policy and research purposes: should this be held in a central portal or held on departmental portals?
As I understand the current situation, public body reports often produce summary tables and as part of transparency requirements, release as public data raw datasets that are used to generate those summary tables. In such cases, the query used to generate the summary table from the raw data should also be published. The transparency does not come from releasing summary tables and saying “it summarises that pile of data”. It comes from saying – here is the summary, and here is how it was generated from that data, allowing the observer to check the assumptions of the query, redo the analysis, and so on.
Using services such as Google spreadsheets or Zoho spreadsheets, it is possible to provide a preview view over the data contained in a dataset made available as a simple CSV file (this approach is taken on some datastores). It is also possible to use services such as Google spreadsheets as a database, and so provide a certain level of intermediate developer access to the raw data as if read access were made available to the database that sourced the released data (eg http://blog.ouseful.info/2010/11/19/government-spending-data-explorer/ ). A range of powerful hosted statistical analysis and visualisation tools are now available that can also provide a user interface layer over data published in such environments (“analysis at the point of delivery”). For example, the popular R environment can provide an online statistical analysis UI to online hosted datasets via services such as http://www.stat.ucla.edu/~jeroen/ggplot2.html or http://www.rstudio.org/docs/server/getting_started . These tools provide an intermediary step that allows interested parties to explore datasets in situ. Recent developments with the Linked Data API ( http://www.epimorphics.com/web/tools/linked-data-api.html ) offer similar capabilities, including the ability to share persistent URLs to queries that are applied to public Linked Data stores such as those hosted under the data.gov.uk umbrella.
****Is there a role for government to stimulate innovation in the use of Open Data? If so, what is the best way to achieve this?
Allow free access to public data for personal, research, social enterprise and SME commercial research/development purposes. If the service using the data ever becomes popular, worry about how to charge for it then…
Two New Cabinet Office Open Data Consultations: Data Policy and Making Open Data Real
Via the Guardian Datablog, I see that the Cabinet Office has just opened up a couple of consultations around open data:
- Consultation on Data Policy for a Public Data Corporation [homepage] [Consultation]
Here are the consultation questions (also available via SurveyMonkey: PDC consultation):
Chapter 4 – Charging for PDC information
- How do you think Government should best balance its objectives around increasing access to data and providing more freely available data for re-use year on year within the constraints of affordability? Please provide evidence to support your answer where possible.
- Are there particular datasets or information that you believe would create particular economic or social benefits if they were available free for use and re-use? Who would these benefit and how? Please provide evidence to support your answer where possible.
- What do you think the impacts of the three options would be for you and/or other groups outlined above? Please provide evidence to support your answer where possible.
- A further variation of any of the options could be to encourage PDC and its constituent parts to make better use of the flexibility to develop commercial data products and services outside of their public task. What do you think the impacts of this might be?
- Are there any alternative options that might balance Government’s objectives which are not covered here? Please provide details and evidence to support your response where possible.
Chapter 5 – Licensing
- To what extent do you agree that there should be greater consistency, clarity and simplicity in the licensing regime adopted by a PDC?
- To what extent do you think each of the options set out would address those issues (or any others)? Please provide evidence to support your comments where possible.
- What do you think the advantages and disadvantages of each of the options would be? Please provide evidence to support your comments
- Will the benefits of changing the models from those in use across Government outweigh the impacts of taking out new or replacement licences?
Chapter 6 – Regulatory oversight
- To what extent is the current regulatory environment appropriate to deliver the vision for a PDC?
- Are there any additional oversight activities needed to deliver the vision for a PDC and if so what are they?
- What would be an appropriate timescale for reviewing a PDC or its constituent parts public task(s)?
And the second consultation (which is probably worth reading in the context of the http://www.cabinetoffice.gov.uk/resource-library/open-public-services-white-paper [white paper PDF, feedback website]?)
- Making Open Data Real: A Public Consultation [homepage] [Consultation]
- Glossary of key terms [link]
- An enhanced right to data: how do we establish stronger rights for individuals, businesses and other actors to obtain, use and re-use data from public service providers? [link]
- Setting transparency standards: what would standards that enforce this right to data among public authorities look like? [link]
- Corporate and personal responsibility: how would public service providers be held to account for delivering open data through a clear governance and leadership framework at political, organisational and individual level? [link]
- Meaningful Open Data: how should we ensure collection and publication of the most useful data, through an approach enabling public service providers to understand the value of the data they hold and helps the public at large know what data is collected? [link]
- Government sets the example: in what ways could we make the internal workings of government and the public sector as open as possible? [link]
- Innovation with Open Data: to what extent is there a role for government to stimulate enterprise and market making in the use of open data? [link]
I haven’t had chance to read through the consultation docs yet, but I’ll try and comment somewhere, as well as responding…
The way the consultations are presented
As to the way the consultations are presented themselves, two approaches have been taken:
- the PDC consultation embeds documents at chapter level hosted on Scribd in a preview widget, with questions made available via a Word document or via SurveyMonkey. There doesn’t appear to be an opportunity to comment on the BIS site that is hosting the PDC consultation, even though it’s a WordPress platform running the Commentariat2 theme. To my mind, the way this consultation has been published, it’s not really of the web, and, to use a technical term, feels a little bit horrible to me… Maybe they don’t want flame wars on the bis.gov.uk domain about “Charging for PDC information”?!;-)
- the Making it Real consultation is hosted on the data.gov.uk site, with HTML text split at “chapter” (section) level, and commenting at that level via a single bottom of the page comment box. Where documents take close reading, I think this makes commenting difficult: if you want to refer to specific, detailed points in the consultation document, I’d say it makes sense to be able to see comment at the point of reference. That is, the comment box needs to be where you can see the actual bit of text you are commenting on (which is one reason why we often annotate documents with marginalia, rather than on a separate piece of paper). Where the comment box is fixed at the bottom of the page, you need two windows open to have side by side commenting and viewing of the actual text you are commenting on.
If we hadn’t decided that things had moved on enough in the way consultations were being handled to close WriteToReply (WriteToReply is closing. Come get your data if you want it), I think there’s a good chance we would have hosted both these consultations… Maybe our thinking that WriteToReply had nudged things far enough was a bit hasty? (The digress.it theme is out there, but as yet hasn’t been trialled on a departmental basis, I don’t think, even though we did try to respond to the commissioned accessibility audit. (Are Scribd docs accessible?) Digress.it is running on the JISCPress site though.)
(I’m suddenly fired up again by the thought that consultation docs could be so much more “of the web” as well as easier to engage with… Hmmm, when’s the closing date for these consultations? Maybe there is time for one last WriteToReply outing…?)
PS How did I miss out on subscribing to the Government Digital Service? e.g. Neil Williams on A vision for online consultation and policy engagement
So What’s Open Government Data Good For? Government and “Independent Advisers”, maybe?
Although I got an invite to today’s “Government Transparency: Opening Up Public Services” briefing, I didn’t manage to attend (though I’m rather wishing I had), but I did manage to keep up with what was happening through the #openuk hashtag commentary.
It all kicked off with the Prime Minister’s Letter to Cabinet Ministers on transparency and open data, which sets out the roadmap for government data releases over the coming months in the areas of health, education, criminal justice, transport and public spending; it also sets the scene for the forthcoming Open Public Services White Paper (see also the public complement to that letter: David Cameron’s article in The Telegraph on transparency).
The Telegraph article suggests there will be a “profound impact” in four areas:
- First, it will enable choice, particularly for patients and parents. …
- Second, it will raise standards. All the evidence shows that when you give professionals information about other people’s performance, there’s a race to the top as they learn from the best. …
- Third, this information is going to help us mend our economy. To begin with, it’s going to save money. Already, the information we have published on public spending has rooted out waste, stopped unnecessary duplication and put the brakes on ever-expanding executive salaries. Combine that with this new information on the performance of our public services, and there will be even more pressure to get real value for taxpayers’ money.
- But transparency can help with the other side of the economic equation too – boosting enterprise. Estimates suggest the economic value of government data could be as much as £6 billion a year. Why? Because the possibilities for new business opportunities are endless. Imagine the innovations that could be created – the apps that provide up-to-date travel information; the websites that compare local school performance. But releasing all this data won’t just support new start-ups – it will benefit established industries too.
David Cameron’s article in The Telegraph on transparency
All good stuff… all good rhetoric. But what does that actually mean? What are people actually going to be able to do differently, Melody?
As far as I can tell, the main business models for making money on the web are:
- sell the audience: the most obvious example of this is to sell adverts to the visitors of your site. The rate advertisers pay is dependent on the number of people who see the ads, and their specificity (different media attract different, possibly niche, audiences. If an audience is the one you’re particularly trying to target, you’re likely to pay more than you would for a general audience, in part because it means you don’t have to go out and find that focussed audience yourself.) Another example is to sell information about the users of your site (for example, banks selling shopping data).
- take a cut: so for example, take an affiliate fee, referral fee or booking fee for each transaction brokered through your site, or levy some other transaction cost.
Where data is involved, there is also the opportunity to analyse other people’s data and then sell analysis of that data back to the publishing organisations as consultancy. Or maybe use that data to commercial advantage in putting together tenders and approaches to public bodies?
When all’s said and done, though, the biggest potential is surely within government itself? By making data from one department or agency available, other departments or agencies will have easier access to it. Within departments and agencies too, open data has the potential to reduce friction and barriers to access, as well as opening up the very existence of data sets that may be being created in duplicate fashion across areas of government.
By consuming their own and each other’s open data, departments will also start to develop processes that improve the cleanliness and quality of data sets (for example, see Putting Public Open Data to Work…? and Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping; Library Location Data on data.gov.uk gives examples of how the same data can be released in several different (i.e. not immediately consistent) ways).
I’m more than familiar with the saying that “the most useful thing that can be done with your data will probably be done by someone else”, but if an organisation can’t find a way to make use of its own data, why should anyone else even try?! Especially if it means they have to go through the difficulty of cleaning the published data and preparing it for first use. By making use of open data as part of everyday government processes: a) we know the data’s good (hopefully!); b) cleanliness and inconsistency issues will be detected by the immediate publisher/user of the data; c) we know the data will have at least one user.
Finally, one other thing that concerns me is the extent to which “the public” want access to data in order to provide choice. As far as I can tell, choice is often the enemy of contentment; choice can sow the seeds of doubt and inner turmoil when to all intents and purposes there is no choice. I live on an island with a single hospital and not the most effective of rural transport systems. I’d guess the demographics of the island skew old and poor. So being able to “choose” a hospital with performance figures better than the local one for a given procedure is quite possibly no choice at all if I want visitors, or to be able to attend the hospital as an outpatient.
But that’s by the by: because the real issues are that the data that will be made available will in all likelihood be summary statistic data, which actually masks much of the information you’d need to make an informed decision; and if there is any meaningful intelligence in the data, or its summary statistics, you’ll need to know how to interpret the statistics, or even just read the pretty graphs, in order to take anything meaningful from them. And therein lies a public education issue…
Maybe then, there is a route to commercialisation of public facing public data? By telling people the data’s there for you to make the informed choice, the lack of knowledge about how to use that information effectively will open up (?!) a whole new sector of “independent advisers”: want to know how to choose a good school? Ask your local independent education adviser; they can pay for training on how to use the monolithic, more-stats-than-you-can-throw-a-distribution-at one-stop education data portal and charge you to help you decide which school is best for your child. Want comforting when you have to opt for treatment in a hospital that the league tables say is failing? Set up an appointment with your statistical counsellor, who can explain to you that actually things may not be so bad as you fear. And so on…
Open Data Processes – Taps, Query Paths/Audit Trails and Round Tripping
A few quick thoughts on open data processes and how we might start to put some of all this open public data to work, maybe via transparent data processes, not least in the institutions that are publishing it all…
Data Taps
The idea behind a data tap is simple – just tap off a view of the data as one institution provides it to another:

Tapping data is part of the motivation behind using FOI requests to identify standard reporting forms that may be used as part of a white box (open and transparent) data exchange process.
Query Paths/Audit Trails
Query paths describe a process in which it is possible to see how a particular data view or set of summary data was obtained from one or more data sources:

For an example use case, see So Where Do the Numbers in Government Reports Come From?.
Round Trips
Round tripping refers to the ability to regenerate a data source from a data report (as for example taking data out of an HTML table and popping it into a spreadsheet or database):

If common data fields are used across datasets, it may be possible to populate fields in one data “source” automatically from another:

Round tripping means that we can reuse data, once entered, to populate other reporting forms.
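By way of a rough sketch of the HTML table example, here’s how the round trip from a report back to spreadsheet-ready data might look using nothing but the Python standard library. The table contents, the TableExtractor class and the html_table_to_csv function are all made up for illustration, and real world report tables (nested tables, merged cells and so on) would need rather more care:

```python
import csv
import io
from html.parser import HTMLParser


class TableExtractor(HTMLParser):
    """Collect the cell text of each <tr> in an HTML fragment as a list of rows."""

    def __init__(self):
        super().__init__()
        self.rows = []     # completed rows
        self._row = None   # the row currently being read, if any
        self._cell = None  # text fragments of the current cell, if any

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._row = []
        elif tag in ("td", "th") and self._row is not None:
            self._cell = []

    def handle_data(self, data):
        if self._cell is not None:
            self._cell.append(data)

    def handle_endtag(self, tag):
        if tag in ("td", "th") and self._cell is not None:
            self._row.append("".join(self._cell).strip())
            self._cell = None
        elif tag == "tr" and self._row is not None:
            self.rows.append(self._row)
            self._row = None
            self._cell = None


def html_table_to_csv(html):
    """Turn the table rows found in an HTML string into CSV text."""
    parser = TableExtractor()
    parser.feed(html)
    parser.close()
    out = io.StringIO()
    csv.writer(out, lineterminator="\n").writerows(parser.rows)
    return out.getvalue()


# A made-up report table, purely for illustration
report = """
<table>
  <tr><th>Supplier</th><th>Amount</th></tr>
  <tr><td>Acme Ltd</td><td>1200.50</td></tr>
  <tr><td>Widgets, Inc</td><td>830.00</td></tr>
</table>
"""

csv_text = html_table_to_csv(report)
print(csv_text)
```

The CSV text that comes out can be loaded straight into a spreadsheet or database, which is the “regenerate the data source” half of the round trip.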
[See also: Open Data Handbook]
Getting Started With Local Council Spending Data
With more and more councils doing as they were told and opening up their spending data in the name of transparency, it’s maybe worth a quick review of how the data is currently being made available.
To start with, I’m going to consider the Isle of Wight Council’s data, which was opened up earlier this week. The first data release can be found (though not easily?!) as a pair of Excel spreadsheets, both of which are just over 1 MB large, at http://www.iwight.com/council/transparency/ (This URL reminds me that it might be time to review my post on “Top Level” URL Conventions in Local Council Open Data Websites!)
The data has also been released via Spikes Cavell at Spotlight on Spend: Isle of Wight.
The Spotlight on Spend site offers a hierarchical table based view of the data; value add comes from the ability to compare spend with national averages and that of other councils. Links are also provided to monthly datasets available as a CSV download.
Uploading these datasets to Google Fusion tables shows the following columns are included in the CSV files available from Spotlight on Spend (click through the image to see the data):
Note that the Expense Area column appears to be empty, and that “clumped” transaction dates appear to be in use? Also note that each row, column and cell is commentable upon…
The Excel spreadsheets on the Isle of Wight Council website are a little more complete – here’s the data in Google Fusion tables again (click through the image to see the data):
(It would maybe be worth comparing these columns with those identified as Mandatory or Desirable in the Local Spending Data Guidance? A comparison with the format the esd use for their Linked Data cross-council local spending data demo might also be interesting?)
Note that because the Excel files on the Isle of Wight Council website were larger than the 1MB size limit on XLS spreadsheet uploads to Google Fusion Tables, I had to open the spreadsheets in Excel and then export them as CSV documents. (Google Fusion Tables accepts CSV uploads for files up to 100MB.) So if you’re writing an open data sabotage manual, this may be something worth bearing in mind (i.e. publish data in very large Excel spreadsheets)!
It’s also worth noting that if different councils use similar column headings and CSV file formats, and include a column stating the name of the council, it should be trivial to upload all their data to a common Google Fusion Table allowing comparisons to be made across councils, contractors with similar names to be identified across councils, and so on… (i.e. Google Fusion tables would probably let you do as much as Spotlight on Spend, though in a rather clunkier interface… but then again, I think there is a fusion table API…?;-)
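As a minimal sketch of that idea, assuming each council publishes a CSV file with a header row and broadly similar column headings, something like the following could stitch the files into a single table with an added Council column before uploading it anywhere. The column names, the sample rows and the combine_spending_csvs function are invented for illustration; they aren’t taken from any actual council release:

```python
import csv
import io


def combine_spending_csvs(csv_by_council):
    """Merge CSV texts keyed by council name into one CSV with a Council column."""
    rows = []
    fieldnames = ["Council"]
    for council, text in csv_by_council.items():
        for row in csv.DictReader(io.StringIO(text)):
            tagged = {"Council": council, **row}
            # Keep the union of column headings, in order of first appearance
            for name in tagged:
                if name not in fieldnames:
                    fieldnames.append(name)
            rows.append(tagged)
    out = io.StringIO()
    # restval fills in columns that a particular council did not publish
    writer = csv.DictWriter(out, fieldnames=fieldnames, restval="",
                            lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return out.getvalue()


# Hypothetical extracts from two councils' spending files
sources = {
    "Isle of Wight": "Supplier,Amount\nAcme Ltd,1200.50\n",
    "Example Council": "Supplier,Amount,Date\nAcme Ltd,99.00,2010-11-01\n",
}

combined = combine_spending_csvs(sources)
print(combined)
```

Where councils use different headings for the same thing, the columns would need renaming to a common set first; the sketch only copes with columns that are missing altogether.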
Although the data hasn’t appeared there yet, I’m sure it won’t be long before it’s made available on OpenlyLocal:
However, the Isle of Wight’s hyperlocal news site, Ventnorblog teamed up with a local developer to revise Adrian Short’s Armchair Auditor code and released the OnTheWIght Armchair Auditor site:
So that’s a round up of where the data is, and how it’s presented. If I get a chance, the next step is to:
- compare the offerings with each other in more detail, e.g. the columns each view provides;
- compare the offerings with the guidance on release of council spending data;
- see what interesting Google Fusion table views we can come up with as “top level” reports on the Isle of Wight data;
- explore the extent to which Google Fusion Tables can be used to aggregate and compare data from across different councils.
PS related – Nodalities blog: Linked Spending Data – How and Why Bother Pt2
PPS for a list of local councils and the data they have released, see Guardian datastore: Local council spending over £500, OpenlyLocal Council Spending Dashboard
A Few More Thoughts on GetTheData.org
As we come up to a week in on GetTheData.org, there’s already an interesting collection of questions – and answers – starting to appear on the site, along with a fledgling community (thanks for chipping in, folks:-), so how can we maintain – and hopefully grow – interest in the site?
A couple of things strike me as the most likely things to make the site attractive to folk:
- the ability to find an appropriate – and useful – answer to your question without having to ask it, for example because someone has already asked the same, or a similar, question;
- timely responses to questions once asked (which leads to a sense of community, as well as utility).
I think it’s also worth bearing in mind the context that GetTheData sits in. Many of the questions result in answers that point to data resources that are listed in other directories. (The links may go to either the data home page or its directory page on a data directory site.)
Data Recommendations
One thing I think is worth exploring is the extent to which GetTheData can both receive and offer recommendations to other websites. Within a couple of days of releasing the site, Rufus had added a recommendation widget that could recommend datasets hosted on CKAN that seem to be related to a particular question.
What this means is that even before you get a reply, a recommendation might be made to you of a dataset that meets your requirements.
(As with many other Q&A sites, GetTheData also tries to suggest related questions to you when you enter your question, to prompt you to consider whether or not your question has already been asked – and answered.)
I think the recommendation context is something we might be able to explore further, both in terms of linking to recommendations of related data on other websites, but also in the sense of reverse links from GetTheData to those sites.
For example:
- would it be possible to have a recommendation widget on GetTheData that links to related datasets from the Guardian datastore, or National Statistics?
- are there other data directory sites that can take one or more search terms and return a list of related datasets?
- could a getTheData widget be located on CKAN data package pages to alert package owners/maintainers that a question possibly related to the dataset had been posted on GetTheData? This might encourage the data package maintainer to answer the question on the getTheData site with a link back to the CKAN data package page.
As well as recommendations, would it be useful for GetTheData to syndicate new questions asked on the site? For example, I wonder if the Guardian Datastore blog would be willing to add the new questions feed to the other datablogs they syndicate?;-) (Disclosure: data tagged posts from OUseful.info get syndicated in that way.)
Although I don’t have any good examples of this to hand from GetTheData, it strikes me that we might start to see questions that relate to obtaining data which is actually a view over a particular data set. This view might be best obtained via a particular query onto a particular data set, such as a specific SPARQL query on a Linked Data set, or a particular Google query language request to the visualisation API against a particular Google spreadsheet.
If we do start to see such queries, then it would be useful to aggregate these around the datastores they relate to, though I’m not sure how we could best do this at the moment other than by tagging?
News announcements
There are a wide variety of sites publishing data independently, and a fractured network of data directories and data catalogues. Would it make sense for GetTheData to aggregate news announcements relating to the release of new data sets, and somehow use these to provide additional recommendations around data sets?
Hackdays and Data Fridays
As suggested in Bootstrapping GetTheData.org for All Your Public Open Data Questions and Answers:
If you’re running a hackday, why not use GetTheData.org to post questions arising in the scoping of the hacks, tweet a link to the question to your event backchannel and give the remote participants a chance to contribute back, at the same time adding to the online legacy of your event.
Alternatively, how about “Data Fridays”, on the first Friday in the month, where folk agree to check GetTheData two or three times that day and engage in something of a distributed data related Question and Answer sprint, helping answer unanswered questions, and maybe pitching in a few new ones?
Aggregated Search
It would be easy enough to put together a Google custom search engine that searches over the domains of data aggregation sites, and possibly also offer filetype search limits?
So What Next?
Err, that’s it for now…;-) Unless you fancy seeing if there’s a question you can help out on right now at GetTheData.org
Open Data Sceptic(?!)
Answers appreciated in the comments below…;-)
PS a similar question comes to mind with OERs…;-)
PPS In a rare blog post(?!), @ambrouk reminds me of a recent post by Tom Steinberg about Open Data: How Not To Cock It Up. I must have been having a bad day, yesterday, and stand corrected… (or maybe I was wound up by one too many other tweets or blog posts about yet another open data launch making all sorts of vacuous promises and calls to the public for action around this data set that would obviously benefit them…. err…?!;-)
Government Spending Data Explorer
So… the UK Gov started publishing spending data for at least those transactions over £25,000. Lots and lots of data. So what? My take on it was to find a quick and dirty way to cobble a query interface around the data, so here’s what I spent an hour or so doing in the early hours of last night, and a couple of hours this morning… tinkering with a Gov spending data spreadsheet explorer:
The app is a minor reworking of my Guardian datastore explorer, which put something of a query front end onto the Guardian Datastore’s Google spreadsheets. Once again, I’m exploiting the work of Simon Rogers and co. at the Guardian Datablog, and reusing the departmental spreadsheets they posted last night. I bookmarked the spreadsheets to delicious (here) and use that feed to populate a spreadsheet selector:
When you select a spreadsheet, you can preview the column headings:
Now you can write queries on that spreadsheet as if it was a database. So for example, here are Department for Education spends over a hundred million:
The query is built up in part by selecting items from lists of options – though you can also enter values directly into the appropriate text boxes:
You can bookmark and share queries in the datastore explorer (for example, Education spend over 100 million), and also get URLs that point directly to CSV and HTML versions of the data via Google Spreadsheets.
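For anyone wanting to build that sort of URL by hand, here’s a minimal sketch of one way of doing it. It assumes the Google Visualization API query endpoint as it’s currently exposed for Google Sheets (the /gviz/tq path with tq and tqx parameters), which may not be the same URL pattern the explorer itself generates; the spreadsheet key, the column letters and the gviz_query_url function are all placeholders of my own:

```python
from urllib.parse import urlencode


def gviz_query_url(spreadsheet_id, query, out="csv"):
    """Build a query URL for a Google spreadsheet that returns CSV or HTML."""
    base = f"https://docs.google.com/spreadsheets/d/{spreadsheet_id}/gviz/tq"
    # tq carries the query itself; tqx=out:... selects the response format
    return base + "?" + urlencode({"tqx": f"out:{out}", "tq": query})


# Placeholder key and columns: A is assumed to hold a label, D an amount
url = gviz_query_url(
    "SPREADSHEET_ID",
    "select A, D where D > 100000000 order by D desc",
)
print(url)
```

Swapping out="csv" for out="html" asks for the HTML version of the same view, so a single query can be shared in either form.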
Several other example queries are given at the bottom of the data explorer page.
For certain queries (e.g. two column ones with a label column and an amount column), you can generate charts – such as Education spends over 250 million:
Here’s how we construct the query:
If you do use this app, and find some interesting queries, please bookmark them and tag them with wdmmg-gde10, or post a link in a comment below, along with a description of what the query is and why it’s interesting. I’ll try to add interesting examples to the app’s list of example queries.
Notes: the datastore explorer is an example of a single web page application, though it draws on several other external services – delicious for the list of spreadsheets, Google spreadsheets for the database and query engine, Google charts for the charts and styled tabular display. The code is really horrible (it evolved as a series of bug fixes on bug fixes;-), but if anyone would like to run with the idea, start coding afresh maybe, and perhaps make a production version of the app, I have a few ideas I could share;-)
Academic Library Usage Data as Reported to SCONUL, via FOI, And a Thought About Whitebox Data Reporting
Something I’ve been meaning to do for ages, but only just got round to starting to do, is to send up trial balloon FOI requests around the data that one public organisation might release to other organisations as part of a formal or templated reporting procedure.
So here’s the first one – an FOI request to the University of Bath Library for a copy of the data they returned to SCONUL for the period 2008/2009 made via MySociety’s WhatDoTheyKnow service – and here’s the response, along with a copy of the return.
(In general, I wonder if it would be more useful to ask for a copy of the document if possible, in the document format it was submitted (for example, a Microsoft Word document, if that was the document type submitted).)
The information reported to SCONUL is not available from SCONUL for free, although aggregated data from across the UK HE sector is available via a paid for report. A copy of the current questionnaire used to collect the data is available.
It seems to me that what requests of this sort do is demonstrate a precedent regarding the release of data that is produced as part of a formal or standardised reporting process that can be used to encourage (oblige?) other institutions in the same sector to make the information available in the same way?
So here’s what I have in mind: a site that collects and collates information about standard reports that are used to transport information between public sector organisations (including copies of the forms used to collate that data), including but not limited to the information/data that public institutions are obliged to return to government or overseeing agencies.
For example, this DCLG list of the minimum data central Government needs from local authorities is a good start – is there an equivalent for universities (come on, BIS…;-)? [Ah, maybe this is a place to start, at least as far as HESA goes: HEFCE Report 2008: Making your data work for you - Data quality and efficiency in higher education. I imagine there is also a considerable data burden arising from REF reporting?]
As and when reports are demonstrated to be FOIable, their contents also become candidates for open data release. One aim here is to start making data chains visible to the organisations that are producing the data (internal transparency) so that the organisation can become more aware of its own data resources and how they might be used elsewhere within the organisation. (Transparency within the organisation may also lead to a reduction in duplication of effort creating or collating the same data at several different locations within the same organisation?)
The claim I guess I’m making towards this approach to opening up data may be summarised as follows: data that is produced as part of formal reporting and that is FOIable should be made public as a matter of course. As a consequence, there should be little extra effort required to open up the data. Indeed, it may be possible to submit the reports via an open and transparent whitebox reporting process.
[See also: Putting Public Open Data to Work…?]
PS for what it’s worth, I think the SCONUL data application provides another example of a situation where it might be useful to have a WhatDoTheyKnow service that allows you to make the same (bulk) request to every institution in a particular sector (such as universities, or local councils). I can see there may need to be controls around such a service to prevent abuse, but
PPS I wonder, do MySociety license WhatDoTheyKnow to any public institutions to help them manage their FOI process?
PPPS Here’s a related comment I posted to the Public Data Corporation engagement exercise:
Written by Tony Hirst
March 18, 2011 at 10:29 am
Posted in CommentedElsewhere, Data, Policy
Tagged with FOI, opendata, Public Data Corporation, public open data