Posts Tagged ‘opendata’
A couple of weeks ago, I gave a presentation to the WebScience students at the University of Southampton on the topic of open data, using it as an opportunity to rehearse a view of open data based on the premise that it starts out closed. In much the same way that Darwin’s Theory of Evolution by Natural Selection is based on a major presupposition, specifically a theory of inheritance and the existence of processes that support reproduction with minor variation, so too does much of our thinking about open data derive from the presupposed fact that many of the freedoms we associate with the use of open data in legal terms arise from license conditions that the “owner” of the data awards to us.
Viewing data in this light, we might start by considering what constitutes “closed” data and how it comes to be so, before identifying the means by which freedoms are granted and the data is opened up. (Sometimes it can also be easier to consider what you can’t do than what you can, especially when answers to questions such as “so what can you actually do with open data?” attract the (rather meaningless) response: “anything”. We can then contrast what you can do in terms of freedom complementary to what you can’t…)
So how can data be “closed”?
One lens I particularly like for considering the constraints placed on actions and actors, particularly in the digital world (although we can apply the model elsewhere), is one I first saw described by Lawrence Lessig in Code and Other Laws of Cyberspace: What Things Regulate: A Dot’s Life.
Here’s the dot and the forces that constrain its behaviour:
So we see, for example, the force of law, social norms, the market (that is, economic forces) and architecture, that is the “digital physical” way the world is implemented. (Architecture may of course be designed in order to enforce particular laws, but it is likely that other “natural laws” will arise as a result of any particular architecture or system implementation.)
Without too much thought, we might identify some constraints around data and its use under each of these separate lenses. For example:
- Law: copyright and database right grant the creator of a dataset certain protective rights over that data; data protection laws (and other “privacy laws”) limit access to, or disclosure of, data that contains personal information, as well as restricting the use of that data to purposes disclosed at the time it was collected. The UK Data Protection Act also underwrites the right of individuals to claim additional limits on data use, for example the rights “to object to processing that is likely to cause or is causing damage or distress; to prevent processing for direct marketing; to object to decisions being taken by automated means” (ICO Guide to the DPA, Principle 6 – The rights of individuals).
- Norms: social mores, behaviour and taboos limit the ways in which we might use data, even if that use is not constrained by legal, economic or technical concerns. For example, applications that invite people to “burgle my house” based on analysing social network data to discover when they are likely to be away from home and what sorts of valuable product might be on the premises are generally not welcomed. Norms of behaviour and everyday work practice also mean that much data is not published when there are no real reasons why it couldn’t be.
- Market: in the simplest case, charging for access to data places a constraint on who can gain access to the data even in advance of trying to make use of it. If we extend “market” to cover other financial constraints, there may be a cost associated with preparing data so that it can be openly released.
- Architecture: technical constraints can restrict what you can do with data. Digital rights management (DRM) uses encryption to render data streams unusable to all but the intended client, but more prosaically, document formats such as PDF, or the “release” of data charts as flat image files, make it difficult for the end user to manipulate as data any data resources contained in those documents.
Laws can also be used to grant freedoms where freedoms are otherwise restricted. For example:
- the Freedom of Information Act (FOI) provides a mechanism for requesting copies of datasets from public bodies; in addition, the Environmental Information Regulations “provide public access to environmental information held by public authorities”.
- the laws around copyright relax certain copyright constraints for the purposes of criticism and review, reporting, research, teaching (IPO – Permitted uses of copyright works);
- in the UK, the Data Protection Act provides for “a right of access to a copy of the information comprised in their personal data” (ICO Guide to the DPA, Principle 6).
- in the UK, the Data Protection Act regulates what can be done legitimately with “personal” data. However, other pieces of legislation relax confidentiality requirements when it comes to sharing data for research purposes. For example:
- the NHS Act s. 251 Control of patient information; for example, the Secretary of State for Health may “make regulations to set aside the common law duty of confidentiality for medical purposes where it is not possible to use anonymised information and where seeking individual consent is not practicable” (discussion). Note that there are changes afoot regarding s. 251…
- The Secretary of State for Education has specific powers to share pupil data from the National Pupil database (NPD) “with named bodies and third parties who require access to the data to undertake research into the educational achievements of pupils”. The NPD “tracks a pupil’s progress through schools and colleges in the state sector, using pupil census and exam information. Individual pupil level attainment data is also included (where available) for pupils in non-maintained and independent schools” (access arrangements).
- the Enterprise and Regulatory Reform Bill currently making its way through Parliament legislates around the Supply of Customer Data (the “#midata” clauses) which is intended to open up access to customer transaction data from suppliers of energy, financial services and mobile phones “(a) to a customer, at the customer’s request; (b) to a person who is authorised by a customer to receive the data, at the customer’s request or, if the regulations so provide, at the authorised person’s request.” Although proclaimed as a way of opening up individual rights to access this data, the effect will more likely see third parties enticing individuals to authorise the release to the third party of the individual first party’s personal transaction data held by a second party (for example, #Midata Is Intended to Benefit Whom, Exactly?). (So you’ll presumably legally be able to grant Facebook access to your mobile phone records… Or Facebook will find a way of getting you to release that data to them without you realising you granted them that permission;-)
Contracts (which I guess fall somewhere between norms and laws from the dot’s perspective (I need to read that section of Lessig’s book again!)) can also be used by rights holders to grant freedoms over the data they hold the rights for. For example, the Creative Commons licensing framework provides a copyright holder with a set of tools for relaxing some of the rights afforded to them by copyright when they license the work accordingly.
Note that “I am not a lawyer”, so my understanding of all this is pretty hazy;-) I also wonder how the various pieces of legislation interact, and whether there are cracks and possible inconsistencies between them? If there are pieces of legislation around the regulation and use of data that I’m missing, please post links in the comments below, and I’ll try and do a more thorough round up in a follow on post.
I’m doing a couple of talks to undergrad and postgrad students next week – on data journalism at the University of Lincoln, and on open data at the University of Southampton – so I thought I’d do a quick round up of recently advertised data related jobs that I could reuse for an employability slide…
So, here are some of the things I’ve noticed recently:
- The Technology Strategy Board, funders of many a data related activity (including the data vouchers for SMEs), are advertising for a Lead Technologist – Data Economy (£45,000 to £55,000):
The UK is increasingly reliant on its service economy, and on the ability to manage its physical economy effectively, and it exports these capabilities around the world. Both aspects of this are heavily dependent on the availability of appropriate information at the right place and time, which in turn depends on our ability to access and manipulate diverse sources of data within a commercial environment.
The internet and mobile communications and the ready availability of computing power can allow the creation of a new, data-rich economy, but there are technical, human and business challenges still to be overcome. With its rich data resources, inventive capacity and supportive policy landscape, the UK is well placed to be the centre of this innovation.
Working within the Digital team, to develop and implement strategies for TSB’s interventions in and around the relevant sectors.
This role requires the knowledge and expertise to develop priorities for how the UK should address this opportunity, as well as the interpersonal skills to introduce the relevant communities of practice to appropriate technological solutions. It also requires a knowledge of how innovation works within businesses in this space, to allow the design and targeting of TSB’s activities to effectively facilitate change.
Accessible tools include, but are not restricted to, networking and community building, grant-funding of projects at a wide range of scales, directing support services to businesses, work through centres such as the Open Data Institute and Connected Digital Economy Catapult, targeted procurement through projects such as LinkedGov, and inputs to policy. The role requires drawing upon this toolkit to design a coordinated programme of interventions that has impact in its own right and which also coordinates with other activities across TSB and the wider innovation landscape.
- Via the ECJ, a relayed message from the NICAR-L mailing list about a couple of jobs going with The Times and Sunday Times:
A couple of jobs that might be of interest to NICAR members here at the Times of London…
The first is an investigative data journalist role, joining the new data journalism unit which will work across both The Times and The Sunday Times.
The other is an editorial developer role: this will sit within the News Development Team and will focus on anything from working out how we tell stories in richer more immersive ways, to creating new ways of presenting Times and Sunday Times journalism to new audiences.
Please get in touch if you are interested!
Head of news development, The Times and Sunday Times
Not a job ad as such, but an interesting recent innovation from the BirminghamMail:
We’ve launched a new initiative looking at the numbers behind our city and the stories in it.
‘Behind The Numbers’ is all about the explosion in ‘data’: information about our hospitals and schools, crime and the way it is policed, business and sport, arts and culture.
We’d like you to tell us what data you’d like us to publish and dig into. Email suggestions to firstname.lastname@example.org. Follow @bhamdatablog on Twitter for updates or to share ideas.
This was also new to me: FT Data, a stats/datablog from the FT? FullFact is another recent addition to my feed list, with a couple of interesting stories each day and plenty of process questions and methodological tricks that can be, erm, appropriated ;-) Via @JackieCarter, the Social Statistics blog looked interesting, but the partial RSS feed is a real turn off for me so I’ll probably drop it from my reader pretty quickly unless it turns up some *really* interesting posts.
Here are some examples of previously advertised jobs…
- A job that was being advertised at the end of last year (now closed) by the Office for National Statistics (ONS) (current vacancies) was for the impressive sounding Head of Rich Content Development:
The postholder is responsible for inspiring and leading development of innovative rich content outputs for the ONS website and other channels, which anticipate and meet user needs and expectations, including those of the Citizen User. The role holder has an important part to play in helping ONS to realise its vision “for official statistics to achieve greater impact on key decisions affecting the UK and to encourage broader use across the country”.
1. Inspires, builds, leads and develops a multi-disciplinary team of designers, developers, data analysts and communications experts to produce innovative new outputs for the ONS website and other channels.
2. Keeps abreast of emerging trends and identifies new opportunities for the use of rich web content with ONS outputs.
3. Identifies new opportunities, proposes new directions and developments and gains buy in and commitment to these from Senior Executives and colleagues in other ONS business areas.
4. Works closely with business areas to identify, assess and commission new rich-content projects.
5. Provides vision, guidance and editorial approval for new projects based on a continual understanding of user needs and expectations.
6. Develops and manages an ongoing portfolio of innovative content, maximising impact and value for money.
7. Builds effective partnerships with media to increase outreach and engagement with ONS content.
8. Establishes best practice in creation of rich content for the web and other channels, and works to improve practice and capability throughout ONS.
- From December 2010, a short term contract at the BBC for a data journalist:
The team is looking for a creative, tech-savvy data journalist (computer-assisted reporter) to join its website specials team to work with our online journalists, graphic designer and development teams.
Role Purpose and Aims
You will be required to humanize statistics; to make sense of potentially complicated data and present it in a user friendly format.
You will be asked to focus on a range of data-rich subjects relating to long-term projects or high impact daily new stories, in line with Global News editorial priorities. These could include the following: reports on development, global poverty, Afghanistan casualties, internet connectivity around the world, or global recession figures.
Key Knowledge and Experience
You will be a self-starter, brimming with story ideas who is comfortable with statistics and has the expertise to delve beneath the headline figures and explain the fuller picture.
You will have significant journalistic experience gained ideally from working in an international news environment.
The successful candidate should have experience (or at least awareness) of visualising data and visualisation tools.
You should be excited about developing the way that data is interpreted and presented on the web, from heavy number crunching, to dynamic mapping and interactive graphics. You must have demonstrated knowledge of statistics, statistical analysis, with a good understanding of the range and breadth of data sources in the UK and internationally, broad experience with data sources, data mining and have good visual and statistical skills.
You must have a Computer-assisted reporting background or similar, including a good knowledge of the relevant software (including Excel and mapping software).
Experience of producing and developing data driven web content at a senior level within time and budget constraints.
Central to the role is an ability to analyse complicated information and present it to our readers in a way that is visually engaging and easy to understand, using a range of web-based technologies, for which you should have familiarity with database interfaces and web presentation layers, as well as database concepting, content entry and management.
You will be expected to have your own original ideas on how to best apply data driven journalism, either to complement stories when appropriate or to identify potential original stories while interpreting data, researching and investigating them, crunching the data yourself and working with designers and developers on creating content that will engage our audience, and provide them with useful, personalised information.
FWIW, it’s probably worth remembering that the use of data is not necessarily a new thing.. for example, this post – The myth of the missing Data Scientist – does a good job debunking some of the myths around “data science”.
Following the official opening of the Open Data Institute (ODI) last week, a flurry of data related announcements this week:
- A big one for stats fans with the release of 2011 Census data by the ONS: 2011 Census, Key Statistics for Local Authorities in England and Wales. A few charts appear to have made it into the mix (along with the data to generate them), which I guess sets the baseline for whoever lands the currently advertised Head of Rich Content at the ONS job…
The data files associated with press releases are published as Excel spreadsheets. I guess this reflects, in part, the need to come up with a container that can cope with all the metadata. It’s a bit of a pain, though. One thing I keep meaning to explore further are ways of bundling data in R packages, along with scripts for analysing and visualising the data so bundled (eg US Census Spatial and Demographic Data in R: The UScensus2000 Suite of Packages or US consumer expenditure survey (ce) in R). I probably should also look again at Google’s Dataset Publication Language (DSPL) as well as other packaging formats. I need to check out the latest major release from the W3C Provenance Working Group too…
- Over at BIS, £8 million of investment in open public data is announced, the major chunk of which goes to the Data Strategy Board (#datastrategy) Breakthrough Fund to help public bodies get over short term technical barriers to releasing open public data. I keep wittering on about mapping out data flows that already exist and then finding ways to tap into them directly, so won’t repeat that here;-) A smaller pot, administered by the ODI, will be available to SMEs via the Open Data Immersion Programme. Also announced, the Ordnance Survey will be widening the availability of its range of mapping data.
- Not sure if I missed this when it was presumably announced? The Data Strategy Board’s chair Stephan Shakespeare (CEO of YouGov Plc) is leading an independent review of public sector information (here are the (draft) terms of reference). I’m not sure how this review fits into the tangle of reporting lines associated with the Data Strategy Board and the Public Data Group (the latter seems to have been very quiet?). I also wonder where the ODI fits into that whole structure?
- The funding around public open data coincided with a written Ministerial statement from the Cabinet Office that provided an Update on Departmental Open Data Commitments and adherence to Public Data Principles (original link on a gov.uk domain, h/t @owenboswarva). The update is spectacularly lacking in linking to any of the raw data that is summarised in the actual statement, so, so much for any actual transparency there… The same minister, Francis Maude, has also been fulfilling his social media obligations with a piece in the Huffington Post on A Practical Vision for Open Government. (In other news, at the micro/pragmatic level of open public data, I’m still finding that week on week releases of NHS sitrep data show minor differences in formatting and occasional errors…)
Things have been moving on the Communications Data front too. Communications Data got a look in as part of the 2011/2012 Intelligence and Security Committee Annual Report with a review of what’s currently possible and “why change may be necessary”. Apparently:
118. The changes in the telecommunications industry, and the methods being used by people to communicate, have resulted in the erosion of the ability of the police and Agencies to access the information they require to conduct their investigations. Historically, prior to the introduction of mobile telephones, the police and Agencies could access (via CSPs, when appropriately authorised) the communications data they required, which was carried exclusively across the fixed-line telephone network. With the move to mobile and now internet-based telephony, this access has declined: the Home Office has estimated that, at present, the police and Agencies can access only 75% of the communications data that they would wish, and it is predicted that this will significantly decline over the next few years if no action is taken. Clearly, this is of concern to the police and intelligence and security Agencies as it could significantly impact their ability to investigate the most serious of criminal offences.
N. The transition to internet-based communication, and the emergence of social networking and instant messaging, have transformed the way people communicate. The current legislative framework – which already allows the police and intelligence and security Agencies to access this material under tightly defined circumstances – does not cover these new forms of communication. [original emphasis]
Elsewhere in Parliament, the Joint Select Committee Report on the Draft Communications Data Bill was published and took a critical tone (Home Secretary should not be given carte blanche to order retention of any type of data under draft communications data bill, says joint committee. “There needs to be some substantial re-writing of the Bill before it is brought before Parliament” adds Lord Blencathra, Chair of the Joint Committee.) Friend and colleague Ray Corrigan links to some of the press reviews of the report here: Joint Committee declare CDB unworkable.
In other news, Prime Minister David Cameron’s announcement of DNA tests to revolutionise fight against cancer and help 100,000 patients was reported via a technology angle – Everybody’s DNA could be on genetic map in ‘very near future’ [Daily Telegraph] – as well as by means of more reactionary headlines: Plans for NHS database of patients’ DNA angers privacy campaigners [Guardian], Privacy fears over DNA database for up to 100,000 patients [Daily Telegraph].
If DNA is your thing, don’t forget that the Home Office already operates a National DNA Database for law enforcement purposes.
And if national databases are your thing, there’s always the National Pupil Database, which was in the news recently with the launch of a consultation on proposed amendments to individual pupil information prescribed persons regulations which seeks to “maximise the value of this rich dataset” by widening access to this data. (Again, Ray provides some context and commentary: Mr Gove touting access to National Pupil Database.)
PS A late inclusion: DECC announcement around smart meter rollout with some potential links to #midata strategy (eg “suppliers will not be able to use energy consumption data for marketing purposes unless they have explicit consent”). A whole raft of consultations were held around smart metering and Government responses are also published today, including Government Response on Data Access and Privacy Framework, the Smart Metering Privacy Impact Assessment and a report on public attitudes research around smart metering. I also spotted an earlier consultation that had passed me by around the Data and Communications Company (DCC) License Conditions; here’s the response, which opens with: “The communications and data transfer and management required to support smart metering is to be organised by a new central communications body – the Data and Communications Company (“the DCC”). The DCC will be a new licensed entity regulated by the Gas and Electricity Markets Authority (otherwise referred to as “the Authority”, or “Ofgem”). A single organisation will be granted a licence under each of the Electricity and Gas Acts (there will be two licences in a single document, referred to as the “DCC Licence”) to provide these services within the domestic sector throughout Great Britain”. Another one to put on the reading pile…
Putting a big brother watch hat on, the notion of “meter surveillance” brings to mind a BBC article about an upcoming (will hopefully thence be persistently available on iPlayer?) radio programme on “Electric Network Frequency (ENF) analysis”, The hum that helps to fight crime. According to Wikipedia, ENF is a forensic science technique for validating audio recordings by comparing frequency changes in background mains hum in the recording with long-term high-precision historical records of mains frequency changes from a database. In turn, this reminds me of appliance signature detection (identifying what appliance is switched on or off from its electrical load curve signature), for example Leveraging smart meter data to recognize home appliances. In the context of audio surveillance, how about supplementing surveillance video cameras with microphones? Public Buses Across Country [US] Quietly Adding Microphones to Record Passenger Conversations.
Over the last year or two, I’ve given a handful of talks to postgrad and undergrad students broadly on the topic of “technology for data driven journalism”. The presentations are typically uncompromising, which is to say I assume a lot. There are many risks in taking such an approach, of course, as waves of confusion spread out across the room… But it is, in part, a deliberate strategy intended to shock people into an awareness of some of the things that are possible with tools that are freely available for use in the desktop and browser based sheds of today’s digital tinkerers… Having delivered one such presentation yesterday, at UCA, Farnham, here are some reflections on the whole topic of “#ddj”. Needless to say, they do not necessarily reflect even my opinions, let alone those of anybody else;-)
The data-driven journalism thing is being made up as we go along. There is a fine tradition of computer assisted journalism, database journalism, and so on, but the notion of “data driven journalism” appears to have rather more popular appeal. Before attempting a definition, what are some of the things we associate with ddj that might explain the recent upsurge of interest around it?
- access to data: this must surely be a part of it. In one version of the story we might tell, the arrival of Google Maps and the reverse engineering of an API to it by Paul Rademacher for his April 2005 “Housing Maps mashup” opened up people’s eyes to the possibility of map-based mashups; a short while later, in May 2005, Adrian Holovaty’s Chicago Crime Map showed how the same mashup idea could be used as an example of “live”, automated and geographically contextualised reporting of crime data. Mashups were all about appropriating web technologies and web content, building new “stuff” from pre-existing “stuff” that was already out there. And as an idea, mashups became all the rage way back then, offering as they did the potential for appropriating, combining and re-presenting elements of different web applications and publications without the need for (further) programming.
In March 2006, a year or so after the first demonstration of the Housing Maps mashup, and in part as a response to the difficulty in getting hold of latitude and longitude data for UK based locations that was required to build Google maps mashups around British locations, the Guardian Technology supplement (remember that? It had Kakuro puzzles and everything?!;-) launched the “Free Our Data” campaign (history). This campaign called for the free release of data collected at public expense, such as the data that gave the latitude and longitude for UK postcodes.
The early promise of, and popular interest in, “mashups” waxed, and then waned; but there was a new tide rising in the information system that is the web: access to data. The mashups had shown the way forward in terms of some of the things you could do if you could wire different applications together, but despite the promise of no programming it was still too techie, too geeky, too damned hard and fiddly for most people; and despite what the geeks said, it was still programming, and there often still was coding involved. So the focus changed. Awareness grew about the sorts of “mashup” that were possible, so now you could ask a developer to build you “something like that”, as you pointed to an appropriate example. The stumbling block now was access to the data to power an app that looked like that, but did the same thing for this.
For some reason, the notion of “open” public data hit a policy nerve, and in the UK, as elsewhere, started to receive cross-party support. (A brief history of open public data in a UK context is illustrated in the first part of Open Standards and Open Data.) The data started to flow, or at least, started to become both published (through mandated transparency initiatives, such as the release of public accounting data) and requestable (for example, via an extension to FOI by the Protection of Freedoms Act 2012).
We’ve now got access in principle and in practice to increasing amounts of data, we’ve seen some of the ways in which it can be displayed and, to a certain extent, started to explore some of the ways in which we can use it as a source for news stories. So the time is right in data terms for data driven journalism, right?
- access to visualisation technologies: it wasn’t very long ago when it was still really hard to display data on screen using anything other than canned chart types – pie charts, line charts, bar charts (that is, the charts you were introduced to in primary school. How many chart types have you learned to read, or create, since then?). Spreadsheets offer a range of grab-and-display chart generating wizards, of course, but they’re not ideal when working with large datasets, and they’re typically geared for generating charts for reports, rather than being used analytically. The visual analysis mantra – Overview first, zoom and filter, then details-on-demand – (coined in Ben Shneiderman’s 1997 article A Grander Goal: A Thousand-Fold Increase in Human Capabilities, I think?) arguably requires fast computers and big screens to achieve the levels of responsiveness that are required for interactive usage, and we have those now…
There are, however, still some considerable barriers to access:
- access to clean data: you might think I’m repeating myself here, but access to data and access to clean data are two separate considerations. A lot of the data that’s out there and published is still not directly usable (you can’t just load it into a spreadsheet and work on it directly); things that are supposed to match often don’t (we might know that Open Uni, OU and Open University refer to the same thing, but why should a spreadsheet?); number columns often contain things that aren’t numbers (such as commas or other punctuation); dates are provided in a wide variety of formats that we can recognise as such, but a computer can’t – at least, not unless we give it a bit of help; data gets misplaced across columns; character encodings used by different applications and operating systems don’t play nicely; typos proliferate; and so on. So whose job is it to clean the data before it can be inspected or analysed?
- access to skills and workflows: engineering practice tends to have a separation between the notion of “engineer” and “technician”. Over-generalising and trivialising matters somewhat, engineers have academic training, and typically come at problems from a theory dominated direction; technicians (or technical engineers) have the practical skills that can be used to enact the solutions produced by the engineers. (Of course, technicians can often suggest additional, or alternative, solutions, in part reflecting a better, or more immediate, knowledge about the practical considerations involved in taking one course of action compared to another.) At the moment, the demarcation of roles (and skills required at each step of the way) in a workflow based around data discovery, preparation, analysis and reporting is still confused.
- What questions should we ask? If you think of data as a source, with a story to tell: how do you set about finding that source? Why do you even think you want to talk to that source? What sorts of questions should you ask that source, and what sorts of answer might you reasonably expect it to provide you with? How can you tell if that source is misleading you, lying to you, hiding something from you, or is just plain wrong? To what extent do you or should you trust a data source? Remember, every cell in a spreadsheet is a fact. If you have a spreadsheet containing a million data cells, that’s a lot of fact checking to do…
- low or misplaced expectations: we don’t necessarily expect Journalism students to know how to drive a spreadsheet let alone run or apply complex statistics, or even have a great grasp on “the application of number”; but should they? I’m not totally convinced we need to get them up to speed with yesterday’s tools and techniques… As a tool builder/tool user, I keep looking for tools and ways of using tools that may be thought of as emerging “professional” tools for people who work with data on a day-to-day basis, but wouldn’t class themselves as data scientists, or data researchers; tools for technicians, maybe. When presenting tools to students, I try showing the tools that are likely to be found on a technician’s workbench. As such, they may look a little bit more technical than tools developed for home use (compare a socket set from a trade supplier with a £3.50 tool-roll bargain offer from your local garage), but that’s because they’re quality tools that are fit for purpose. And as such, it may take a bit of care, training and effort to learn how to use them. But I thought the point was to expose students to “industry-strength” ideas and applications? And in an area where tools are developing quite quickly, students are exactly the sort of people we need to start engaging with them: 1) at the level of raising awareness about what these tools can do; 2) as a vector for knowledge and technology transfer, getting these tools (or at least, ideas about what they can do) out into industry; 3) for students so inclined, recruiting those students for the further development of the tools, recruiting power users to help drive requirements for future iterations of the tools, and so on. If the journalism students are going to be the “engineers” to the data wrangler technicians, it’ll be good for them to know the sorts of things they can reasonably ask their technicians to help them to do… Which is to say, the journalists need exposing to the data wrangling factory floor.
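To make the “access to clean data” point from the list above a little more concrete, here’s a minimal Python sketch of the sort of help a computer needs with name variants, numbers that contain punctuation, and dates in mixed formats. The lookup table, the list of date formats and the function names are all my own assumptions for illustration, not a recommended cleaning recipe; in particular, the date parsing assumes day-first (UK style) dates.

```python
import re
from datetime import datetime

# Hypothetical lookup of known name variants; in practice this gets built up by hand.
CANONICAL_NAMES = {
    "open uni": "Open University",
    "ou": "Open University",
    "open university": "Open University",
}

# Date formats we are prepared to recognise (an assumption; extend as needed).
DATE_FORMATS = ("%d/%m/%Y", "%Y-%m-%d", "%d %B %Y", "%d %b %Y")


def normalise_name(raw):
    """Map known variants onto one canonical label; pass unknown names through."""
    key = " ".join(raw.split()).lower()
    return CANONICAL_NAMES.get(key, raw.strip())


def parse_number(raw):
    """Strip currency symbols, commas and whitespace, then convert to a float."""
    return float(re.sub(r"[£$,\s]", "", raw))


def parse_date(raw):
    """Try each known format in turn; return None if nothing matches."""
    for fmt in DATE_FORMATS:
        try:
            return datetime.strptime(raw.strip(), fmt).date()
        except ValueError:
            continue
    return None


print(normalise_name(" OU "))        # a known variant
print(parse_number("£1,234.50"))     # a "number" with punctuation in it
print(parse_date("31/07/2012"))      # a day-first date
```

Anything the sketch can’t recognise is passed through (names) or returned as None (dates), so the oddities stay visible rather than being silently “fixed”.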
Although a lot of the #ddj posts on this OUseful.info blog relate to tools, the subtext is all about recognising data as a medium, the form particular datasets take, and the way in which different tools can be used to work with these forms. In part this leads to a consideration of the process questions that can be asked of a data source based on identifying natural representations that may be contained within it (albeit in hidden form). For example, a list of MPs hints at a list of constituencies, which have locations, and therefore may benefit from representation in a geographical, map based form; a collection of emails might hint at a timeline based reconstruction, or network analysis showing who corresponded with whom (and in what order), maybe?
And finally, something that I think is still lacking in the formulation of data journalism as a practice is an articulation of the process of discovering the stories from data: I like the notion of “conversations with data” and this is something I’ll try to develop over forthcoming blog posts.
PS see also @dkernohan’s The campaigning academic?. At the risk of spoiling the punchline (you should nevertheless go and read the whole thing), David writes: “There is a space – in the gap between academia and journalism, somewhere in the vicinity of the digital humanities movement – for what I would call the “campaigning academic”, someone who is supported (in a similar way to traditional research funding) to investigate issues of interest and to report back in a variety of accessible media. Maybe this “reporting back” could build up into equivalence to an academic reward, maybe not.
These would be cross-disciplinary scholars, not tied to a particular critical perspective or methodology. And they would likely be highly networked, linking in both to the interested and the involved in any particular area – at times becoming both. They might have a high media profile and an accessible style (Ben Goldacre comes to mind). Or they might be an anonymous but fascinating blogger (whoever it is that does the wonderful Public Policy and The Past). Or anything in between.
But they would campaign, they would investigate, they would expose and they would analyse. Bringing together academic and old-school journalistic standards of integrity and verifiability.”
Mixed up in my head – and I think in David’s – is the question of “public accounting”, as well as sensemaking around current events and trends, and the extent to which it’s the role of “the media” or “academic” to perform such a function. I think there’s much to be said for reimagining how we inform and educate in a network-centric web-based world, and it’s yet another of those things on my list of things I intend to ponder further… See also: From Academic Privilege to Consultations as Peer Review.
I’m starting to feel as if I need to do myself a weekly round-up, or newsletter, on open data, if only to keep track of what’s happening and how it’s being represented. Today, for example, the Commons Public Accounts Committee published a report on Implementing the Transparency Agenda.
From a data wrangling point of view, it was interesting that the committee picked up on the following point in its Conclusions and recommendations (thanks for the direct link, Hadley:-), whilst also missing the point…:
2. The presentation of much government data is poor. The Cabinet Office recognises problems with the functionality and usability of its data.gov.uk portal. Government efforts to help users access data, as in crime maps and the schools performance website, have yielded better rates of access. But simply dumping data without appropriate interpretation can be of limited use and frustrating. Four out of five people who visit the Government website leave it immediately without accessing links to data. So there is a clear benefit to the public when government data is analysed and interpreted by third parties – whether that be, for example, by think-tanks, journalists, or those developing online products and smartphone applications. Indeed, the success of the transparency agenda depends on such broader use of public data. The Cabinet Office should ensure that:
– the publication of data is accessible and easily understood by all; and
– where government wants to encourage user choice, there are clear criteria to determine whether government itself should repackage information to promote public use, or whether this should be done by third parties.
A great example of how data not quite being published consistently can cause all sorts of grief when trying to aggregate it came to my attention yesterday via @lauriej:
Laura James (@LaurieJ) July 31, 2012
It leads to a game where you can help make sense of not quite right column names used to describe open spending data… (I have to admit, I found the instructions a little hard to follow – a screenshot walk-through example would have helped? It is, after all, largely a visual pattern matching exercise…)
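I don’t know how the game itself works under the hood, but as a rough illustration of the underlying problem, here’s a sketch of matching not quite right column headers against a standard set using fuzzy string matching from Python’s standard library. The list of “standard” column names and the 0.6 similarity cutoff are assumptions I’ve made up for the example.

```python
import difflib

# Hypothetical target schema; a real spending data standard may well differ.
STANDARD_COLUMNS = ["Department", "Entity", "Date", "Expense Type", "Supplier", "Amount"]


def match_column(header, standard=STANDARD_COLUMNS, cutoff=0.6):
    """Return the closest standard column name, or None if nothing is close enough."""
    lookup = {name.lower(): name for name in standard}
    # Tidy up underscores, stray whitespace and case before comparing.
    cleaned = " ".join(header.replace("_", " ").split()).lower()
    hits = difflib.get_close_matches(cleaned, list(lookup), n=1, cutoff=cutoff)
    return lookup[hits[0]] if hits else None


for header in ["Supplier Name", "Expense_type", "amount", "zzz"]:
    print(header, "->", match_column(header))
```

Headers that don’t get a confident match come back as None, which is where a human pattern matcher (or a game) comes in.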
From a spend mapping perspective, this is also relevant:
6. We are concerned that ‘commercial confidentiality’ may be used as an inappropriate reason for non-disclosure of data. If transparency is to be meaningful and comprehensive, private organisations providing public services under contract must make available all relevant public information. The Cabinet Office should set out policies and guidance for public bodies to build full information requirements into their contractual agreements, in a consistent way. Transparency on contract pricing which is often hidden behind commercial confidentiality clauses would help to drive down costs to the taxpayer.
And from a knowing “what the hell is going on?” perspective, there was also this:
7. Departments do not make it easy for users to understand the full range of information available to them. Public bodies have not generally provided full inventories of all of the information they hold, and which may be available for disclosure. The Cabinet Office should develop guidance for departments on information inventories, covering, for example, classes of information, formats, accuracy and availability; and it should mandate publication of the inventories, in an easily accessible way.
The publication of government department open data strategies may go some way to improving this. I’ve also been of a mind that more accessible ways of releasing data burden reporting requirements could help clarify what “working data” is available, in what form, and the ways in which it is routinely being generated and passed between bodies. Sorting out better pathways between FOI releases of data and the then regular release of such data as opendata is also something I keep wittering on about (eg FOI Signals on Useful Open Data? and The FOI Route to Real (Fake) Open Data via WhatDoTheyKnow).
From within the report, I also found a reiteration of this point notable:
This Committee has previously argued that it is vital that we and the public can access data from private companies who contract to provide public services. We must be able to follow the taxpayers’ pound wherever it is spent. The way contracts are presently written does not enable us to override rules about commercial confidentiality. Data on public contracts delivered by private contractors must be available for scrutiny by Parliament and the public. Examples we have previously highlighted include the lack of transparency of financial information relating to the Private Finance Initiative and welfare to work contractors.
…not least because data releases from companies are also being addressed on another front, midata, most notably via the recently announced BIS Midata 2012 review and consultation [consultation doc PDF]. For example, the consultation document suggests:
1.10 The Government is not seeking to require the release of data electronically at this stage, and instead is proposing to take a power to do so. The Secretary of State would then have to make an order to give effect to the power. An order making power, if utilised, would compel suppliers of services and goods to provide to their customers, upon request, historic transaction/ consumption data in a machine readable format. The requirement would only apply to businesses that already hold this information electronically about individual consumers.
1.11. Data would only have to be released electronically at the request of the consumer and would be restricted to an individual’s consumption and transaction data, since in our view this can be used to better understand consumers’ behaviour. It would not cover any proprietary analysis of the data, which has been done for its own purposes by the business receiving the request.
(More powers to the Minister then…?!) I wonder how this requirement would extend rights available under the Data Protection Act (and why couldn’t that act be extended? For example, Data Protection Principle 6 includes “a right of access to a copy of the information comprised in their personal data” – couldn’t that be extended to include transaction data, suitably defined? Though I note 1.20. There are a number of different enforcement bodies that might be involved in enforcing midata. Data protection is enforced by the Information Commissioner’s Office (ICO), whilst the Office of Fair Trading (OFT), Trading Standards and sector regulators currently enforce consumer protection law. and Question 17: Which body/bodies is/are best placed to perform the enforcement role for this right?) There are so many bits of law around relating to data that I don’t understand at all that I think I need to do myself an uncourse on them… (I also need to map out the various panels, committees and groups that have an open data interest… The latest, of course, is the Open Data User Group (ODUG), the minutes of whose first meeting were released some time ago now, although not in a directly web friendly format…)
The consultation goes on:
1.18. For midata to work well the data needs be made available to the consumer in electronic format as quickly as possible following a request (maybe immediately) and as inexpensively as possible. This will minimise friction and ensure that consumers are able to access meaningful data at the point it is most useful to them. This requirement will only cover data that is already held electronically at the time of the request so we expect that the time needed to respond to a consumer’s request will be short – in many cases instant
Does the Data Protection Act require the release of data in an electronic format, and ideally a structured electronic format (i.e. as something resembling a dataset)? The recent Protection of Freedoms Act amended the FOI Act with language relating to the definition and release of datasets, so I wonder if this approach might extend elsewhere?
Coming at the transparency thing from another direction, I also note with interest (via the BBC) that MPs say all lobbyists should be on new register:
All lobbyists, including charities, think tanks and unions, should be subject to new lobbying regulation, a group of MPs have said. They criticised government plans to bring in a statutory register for third-party lobbyists, such as PR firms, only. They said the plan would “do nothing to improve transparency”. Instead, the MPs said, regulation should be brought in to cover all those who lobby professionally.
This is surely a blocking move? If we can’t have a complete register, we shouldn’t have any register. So best not to have one at all for a year or two.. or three… or four… Haven’t they heard of bootstrapping and minimum viability releases?! Or maybe I got the wrong idea from the lead I took from the start of the news report? I guess I need to read what the MPs actually said in the Political and Constitutional Reform – Second Report: Introducing a statutory register of lobbyists.
PS For a round-up of other recent reports on open data, see OpenData Reports Round Up (Links…).
PPS This is also new to me: new UK Data Service “starting on 1 October 2012, [to] integrate the Economic and Social Data Service (ESDS), the Census Programme, the Secure Data Service and other elements of the data service infrastructure currently provided by the ESRC, including the UK Data Archive.”
A discussion, earlier, about whether it was now illegal to drink in public…
…I thought not, think not, at least, not generally… My understanding was, that local authorities can set up controlled, alcohol free zones and create some sort of civil offence for being caught drinking alcohol there. (As it is, councils can set up regions where public consumption of alcohol may be prohibited and this prohibition may be enforced by the police.) So surely there must be an #opendata powered ‘no drinking here’ map around somewhere? The sort of thing that might result from a newspaper hack day, something that could provide a handy layer on a pub map? I couldn’t find one, though…
I did a websearch, turned up The Local Authorities (Alcohol Consumption in Designated Public Places) Regulations 2007, which does indeed appear to be the bit of legislation that regulates drinking alcohol in public, along with a link to a corresponding guidance note: Home Office circular 013 / 2007:
16. The provisions of the CJPA [Criminal Justice and Police Act 2001, Chapter 2 Provisions for combatting alcohol-related disorder] should not lead to a comprehensive ban on drinking in the open air.
17. It is the case that where there have been no problems of nuisance or annoyance to the public or disorder having been associated with drinking in that place, then a designation order … would not be appropriate. However, experience to date on introducing DPPOs has found that introducing an Order can lead to nuisance or annoyance to the public or disorder associated with public drinking being displaced into immediately adjacent areas that have not been designated for this purpose. … It might therefore be appropriate for a local authority to designate a public area beyond that which is experiencing the immediate problems caused by anti-social drinking if police evidence suggests that the existing problem is likely to be displaced once the DPPO was in place. In which case the designated area could include the area to which the existing problems might be displaced.
Creepy, creep, creep…
This, I thought, was interesting too, in the guidance note:
37. To ensure that the public have full access to information about designation orders made under section 13 of the Act and for monitoring arrangements, Regulation 9 requires all local authorities to send a copy of any designation order to the Secretary of State as soon as reasonably practicable after it has been made.
38. The Home Office will continue to maintain a list of all areas designated under the 2001 Act on the Home Office website: www.crimereduction.gov.uk/alcoholorders01.htm [I'm not convinced that URL works any more...?]
39. In addition, local authorities may wish to consider publicising designation orders made on their own websites, in addition to the publicity requirements of the accompanying Regulations, to help to ensure full public accessibility to this information.
So I’m thinking: this sort of thing could be a great candidate for a guidance note from the Home Office to local councils recommending ways of releasing information about the extent of designation orders as open geodata. (Related? Update from ONS on data interoperability (“Overcoming the incompatibility of statistical and geographic information systems”).)
I couldn’t immediately find a search on data.gov.uk that would turn up related datasets (though presumably the Home Office is aggregating this data, even if it’s just in a filing cabinet or mail folder somewhere*), but a quick websearch for Designated Public Places site:gov.uk intitle:council turned up a wide selection of local council websites along with their myriad ways of interpreting how to release the data. I’m not sure if any of them release the data as geodata, though? Maybe this would be an appropriate test of the scope of the Protection of Freedoms Act Part 6 regulations on the right to request data as data (I need to check them again…)?
* The Home Office did release a table of designated public places in response to an FOI request about designated public place orders, although not as data… But it got me wondering: if I scheduled a monthly FOI request to the Home Office requesting the data on a monthly basis, would they soon stop fulfilling the requests as timewasting? How about if we got a rota going?! Is there any notion of a longitudinal/persistent FOI request, that just keeps on giving (could I request the list of designated public places the Home Office has been informed about over the last year, along with a monthly update of requests in the previous month (or previous month but one, or whatever is reasonable…) over the next 18 months, or two years, or for the life of the regulation, or until such a time as the data is published as open data on a regular basis?
As for the report to government that a local authority must make on passing a designation order – 9. A copy of any order shall be sent to the Secretary of State as soon as reasonably practicable after it has been made. – it seems that how the area denoted as a public space is described is moot: “5. Before making an order, a local authority shall cause to be published in a newspaper circulating in its area a notice— (a)identifying specifically or by description the place proposed to be identified;“. Hmmm, two things jump out there…
Firstly, a local authority shall cause to be published in a newspaper circulating in its area [my emphasis; how is a newspaper circulating in its area defined? Do all areas of England have a non-national newspaper circulating in that area? Does this implicitly denote some "official channel" responsibility on local newspapers for the communication of local government notices?]. Hmmm…..
Secondly, the area identified specifically or by description. On commencement, the order must also be made public by “identifying the place which has been identified in the order”, again “in a newspaper circulating in its area”. But I wonder – is there an opportunity there to require something along the lines of and published using an appropriate open data standard in an open public data repository, and maybe further require that this open public data copy is the one that is used as part of the submission informing the Home Office about the regulation? And if we go overboard, how about we further require that each enacted and proposed order is published as such along with a machine readable geodata description, and that single aggregate files containing all that Local Authority’s current and planned Designated Public Spaces are also published (so one URL for all current spaces, one for all planned ones). Just by the by, does anyone know of any local councils publishing boundary data/shapefiles that mark out their Designated Public Spaces? Please let me know via the comments, if so…
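As a sketch of what a machine readable geodata description might look like, here’s one way of building GeoJSON files for a council’s current and planned designated areas in Python. GeoJSON is just one candidate format (I’m not suggesting it’s the standard anyone has settled on), and the place names, property names and coordinates below are all invented for the example.

```python
import json


def designation_feature(name, authority, status, ring):
    """Build one GeoJSON Feature for a designated area.

    `ring` is a list of (longitude, latitude) pairs; the first point is
    repeated at the end if needed so that the polygon is closed.
    """
    coords = [list(point) for point in ring]
    if coords[0] != coords[-1]:
        coords.append(list(coords[0]))
    return {
        "type": "Feature",
        "properties": {"name": name, "authority": authority, "status": status},
        "geometry": {"type": "Polygon", "coordinates": [coords]},
    }


def feature_collection(features, status):
    """Aggregate all the features with a given status into one collection."""
    selected = [f for f in features if f["properties"]["status"] == status]
    return {"type": "FeatureCollection", "features": selected}


# Invented example areas; the coordinates do not describe real orders.
areas = [
    designation_feature("Example Square", "Example Council", "current",
                        [(-1.30, 50.70), (-1.29, 50.70), (-1.29, 50.71)]),
    designation_feature("Example Park", "Example Council", "planned",
                        [(-1.20, 50.60), (-1.10, 50.60), (-1.10, 50.70)]),
]

# One document for current areas, one for planned ones, as suggested above.
print(json.dumps(feature_collection(areas, "current"), indent=2))
print(json.dumps(feature_collection(areas, "planned"), indent=2))
```

Each of the two JSON documents could then sit at its own stable URL, one for all current spaces and one for all planned ones.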
A couple of other, very loosely (alcohol) related, things I found along the way:
- Local Alcohol Profiles for England: the aim appears to have been the collation of, and a way of exploring, a “national alcohol dataset”, that maps alcohol related health indicators on a PCT (Primary Care Trust) and LA (local authority) basis. What this immediately got me wondering was: did they produce any tooling, recipes or infrastructure that would make it a few clicks easy to pull together a national tobacco dataset and associated website, for example? And then I found the Local Tobacco Control Profiles for England toolkit on the London Health Observatory website, along with a load of other public health observatories and it made me remember – again – just how many data sensemaking websites there already are out there…
- UK Alcohol Strategy – maybe some leads into other datasets/data stories?
PS I wonder if any of the London Boroughs or councils hosting regional events have recently declared any new Designated Public Spaces #becauseOfTheOlympics.
It feels like there are just too many opendata reports being published at the moment to know which ones to read? They do potentially provide lots of possible content for structured reading exercises in an (open) data course though….?
Here’s a list of some of the reports I’ve noticed recently, and that I haven’t really had time to read and digest properly:-(
- Open Data White Paper: Unleashing the Potential (Cabinet Office, June 2012)
- Implementing transparency (National Audit Office (NAO), April 2012)
- Report on Using Open Data: policy modeling, citizen empowerment, data journalism (W3C, June 2012)
- The Data Dividend (Demos, March 2012)
- The Big Data Opportunity: Making government faster, smarter and more personal (Policy Exchange/lobbyists, July 2012)
- Open data and charities: a state of the art review (Nominet Trust, July 2012)
- Open data dialogue final report (RCUK, June 2012)
- Open Data in Cultural Heritage Institutions (EPSI Platform, May 2012)
- Open Aid Data (EPSI Platform, May 2012)
Whilst not specifically about open data, these are also related to the whole data and openness thang:
- Defining and defending consumer interests in the digital age (Ctrl-Shift/Consumer Focus, June 2012)
- #Intelligence (Demos, May 2012)
- Data Jujitsu: The art of turning data into product (O’Reilly, July 2012)
- Science as an open enterprise (Royal Society, June 2012)
UK Gov Departments also published their open data strategies – they’re linked to from here: UK Gov Departmental Open Data Strategies.
PS I’m not sure if an English translation of this report (in Dutch) on Internal Business Models for Open Government Data is available anywhere?
Via a BIS press release earlier this week – Better access to public sector information moves a step closer – it seems that the Data Strategy Board is on its way, along with a Public Data Group and an Open Data User Group (these are separate from the yet to be constituted Open Standards Board (if you’re quick, the deadline for membership of the board is tomorrow: Open Standards Board – Volunteer Members and Board Advisers, – Ref:1238758) and its feeder Open Data Standards, and Open Technical Standards panels).
So what does the press release promise?
A new independently chaired Data Strategy Board (DSB) will advise Ministers on what data should be released [will this draw on data requests made to data.gov.uk, I wonder? - TH] and has the potential to unlock growth opportunities for businesses across the UK. At least one in three members of the DSB will be from outside government, including representatives of data re-users.
The DSB will work with the Public Data Group (PDG) – which consists of Trading Funds the Met Office, Ordnance Survey, Land Registry and Companies House – to provide a more consistent approach to improving access to public sector information. These organisations have already made some data available, which has provided opportunities for developers and entrepreneurs to create imaginative ways to develop or start up their own businesses based on high quality data.
Looking at the Terms of reference for the Data Strategy Board & the Public Data Group, we can broadly see how they’re organised:
Three departmental agendas then…?! A good sign, or, erm..?! (I haven’t read the Terms of reference properly yet – that’s maybe for another post…)
How these fit in with the Public Sector Transparency Board and the Local Public Data Panel, I’m not quite sure, though it might be quite interesting to try and map out the strong and weak ties between them once their memberships are announced? It’d also be interesting to know whether there’d be any mechanism for linking in with open data standards recommendations and development (via the Standards Hub process), to ensure that as and when data gets released, there is at least an eye towards releasing it in a usable form!
The Government is making £7m available from April 2013 for the DSB to purchase additional data for free release from the Trading Funds and potentially other public sector organisations, funded by efficiency savings. An Open Data User Group, which will be made up of representatives from the Open Data community, will be directly involved in decisions on the release of Open Data, advising the DSB on what data to purchase from the Trading Funds and other public organisations and release free of charge.
So the DSB is a pseudo-cartel of sort-of government data providers (the Trading Funds) who are being given £7 million or so to open up data that the public purse (I think?) paid them to collect. The cash is there to offset the charges they would otherwise have made selling the data. (Erm… so, in order for those agencies to give their data away for free, we have to pay them to do it? Right… got it…) Presumably, the DSB members won’t be on the ODUG, who will be advising the DSB on what data to purchase from the Trading Funds and other public organisations and release free of charge (my emphasis). Note the explicit recognition here that free actually costs. In this case, public bodies are having data central gov paid them to collect bought off them by central gov so (central gov, or the bodies themselves) can then release it “for free”? Good. That’s clear then…
Francis Maude also clarifies this point: “The new structure for Open Data will ensure a more inclusive discussion, including private sector data users, on future data releases, how they should be paid for and which should be available free of charge.”
In addition: The DSB will provide evidence on how data from the Trading Funds – including what is released free of charge – will generate economic growth and social benefit. It will act as an intelligent customer advising Government on commissioning and purchasing key data and services from the PDG, and ensuring the best deal for the taxpayer. So maybe this means the Public Sector Transparency Board will now focus more on “public good” and “transparency” arguments, leaving the DSB to demonstrate the financial returns of open data?
The Open Data User Group (ODUG) [will] support the work of the new Data Strategy Board (DSB). [The position of Chair of the group is currently being advertised, if you fancy it...: Chair of Open Data User Group, - Ref:1240914 -TH]. The ODUG will advise the DSB on public sector data that should be prioritised for release as open data, to the benefit of the UK.
As part of the process, an open suggestion site has been set up using the Delib Dialogue app to ask “the community” How should the Open Data User Group engage with users and re-users of Open Data?: [i]n advance of appointing a Chair and Members of the group, the Cabinet Office wants to bring together suggestions for how the ODUG should go about this engagement with wider users and re-users. We are looking for ideas about things like how the ODUG should gather evidence for the release of open data, how it should develop it’s advice to the DSB, how it should run its meetings and how it should keep the wider community up to date on developments (as well as other ideas you have).
A Twitter account has also been pre-emptively set up to manage some of the social media engagement activities of the group: @oduguk
The account currently has just over a couple of hundred followers, so I grabbed the list of all the folk they follow, then graphed folk followed by 30 or more current followers of @oduguk.
Here’s the graph, laid out in Gephi using a force directed layout, with nodes coloured according to modularity group and sized by eigenvector centrality:
Here’s the same graph with nodes sized by betweenness centrality:
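The filtering step behind those graphs (keep only the accounts followed by some minimum number of @oduguk’s followers) can be sketched in a few lines of Python. This isn’t the script I actually used, just an illustration of the idea; it assumes the friends lists have already been fetched into a dictionary, and the function name and toy data are invented.

```python
from collections import Counter


def commonly_followed(friends_of, min_followers=30):
    """Find accounts followed by at least `min_followers` of the followers.

    `friends_of` maps each follower's name to the list of accounts they follow.
    Returns the set of commonly followed accounts, plus a sorted list of
    (follower, account) edges restricted to those accounts.
    """
    counts = Counter()
    for friends in friends_of.values():
        counts.update(set(friends))  # count each follower at most once per account
    keep = {account for account, n in counts.items() if n >= min_followers}
    edges = sorted(
        (follower, friend)
        for follower, friends in friends_of.items()
        for friend in set(friends)
        if friend in keep
    )
    return keep, edges


# Toy data with a threshold of 2 rather than 30.
toy = {"a": ["x", "y"], "b": ["x", "z"], "c": ["x", "y"]}
keep, edges = commonly_followed(toy, min_followers=2)
print(sorted(keep))
print(edges)
```

The edge list is the bit that then gets handed to a tool like Gephi for layout and the centrality calculations.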
By the by, responses to the Data Policy for a Public Data Corporation consultation have also been published, along with the Government response, which I haven’t had a chance to read yet… If I get a chance, I’ll try to post some thoughts/observations on that alongside a commentary on the terms of reference doc linked to above somewhere…
A recent provisional data release from the Ministry of Justice contains sentencing data from English(?) courts, at the offence level, for the period July 2010-June 2011: “Published for the first time every sentence handed down at each court in the country between July 2010 and June 2011, along with the age and ethnicity of each offender.” Criminal Justice Statistics in England and Wales [data]
In this post, I’ll describe a couple of ways of working with the data to produce some simple graphical summaries of the data using Google Fusion Tables and R…
…but first, a couple of observations:
- the web page subheading is “Quarterly update of statistics on criminal offences dealt with by the criminal justice system in England and Wales.”, but the sidebar includes the link to the 12 month set of sentencing data;
- the URL of the sentencing data is http://www.justice.gov.uk/downloads/publications/statistics-and-data/criminal-justice-stats/recordlevel.zip, which does not contain a time reference, although the data is time bound. What URL will be used if data for the period 7/11-6/12 is released in the same way next year?
The data is presented as a zipped CSV file, 5.4MB in the zipped form, and 134.1MB in the unzipped form.
The unzipped CSV file is too large to upload to a Google Spreadsheet or a Google Fusion Table, which are two of the tools I use for treating large CSV files as a database, so here are a couple of ways of getting in to the data using tools I have to hand…
Unix Command Line Tools
I’m on a Mac, so like Linux users I have ready access to a Console and several common unix commandline tools that are ideally suited to wrangling text files (on Windows, I suspect you need to install something like Cygwin; a search for windows unix utilities should turn up other alternatives too).
In Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line: EDINA OpenURL Logs and Postcards from a Text Processing Excursion I give a couple of examples of how to get started with some of the Unix utilities, which we can crib from in this case. So for example, after unzipping the recordlevel.csv document I can look at the first 10 rows by opening a console window, changing directory to the directory the file is in, and running head recordlevel.csv (by default, head prints the first 10 lines of a file).
Or I can pull out rows that contain a reference to the Isle of Wight using something like this command:
grep -i wight recordlevel.csv > recordsContainingWight.csv
(The -i reads: “ignoring case”; grep is a command that identifies rows that contain the search term (wight in this case). The > recordsContainingWight.csv says “send the result to the file recordsContainingWight.csv”.)
Having extracted rows that contain a reference to the Isle of Wight into a new file, I can upload this smaller file to a Google Spreadsheet, or to a Google Fusion Table such as this one: Isle of Wight Sentencing Fusion table.
Once in the fusion table, we can start to explore the data. So for example, we can aggregate the data around different values in a given column and then visualise the result (aggregate and filter options are available from the View menu; visualisation types are available from the Visualize menu):
We can also introduce filters to allow us to explore subsets of the data. For example, here are the offences committed by females aged 35+:
Looking at data from a single court may be of passing local interest, but the real data journalism is more likely to be focussed around finding mismatches between sentencing behaviour across different courts. (Hmm, unless we can get data on who passed sentences at a local level, and look to see if there are differences there?) That said, at a local level we could try to look for outliers maybe? As far as making comparisons goes, we do have Court and Force columns, so it would be possible to compare Force against Force and, within a Force area, Court with Court?
If you really want to start working the data, then R may be the way to go… I use RStudio to work with R, so it’s a simple matter to just import the whole of the recordlevel.csv dataset.
Once the data is loaded in, I can use a regular expression to pull out the subset of the data corresponding once again to sentencing on the Isle of Wight (I apply the regular expression to the contents of the court column):
recordlevel <- read.csv("~/data/recordlevel.csv")
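The filtering step itself doesn’t appear alongside the read.csv call above. A minimal sketch of what it might look like, assuming the court names live in a column called court (the post only refers to “the court column”, so the exact name is a guess) and using a small made-up data frame in place of the full file:

```r
# Made-up stand-in for the full recordlevel data frame. The column names
# (court, sex, AGE, Offence_type) are assumptions based on names used in this post.
recordlevel_demo <- data.frame(
  court = c("Isle of Wight", "Southampton", "ISLE OF WIGHT", "Portsmouth"),
  sex = c("Male", "Female", "Female", "Male"),
  AGE = c("18-24", "25-34", "35+", "18-24"),
  Offence_type = c("Theft", "Drugs", "Theft", "Fraud")
)

# grepl() returns TRUE for each row whose court name matches the pattern;
# ignore.case = TRUE plays the same role as the -i switch to grep.
iw_demo <- subset(recordlevel_demo, grepl("wight", court, ignore.case = TRUE))

print(nrow(iw_demo))
```

Run against the real data, the equivalent call would be something like iw <- subset(recordlevel, grepl("wight", court, ignore.case = TRUE)), which would give the iw data frame used in the ggplot examples later in this post.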
We can then start to produce simple statistical charts based on the data. For example, a bar plot of the sentencing numbers by age group:
barplot(age, main="IW: Sentencing by Age", xlab="Age Range")
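The age object passed to barplot isn’t defined in the text as it stands. One plausible way to build it, assuming an AGE column of age bands (the name used in the ggplot calls later in this post), is a simple frequency table; the data here is invented:

```r
# Invented sample of age bands standing in for iw$AGE
iw_demo <- data.frame(AGE = c("18-24", "25-34", "18-24", "35+", "18-24", "35+"))

# table() counts how many rows fall into each age band, giving a
# named set of counts that barplot() can draw directly.
age <- table(iw_demo$AGE)

print(age)
```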
We can also start to look at combinations of factors. For example, how do offence types vary with age?
barplot(ageOffence,beside=T,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))
If we remove the beside=T argument, we can produce a stacked bar chart:
barplot(ageOffence,las=3,cex.names=0.5,main="Isle of Wight Sentences", xlab=NULL, legend = rownames(ageOffence))
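Both of the barplot calls above depend on an ageOffence object that isn’t shown being created. Since barplot draws one group (or stack) of bars per column, and the legend is taken from the row names, a two-way table with age bands as rows and offence types as columns would fit. That orientation, like the column names, is an assumption. With invented data:

```r
# Invented sample standing in for the Isle of Wight subset
iw_demo <- data.frame(
  AGE = c("18-24", "18-24", "25-34", "35+", "35+", "35+"),
  Offence_type = c("Theft", "Drugs", "Theft", "Theft", "Fraud", "Fraud")
)

# Two-way contingency table: rows are age bands, columns are offence types
ageOffence <- table(iw_demo$AGE, iw_demo$Offence_type)

print(ageOffence)
```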
If we import the ggplot2 library, we have even more flexibility over the presentation of the graph, as well as what we can do with this sort of chart type. So for example, here’s a simple plot of the number of offences per offence type:
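#You may need to install ggplot2 as a library if it isn't already installed: install.packages("ggplot2")
library(ggplot2)
#Recent versions of ggplot2 use theme() and element_text() in place of opts() and theme_text()
ggplot(iw, aes(factor(Offence_type)))+ geom_bar() + theme(axis.text.x=element_text(angle=-90))+xlab('Offence Type')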
Alternatively, we can break down offence types by age:
ggplot(iw, aes(AGE))+ geom_bar() +facet_wrap(~Offence_type)
We can bring a bit of colour into a stacked plot that also displays the gender split on each offence:
ggplot(iw, aes(AGE,fill=sex))+geom_bar() +facet_wrap(~Offence_type)
One thing I’m not sure how to do is rip the data apart in a ggplot context so that we can display percentage breakdowns, so we could compare the percentage breakdown by offence type on sentences awarded to males vs. females, for example? If you do know how to do that, please post a comment below ;-)
PS Here’s an easy way of getting started with ggplot… use the online hosted version at http://www.yeroon.net/ggplot2/ using this data set: wightCrimRecords.csv; download the file to your computer then upload it as shown below:
PPS I got a little way towards identifying percentage breakdowns using a crib from here. The following command:
generates a (multidimensional) array for the responseVar (Offence) about the groupVar (sex). I don’t know how to generate a single data frame from this, but we can create separate ones for each sex as follows:
We can then plot these percentages using constructions of the form:
What I haven’t worked out how to do is elegantly map from the multidimensional array to a single data.frame? If you know how, please add a comment below…(I also posted a question on Cross Validated, the stats bit of Stack Exchange…)
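One possible route from a cross-tabulation to a single data frame, offered as a sketch rather than as the approach cribbed above: as.data.frame() applied to a table object returns one row per combination of factor levels, so the percentages for both sexes end up in the same long-format frame. The data and the percent column name are made up for illustration; sex and Offence_type are the column names used in the plots earlier in this post.

```r
# Invented sample standing in for the Isle of Wight subset
iw_demo <- data.frame(
  sex = c("Male", "Male", "Male", "Male", "Female", "Female"),
  Offence_type = c("Theft", "Theft", "Drugs", "Fraud", "Theft", "Fraud")
)

# Cross-tabulate, naming the dimensions so the names carry through
counts <- table(sex = iw_demo$sex, Offence_type = iw_demo$Offence_type)

# margin = 1 divides each row by its row total, i.e. shares within each sex
pct_tbl <- prop.table(counts, margin = 1) * 100

# Flatten to one row per (sex, Offence_type) pair
pct <- as.data.frame(as.table(pct_tbl), responseName = "percent")
print(pct)

# Optional: build a faceted percentage plot, only if ggplot2 is installed
if (requireNamespace("ggplot2", quietly = TRUE)) {
  library(ggplot2)
  p <- ggplot(pct, aes(Offence_type, percent)) +
    geom_bar(stat = "identity") +
    facet_wrap(~sex)
}
```

Because the percentages are calculated within each sex, the bars in each facet sum to 100, which makes the male and female offence profiles directly comparable even when the raw counts differ widely.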
Quick core dump of thoughts, largely culled from things I’ve doodled before… Needs much more work, but time is running out on me… So any/all comments appreciated….
***1. Do the definitions of the key terms go far enough or too far?
“Public services are either provided by public bodies, or providers who have been funded, commissioned or established by statute to provide a service”
I assume the definitions of open data and public services are to be taken together, with the consultation focussing on ‘open (public) data produced by public services’? For such bodies, I assume there is also a formal “data burden” that defines the public data reporting requirements to the centre, as well as devolved data burdens eg into local government from schools? Would it make sense to clarify the notion and extent of data burdens, and the extent to which elements of these (and the organisations they apply to) should be subject to open data requirements? I guess there is also a data burden placed on individual citizens in respect of filing tax forms, for example, that are not subject to openness requirements?
A clear statement at least of data burdens/formal reporting requirements between public bodies that are in scope for mandatory release as open public data should be made available, eg along the lines of http://www.communities.gov.uk/localgovernment/decentralisation/tacklingburdens/singledatalist/ http://getthedata.org/questions/500/data-burden-on-uk-higher-education (I know some work has already been done on this that I used as the basis for a simple data burden visualisation exercise ( http://www.flickr.com/photos/psychemedia/5536836259/ ).)
It may be useful to distinguish between data collected for operational, administrative or statistical use, as well as the extent to which data produced in the normal course of events is being legitimately requested as is, or whether it must be processed before release (eg http://www.adls.ac.uk/what-is-administrative-data-and-why-use-it-for-research/ http://www.unsiap.or.jp/ms/ms7/DennisP1_OppoChalle.pdf ).
It may also be worth distinguishing between the release of complete data sets, views over the data that represent a query on a complete dataset, and queries, sampling procedures or any other means that are used to generate those data views. For example, providing data relating to performance indicators for a particular school in response to an FOI request from a citizen equates to the provision of a particular view over the database containing performance indicators for UK schools as a whole; providing a copy of the database as a whole to a developer of a school comparison website represents the provision of a complete dataset.
Datasets may provide value to others in a variety of ways, for example:
- using complete datasets as the basis of comparison or recommendation services;
- using complete datasets to support statistical analyses, segmentation/clustering of data;
- generating very particular or specific views over the dataset by constructing meaningful and appropriate queries on the datasets. Queries are also reusable, and whilst some cost may be incurred in creating them, making them open, and suitably parameterising them, the marginal cost of reusing the queries is then minimal. It is possible that queries that take a long time to create/optimise become valuable in their own right, and that the dataset and the view can be given away freely. The query unlocks value in the dataset and delivers it to the requester. When it comes to government reporting, where reports include summary views over open datasets, the openness/transparency requirement should not be deemed to be met unless the query that generates the view from the dataset is also openly published.
Datasets may also include recordsets relating to an individual; where personal access to personal data/mydata is possible, we need to distinguish between the private/personal right of an individual (or an agent acting on their behalf and with their permission) to access their data, as opposed to general public access.
***2. Where a decision is being taken about whether to make a dataset open, what tests should be applied?
If data is part of a formally defined data burden, should that data burden be tiered in terms of openness requirements, for example along lines of:
- open on submission to the centre;
- open following embargo period and subject to checking by the centre, but with the presumption that it will be opened;
- not open;
Where data is FOIable, that may be taken as evidence in favour of presumed openness. If data is regularly requested via FOI, it could be made available in open form as a matter of course in order to reduce FOI overheads in the future. When data is released via FOI, it could be made available via an open data site in partial fulfilment of handling the FOI request. When responding to FOI requests for data, the process required to obtain and release that data could be captured and compared with the actual processes relating to operational and administrative use of that data in order to identify whether an open data tap can be introduced into the current data process to open it as a matter of course, or release it efficiently in response to an FOI request.
As the major producers and consumers of public data, public bodies are well placed to benefit from more open public data. “Publicness” and “openness” both help make data accessible for use within and between public bodies, as well as for reuse by third parties; accessibility is also improved by timely release of data, and the publication of data using open standards and formats.
Consequences of making data open should also be considered; for example, once released, will there be continued access to regular updates of the data using the same format? (If the data is released sporadically and with inconsistent formats, services that automate the regular collection of the data are not really viable.)
***3. If the costs to publish or release data are not judged to represent value for money, to what extent should the requestor be required to pay for public services data, and under what circumstances?
Where work must be done that does not represent value for money (what would an example of this be? Having to get data into a form the public body would never use?), it may be appropriate to consider the amount of value that is added in processing the data that the requester might otherwise be expected to add, for example as just reward for the cost of processing that data. If the raw data is open, and the requester asks for processed data, it may be appropriate to give the raw data away freely but charge for the value add of processing it that the requester seeks to exploit in the course of their business? However, there will also be a tension between people who want to gain access to a small amount of data, either for personal use or for research/innovation purposes, and companies who make use of that data in volume as part of a business. In the latter case, we might expect some payment for use of the data once the business is operating, although it could be argued that if the business is profitable, there is a return built in through taxation.
A balance may need to be struck based on the number of independent requests that are likely to be received for a particular data set and the use they wish to put it to. If N requests are made for the data, and all N parties need to do the same work cleaning or processing the data in the same way, that is obviously inefficient. It may be that third parties process and repackage data, for a fee. But the question arises – if data as published is not fit for use by third parties, is it fit for use by the first (producer) or second (‘official consumer’) party, or has the data been produced solely in response to some openness criteria, and not because the data is actually used for anything?
The ability to save cost elsewhere in government may also be an issue. For example, local authorities who make disbursements to care homes need to guard against fraud by regularly checking death reports, often through the purchase of commercial death registers or by checking the local newspaper’s death notices. Whilst a cost may be associated with signatories of death certificates ensuring this data enters the public body data chain in an accessible and open way, it may well save costs in multiple other areas of government.
Where data is processed and released in exchange for a payment, would it be possible for the raw underlying data also to be made available for free, so that third parties can, at their own expense, carry out the required processing if they can do so for less overall cost than piecewise purchase of data from the public body?
***4. How do we get the right balance in relation to the range of organisations (providers of public services) our policy proposals apply to? What threshold would be appropriate to determine the range of public services in scope and what key criteria should inform this?
If an organisation is subject to FOI requests, or data it produces and returns as part of an official data burden may be requested through FOI requests, it should be in scope?
Analysis of data processes associated with fulfilling data burden requirements might provide a basis for identifying where in a data process data might reasonably be made public and open.
***5. What would be appropriate mechanisms to encourage or ensure publication of data by public service providers?
If data related FOI responses are published via open data sites, the open data site can become a repository of commonly requested data and help identify which processes might benefit from releasing open data as a matter of course.
Where public data is reported as a matter of course by the local press and in the local interest, (for example, court reports, planning notices, traffic notices), public bodies might be encouraged to publish the corresponding data in an open way in order to facilitate the local dissemination of that information. Note that much of this data is transitory/may only be relevant for a limited period. In this case, we need to consider: whether there is a public interest in making the data publicly available and open on an archival basis, or not providing archives per se, but responding to requests for archival copies of data; the extent to which third parties can archive/aggregate such data and continue to make it available; whether there are privacy reasons for not supporting archival access (for example, court reports in local newspapers have a “short memory”).
Are there guidelines available that cover the interactions between things like:
- data eligible for release under FOI;
- data that may be redacted on Data Protection Act grounds;
- data covered by Database Right or data that is covered by copyright;
- data released through National Statistics ( http://www.legislation.gov.uk/ukpga/2007/18/contents );
- reusable public sector information ( http://www.legislation.gov.uk/uksi/2005/1515/contents/made ).
Analysis of data burden reporting process might identify appropriate points at which data can be made open as part of the process. For example, reported data may be posted to an open data site from where it is collected (“pull reporting”). See also: http://blog.ouseful.info/2011/03/18/open-data-processes-taps-query-pathsaudit-trails-and-round-tripping/
And as I responded to the PDC Engagement Exercise, [o]ne particular class of data that interests me is data that is:
1) reported by a local organisation to a central body;
2) using a standardised, templated reporting format,
3) and that is FOIable either from the local organisation, and/or from the central body.
For example, in Higher Education, this might include data on library usage as reported to SCONUL, or marketing information about courses submitted to UCAS.
It can often be hard to find out how to phrase an FOI request to obtain this data as submitted, unless you know the type of reporting form used to submit it.
What I would like to see is the Public Data Corporation acting in part as a Public Data Exchange Directory, showing how different classes of public organisation make standard (public data containing) reports to other public organisations, detailing the standard report formats, with names/identifiers for those forms if appropriate, and describing which sections of the report are FOIable. This could also link in to the list of local council data burdens, for example ( http://www.communities.gov.uk/… ) and/or the code of practice for local authority transparency ( http://www.communities.gov.uk/… )
The next step would be to introduce a pubsub (publish-subscribe) model in the reporting chain for reporting documents* that are wholly FOIable. This could happen in several ways:
A) /open report publication/ – the publishing organisation could post their report to their opendata reporting store, and the consuming organisation (the one to which the report was being made) would subscribe to that store, collecting the data from there as it was published; third parties could also subscribe to the local publishing store and be alerted to reports as they are published. If co-publication to the central organisation and the public is not appropriate, the report could be withheld from public/press consumption for a specified period of days, or published to the press but not the public under embargo.
B) /open deposit/ – the publishing organisation publishes the report/data to an open deposit box owned by the central organisation which is receiving the report. After a specified period of time, the report is made public (ie published) via that central deposit box.
C) /data corp in the middle/ – a centralised architecture in which local organisations submit public reports to a Public Data Exchange, which then passes them on to the central body to which reports are made, and publishes them to the public, maybe after a fixed period of time.
The intention of all three approaches described above is to provide an open window onto the reporting chain. At the current time, open public data tends to be data that is published via a separate branch “to the public”. In contrast, the above approach suggests that public data publication acts as a view onto all or part of the data as it goes about its daily business being published from one organisation to another. That is, public data publication becomes a “tap” onto a dataflow/workflow process.
If one of the desires for data exploitation is to help introduce efficiencies as well as reuse in data related activities, third parties need to be able to work with data as it is currently used.
***How will we ensure that Open Data standards are embedded in new ICT contracts?
By providing a test suite as part of the contract that includes tests such as running data import/export/query operations against centralised validation services.
***What is the best way to achieve compliance on high and common standards to allow usability and interoperability?
Require data reporting to proceed through open interfaces or interfaces where public data taps can be applied. Released data should be authentic, and representative of data used as part of a public body’s activities or reporting duties rather than data that is produced purely for release on an open data site.
***How would we ensure that public service providers in their day to day decisionmaking honour a commitment to Open Data, while respecting privacy and security considerations?
Take a lead from open source software projects and publish requests via an issue tracker that can show when an ‘issue’ was raised, what its current status is, and how it was resolved. Related approaches include services like WhatDoTheyKnow or GetTheData.
***How should public services make use of data inventories? What is the optimal way to develop and operate this?
If we distinguish between datasets, queries on datasets, and reports/data views generated by queries on datasets on the one hand, and data burdens on the other, we can start to map out how queries are used on datasets to generate reports that fulfil data burden requirements. That has the benefit of making the data burden fulfilment process more transparent, as well as contextualising both the way those reports are generated (through exposing the queries) and the original data sets used as a basis for creating reports.
***Should the data that government releases always be of high quality? How do we define quality? To what extent should public service providers “polish” the data they publish, if at all?
One rule of thumb is that the data should be “good enough”. The question then arises, ‘good enough for whom?’. If the data is released and never referred to, its quality is irrelevant as regards the non-existent on-users, although it may signal problems elsewhere. If data is used by a third party and found to contain errors or omissions, the question arises: does the publisher also suffer from those same quality issues (and if so, how are they handling them?); or are they using a different data set as part of the process that the released dataset relates to (and if so, why isn’t that data being released?)
There are different levels of cleanliness we may associate with data: a major issue in many datasets relates to the use of inconsistent labels to refer to the same entity (something that can be addressed by using universal persistent identifiers). Character set encodings can also cause problems, especially where it is hard to identify what character sets are used within a file.
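As a small illustration of the label problem, here is a sketch of mapping variant spellings onto a single identifier via a lookup table. The organisation names, the aliases, the amounts and the ORG-… identifiers are all hypothetical; a real scheme would use whatever persistent identifiers already exist for the entities concerned.

```r
# Invented spending records in which the same body appears under several labels
spend <- data.frame(
  body = c("Isle of Wight Council", "isle of wight council ",
           "IoW Council", "Southampton City Council"),
  amount = c(10, 5, 7, 20),
  stringsAsFactors = FALSE
)

# Hypothetical lookup from normalised label (or known alias) to identifier
ids <- c(
  "isle of wight council" = "ORG-001",
  "iow council" = "ORG-001",
  "southampton city council" = "ORG-002"
)

# Normalise case and surrounding whitespace before looking up the identifier;
# labels with no entry in the lookup come back as NA, flagging them for review.
key <- tolower(trimws(spend$body))
spend$id <- unname(ids[key])

# Aggregating on the identifier rather than the label gives one row per body
totals <- aggregate(amount ~ id, data = spend, FUN = sum)
print(totals)
```

Aggregating on the raw body column instead would report the Isle of Wight under three separate names, which is exactly the kind of error a third party reusing the data would otherwise have to discover and fix for themselves.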
***How should government approach the release of existing data for policy and research purposes: should this be held in a central portal or held on departmental portals?
As I understand the current situation, public body reports often produce summary tables and as part of transparency requirements, release as public data raw datasets that are used to generate those summary tables. In such cases, the query used to generate the summary table from the raw data should also be published. The transparency does not come from releasing summary tables and saying “it summarises that pile of data”. It comes from saying – here is the summary, and here is how it was generated from that data, allowing the observer to check the assumptions of the query, redo the analysis, and so on.
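A sketch of what publishing the query alongside the summary could look like in practice: the raw data, the function that derives the summary, and the summary itself are all available, so anyone can rerun the derivation or vary its assumptions. The dataset, column names, threshold and function name are invented for illustration.

```r
# Invented raw dataset of individual payments
raw <- data.frame(
  department = c("Health", "Health", "Transport", "Transport", "Transport"),
  amount = c(100, 250, 40, 60, 25)
)

# The published "query": total spend per department, counting only
# payments at or above a stated threshold.
summarise_spend <- function(data, min_amount = 0) {
  kept <- data[data$amount >= min_amount, ]
  aggregate(amount ~ department, data = kept, FUN = sum)
}

# The summary table as it might appear in a report (threshold of 30)
published_summary <- summarise_spend(raw, min_amount = 30)
print(published_summary)

# A reader questioning the threshold can redo the analysis without it
no_threshold <- summarise_spend(raw, min_amount = 0)
print(no_threshold)
```

With only the summary table and the raw file to go on, a reader would have to guess that small payments had been excluded; with the query published, the assumption is visible and its effect on the Transport total can be checked directly.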
Using services such as Google spreadsheets or Zoho spreadsheets, it is possible to provide a preview view over the data contained in a dataset made available as a simple CSV file (this approach is taken on some datastores). It is also possible to use services such as Google spreadsheets as a database, and so provide a certain level of intermediate developer access to the raw data as if read access were made available to the database that sourced the released data (eg http://blog.ouseful.info/2010/11/19/government-spending-data-explorer/ ). A range of powerful hosted statistical analysis and visualisation tools are now available that can also provide a user interface layer over data published in such environments (“analysis at the point of delivery”). For example, the popular R environment can provide an online statistical analysis UI to online hosted datasets via services such as http://www.stat.ucla.edu/~jeroen/ggplot2.html or http://www.rstudio.org/docs/server/getting_started . These tools provide an intermediary step that allows interested parties to explore datasets in situ. Recent developments with the Linked Data API ( http://www.epimorphics.com/web/tools/linked-data-api.html ) offer similar capabilities, including the ability to share persistent URLs to queries that are applied to public Linked Data stores such as those hosted under the data.gov.uk umbrella.
***Is there a role for government to stimulate innovation in the use of Open Data? If so, what is the best way to achieve this?
Allow free access to public data for personal, research, social enterprise and SME commercial research/development purposes. If the service using the data ever becomes popular, worry about how to charge for it then…