Archive for the ‘Data’ Category
Killer post title, eh?
Some time ago I put in an FOI request to the Isle of Wight Council for the transaction logs from a couple of ticket machines in the car park at Yarmouth. Since then, the Council made some unpopular decisions about car parking charges, got a recall and then in passing made the local BBC news (along with other councils) in respect of the extent of parking charge overpayments…
Here’s how hyperlocal news outlet OnTheWight reported the unfolding story…
- 11 new ways the council propose to make car parking more expensive
- Look again at parking and leisure centre charges, say Island Conservatives
- Increased car parking charges revealed
- Council could face legal action over car parking increases
- Council gives their view on the legal uses of car parking income
- Council claim they don’t yet know how many people wrote to them about parking changes
- Executive vote: Free parking in 24 car parks goes, including Appley and Puckpool and parking charges up
- Councillors ‘call-in’ decision on parking changes
- Date set for scrutiny of changes to parking charges
- Follow live coverage of parking changes being scrutinised (Updated) (includes a copy of the call-in notice)
- Isle of Wight car parkers overpaid £186,706.35 between 2011-13
I really missed a trick not getting involved in this process – because there is, or could be, a significant data element to it. And I had a sample of data that I could have doodled with, and then gone for the whole data set.
Anyway, I finally made a start on looking at the data I did have with a view to seeing what stories or insight we might be able to pull from it – the first sketch of my conversation with the data is here: A Conversation With Data – Car Parking Meter Data.
It’s not just the parking meter data that can be brought to bear in this case – there’s another set of relevant data too, and I also had a sample of that: traffic penalty charge notices (i.e. traffic warden ticket issuances…)
With a bit of luck, I’ll have a go at a quick conversation with that data over the next week or so… Then maybe put in a full set of FOI requests for data from all the Council operated ticket machines, and all the penalty notices issued, for a couple of financial years.
Several things I think might be interesting to look at Island-wide:
- in much the same was as Tube platforms suffer from loading problems, where folk surge around one entrance or another, do car parks “fill up” in some sort of order, eg within a car park (one meter lags the other in terms of tickets issued) or one car park lags another overall;
- do different car parks have a different balance of ticket types issued (are some used for long stay, others for short stay?) and does this change according to what day of the week it is?
- how does the issuance of traffic penalty charge notices compare with the sorts of parking meter tickets issued?
- from the timestamps of when traffic penalty charge notices tickets are issued, can we work out the rounds of different traffic warden patrols?
The last one might be a little bit cheeky – just like you aren’t supposed to share information about the mobile speed traps, perhaps you also aren’t supposed to share information that there’s a traffic warden doing the rounds…?!
I’m still hopeful of working up the idea of recreational data as a popular pastime activity with a regular column somewhere and a stocking filler book each Christmas (?!;-), but haven’t had much time to commit to working up some great examples lately:-(
However, here’s a neat idea – data golf – as described in a post by Bogumił Kamiński (RGolf) that I found via RBloggers:
There are many code golf sites, even some support R. However, most of them are algorithm oriented. A true RGolf competition should involve transforming a source data frame to some target format data frame.
So the challenge today will be to write a shortest code in R that performs a required data transformation
An example is then given of a data reshaping/transformation problem based on a real data task (wrangling survey data, converting it from a long to a wide format in the smallest amount of R.
Of course, R need not be the only language that can be used to play this game. For the course I’m currently writing, I think I’ll pitch data golf as a Python/pandas activity in the section on data shaping. OpenRefine also supports a certain number of reshaping transformations, so that’s another possible data golf course(?). As are spreadsheets. And so on…
Hmmm… thinks… pivot table golf?
In a comment based conversation with Anne-Marie Cunningham/@amcunningham last night, it seems I’d made a few errors in the post Demographically Classed, mistakenly attributing the use of HES data by actuaries in the Extending the Critical Path report to the SIAS when it should have been a working group of (I think?!) the Institute and Faculty of Actuaries (IFoA). I’d also messed up in assuming that the HES data was annotated with ACORN and MOSAIC data by the researchers, a mistaken assumption that begged the question as to how that linkage was actually done. Anne-Marie did the journalistic thing and called the press office (seems not many hacks did…) and discovered that “The researchers did not do any data linkage. This was all done by NHSIC. They did receive it fully coded. They only received 1st half of postcode and age group. There was no information on which hospitals people had attended.” Thanks, Anne-Marie:-)
Note – that last point could be interesting: it would suggest that in the analysis the health effects were decoupled from the facility where folk were treated?
Here are a few further quick notes adding to the previous post:
- the data that will be shared by GPs will be in coded form. An example of the coding scheme is provided in this post on the A Better NHS blog – Care dot data. The actual coding scheme can be found in this spreadsheet from the HSCIC: Code set – specification for the data to be extracted from GP electronic records and described in Care Episode Statistics: Technical Specification of the GP Extract. The tech spec uses the following diagram to explain the process (p19):
I’m intrigued as to what they man by the ‘non-relational database’…?
As far as the IFoA report goes, an annotated version of this diagram to show how the geodemographic data from Experian and CACI was added, and then how personally identifiable data was stripped before the dataset was handed over to the IFoA ,would have been a useful contribution to the methodology section. I think over the next year or two, folk are going to have to spend some time being clear about the methodology in terms of “transparency” around ethics, anonymisation, privacy etc, whilst the governance issues get clarified and baked into workaday processes and practice.
Getting a more detailed idea of what data will flow and how filters may actually work under various opt-out regimes around various data sharing pathways requires a little more detail. The Falkland Surgery in Newbury provides a useful summary of what data in general GP practices share, including care.data sharing. The site also usefully provides a map of the control-codes that preclude various sharing principles (As simple as I [original site publisher] can make it!):
Returning the to care episode statistics reporting structure, the architecture to support reuse is depicted on p21 of the tech spec as follows:
There also appear to be two main summary pages of resources relating to care data that may be worth exploring further as a starting point: Care.data and Technology, systems and data – Data and information. Further resources are available more generally on Information governance (NHS England).
As I mentioned in my previous post on this topic, I’m not so concerned about personal privacy/personal data leakage as I am about trying to think trough the possible consequences of making bulk data releases available that can be used as the basis for N=All/large scale data modelling (which can suffer from dangerous (non)sampling errors/bias when folk start to opt-out), the results of which are then used to influence the development of and then algorithmic implementation of, policy. This issue is touched on in by blogger and IT, Intellectual Property and Media Law lecturer at the University of East Anglia Law School, Paul Bernal, in his post Care.data and the community…:
The second factor here, and one that seems to be missed (either deliberately or through naïveté) is the number of other, less obvious and potentially far less desirable uses that this kind of data can be put to. Things like raising insurance premiums or health-care costs for those with particular conditions, as demonstrated by the most recent story, are potentially deeply damaging – but they are only the start of the possibilities. Health data can also be used to establish credit ratings, by potential employers, and other related areas – and without any transparency or hope of appeal, as such things may well be calculated by algorithm, with the algorithms protected as trade secrets, and the decisions made automatically. For some particularly vulnerable groups this could be absolutely critical – people with HIV, for example, who might face all kinds of discrimination. Or, to pick a seemingly less extreme and far more numerous group, people with mental health issues. Algorithms could be set up to find anyone with any kind of history of mental health issues – prescriptions for anti-depressants, for example – and filter them out of job applicants, seeing them as potential ‘trouble’. Discriminatory? Absolutely. Illegal? Absolutely. Impossible? Absolutely not – and the experience over recent years of the use of black-lists for people connected with union activity (see for example here) shows that unscrupulous employers might well not just use but encourage the kind of filtering that would ensure that anyone seen as ‘risky’ was avoided. In a climate where there are many more applicants than places for any job, discovering that you have been discriminated against is very, very hard.
This last part is a larger privacy issue – health data is just a part of the equation, and can be added to an already potent mix of data, from the self-profiling of social networks like Facebook to the behavioural targeting of the advertising industry to search-history analytics from Google. Why, then, does care.data matter, if all the rest of it is ‘out there’? Partly because it can confirm and enrich the data gathered in other ways – as the Telegraph story seems to confirm – and partly because it makes it easy for the profilers, and that’s something we really should avoid. They already have too much power over people – we should be reducing that power, not adding to it. [my emphasis]
There are many trivial reasons why large datasets can become biased (for example, see The Hidden Biases in Big Data), but there are also deeper reasons why wee need to start paying more attention to “big” data models and the algorithms that are derived from and applied to them (for example, It’s Not Privacy, and It’s Not Fair [Cynthia Dwork & Deirdre K. Mulligan] and Big Data, Predictive Algorithms and the Virtues of Transparency (Part One) [John Danaher]).
The combined HES’n’insurance report, and the care.data debacle provides an opportunity to start to discuss some of these issues around the use of data, the ways in which data can be combined, the undoubted rise in data brokerage services. So for example, a quick pop over to CCR Dataand they’ll do some data enhancement for you (“We have access to the most accurate and validated sources of information, ensuring the best results for you. There are a host of variables available which provide effective business intelligence [including] [t]elephone number appending, [d]ate of [b]irth, MOSAIC”), [e]nhance your database with email addresses using our email append data enrichment service or wealth profiling. Lovely…
So it seems that in a cost-recovered data release that was probably lawful then but possibly wouldn’t be now* – Hospital records of all NHS patients sold to insurers – the
Staple Inn Actuarial Society Critical Illness Definitions and Geographical Variations Working Party (of what, I’m not sure? The Institute and Faculty of Actuaries, perhaps?) got some Hospital Episode Statistics data from the precursor to the HSCIC, blended it with some geodemographic data**, and then came to the conclusion that “that the use of geodemographic profiling could refine Critical illness pricing bases” (source: Extending the Critical Path), presenting the report to the Staple Inn Actuarial Society who also headline branded the PDF version of the report? Maybe?
* House of Commons Health Committee, 25/2/14: 15.59:32 for a few minutes or so; that data release would not be approved now: 16.01:15 reiterated at 16.03:05 and 16.07:05
** or maybe they didn’t? Maybe the data came pre-blended, as @amcunningham suggests in the comments? I’ve added a couple of further questions into my comment reply… – UPDATE: “HES was linked to CACI and Experian data by the Information Centre using full postcode. The working party did not receive any identifiable data.”
CLARIFICATION ADDED (source )—-
“In a story published by the Daily Telegraph today research by the IFoA was represented as “NHS data sold to insurers”. This is not the case. The research referenced in this story considered critical illness in the UK and was presented to members of the Staple Inn Actuarial Society (SIAS) in December 2013 and was made publically available on our website.
“The IFoA is a not for profit professional body. The research paper – Extending the Critical Path – offered actuaries, working in critical illness pricing, information that would help them to ask the right questions of their own data. The aim of providing context in this way is to help improve the accuracy of pricing. Accurate pricing is considered fairer by many consumers and leads to better reserving by insurance companies.
There was also an event on 17 February 2014.
Via a tweet from @SIAScommittee, since deleted for some reason(?), this is clarified further: “SIAS did not produce the research/report.”
The branding that mislead me – I must not be so careless in future…
Many of the current agreements about possible invasions of privacy arising from the planned care.data release relate to the possible reidentification of individuals from their supposedly anonymised or pseudonymised health data (on my to read list: NHS England – Privacy Impact Assessment: care.data) but to my mind the
SIAS report presented to the SIAS suggests that we also need to think about consequences of the ways in which aggregated data is analysed and used (for example, in the construction of predictive models). Where aggregate and summarised data is used as the basis of algorithmic decision making, we need to be mindful that sampling errors, as well as other modelling assumptions, may lead to biases in the algorithms that result. Where algorithmic decisions are applied to people placed into statistical sampling “bins” or categories, errors in the assignment of individuals into a particular bin may result in decisions being made against them on an incorrect basis.
Rather than focussing always on the ‘can I personally be identified from the supposedly anonymised or pseudonymised data’, we also need to be mindful of the extent to, and ways in, which:
1) aggregate and summary data is used to produce models about the behaviour of particular groups;
2) individuals are assigned to groups;
3) attributes identified as a result of statistical modelling of groups are assigned to individuals who are (incorrectly) assigned to particular groups, for example on the basis of estimated geodemographic binning.
What worries me is not so much ‘can I be identified from the data’, but ‘are there data attributes about me that bin me in a particular way that statistical models developed around those bins are used to make decisions about me’. (Related to this are notions of algorithmic transparency – though in many cases I think this must surely go hand in hand with ‘binning transparency’!)
That said, for the personal-reidentification-privacy-lobbiests, they may want to pick up on the claim in the
SIASIFoA report (page 19) that:
In theory, there should be a one to one correspondence between individual patients and HESID. The HESID is derived using a matching algorithm mainly mapped to NHS number, but not all records contain an NHS number, especially in the early years, so full matching is not possible. In those cases HES use other patient identifiable fields (Date of Birth, Sex, Postcode, etc.) so imperfect matching may mean patients have more than one HESID. According to the NHS IC 83% of records had an NHS number in 2000/01 and this had grown to 97% by 2007/08, so the issue is clearly reducing. Indeed, our data contains 47.5m unique HESIDs which when compared to the English population of around 49m in 1997, and allowing for approximately 1m new lives a year due to births and inwards migration would suggest around 75% of people in England were admitted at least once during the 13 year period for which we have data. Our view is that this proportion seems a little high but we have been unable to verify that this proportion is reasonable against an independent source.
Given two or three data points, if this near 1-1 correspondence exists, you could possibly start guessing at matching HESIDs to individuals, or family units, quite quickly…
- ACORN (A Classification of Residential Neighbourhoods) is CACI’s geodemographic segmentation system of the UK population. We have used the 2010 version of ACORN which segments postcodes into 5 Categories, 17 Groups and 57 Types.
- Mosaic UK is Experian’s geodemographic segmentation system of the UK population. We have used the 2009 version of Mosaic UK which segments postcodes into 15 Groups and 67 Household Types.
The ACORN and MOSAIC data sets seem to provide data at the postcode level. I’m not sure how this was then combined with the HES data, but it seems the
SIASIFoA folk found a way (p 29) [or as Anne-Marie Cunningham suggests in the comments, maybe it wasn't combined by SIASIFoA - maybe it came that way?]:
The HES data records have been encoded with both an ACORN Type and a Mosaic UK Household Type. This enables hospital admissions to be split by ACORN and Mosaic Type. This covers the “claims” side of an incidence rate calculation. In order to determine the exposure, both CACI and Experian were able to provide us with the population of England, as at 2009 and 2010 respectively, split by gender, age band and profiler.
This then represents another area of concern – the extent to which even pseudonymised data can be combined with other data sets, for example based on geo-demographic data. So for example, how are the datasets actually combined, and what are the possible consequences of such combinations? Does the combination enrich the dataset in such a way that makes it easier for use to deanonymise either of the original datasets (if that is your primary concern); or does the combination occur in such a way that it may introduce systematic biases into models that are then produced by running summary statistics over groupings that are applied over the data, biases that may be unacknowedged (to possibly detrimental effect) when the models are used for predictive modelling, pricing models or as part of policy-making, for example?
Just by the by, I also wonder:
- what data was released lawfully under the old system that wouldn’t be allowed to be released now, and to whom, and for what purpose?
– are the people to whom that data was released allowed to continue using and processing that data?
– if they are allowed to continue using that data, under what conditions and for what purpose?
– if they are not, have they destroyed the data (16.05:44), for example by taking a sledgehammer to the computers the data was held on in the presences of NHS officers, or by whatever other means the state approves of?
Via a post on my colleague, and info law watchdog, Ray Corrigan’s blog – Alas medical confidentiality in the UK, we knew it well… – I note he has some concerns about the way in which the NHS data linkage service may be able to up its game as a result of the creation of the HSCIC – the Health and Social Care Information Centre – and it’s increasing access to data (including personal medical records?) held by GPs via the General Practice Extraction Service (GPES). (The HSCIC itself was established via legislation: Part 9 Chapter 2 of the Health and Social Care Act 2012. As I commented in The Business of Open Public Data Rolls On…, I think we need to keep a careful eye on (proposed) legislation that allows for “information of a prescribed description” to be made available to a “prescribed person” or “a person falling within a prescribed category”, where those prescriptions are left to the whim of the Minister responsible.) (Also via Ray, medConfidential has an interesting review of the HSCIC/GPES story so far.)
Something I hadn’t spotted before was the price list for data extraction and linkage services – just as interesting as the prices are the categories of service:
Here are the actual prices:
Complexity based on time to process-
3. A request is classed as ‘simple’ if specification, production and checking are expected to take less than 5 hours.
4. A request is classed as ‘medium’ if specification, production and checking are expected to take less than 7 hours but more than 5.
5. A request is classed as ‘complex’ if specification, production and checking are expected to take less than 12 hours but more than 7.
Doing a little search around the notion of “data linkage”, I stumbled across what looks to be quite a major data linkage initiative going on in Scotland – the Scotland Wide data linkage framework. There seems to have been a significant consultation exercise in 2012 prior to the establishment of this framework earlier this year: Data Linkage Framework Consultation [closed] [see for example the Consultation paper on “Aims and Guiding Principles” or the Technical Consultation on the Design of the Data Sharing and Linking Service [closed]]. Perhaps mindful of the fact that there may have been and may yet be public concerns around the notion of data linkage, an engagement exercise and report on Public Acceptability of Cross-Sectoral Data Linkage was also carried out (August 2012). A further round of engagement looks set so occur during November 2013.
I’m not sure what the current state of the framework, or its implementation, is (maybe this FOI request on Members and minutes of meetings of Data Linkage Framework Operations Group would give some pointers?) but one component of it at at least looks to be the Electronic Data Research and Innovation Service (eDRIS), a “one-stop shop for health informatics research”, apparently… Here’s the elevator pitch:
Some examples of collaborative work are also provided:
- Linking data from NHS24 and Scottish Ambulance Service with emergency admissions and deaths data to understand unscheduled care patient pathways.
– Working with NHS Lothian to provide linked health data to support EuroHOPE – European Healthcare Outcomes, Performance and Efficiency Project Epidemiology, disease burden and outcomes of diverticular disease in Scotland
– Infant feeding in Scotland: Exploring the factors that influence infant feeding choices (within Glasgow) and the potential health and economic benefits of breastfeeding on child health
This got me wondering about what sorts of data linkage project things like HSCIC or the MoJ data lab (as reviewed here) might get up to. Several examples seem to to provided by the ESRC Administrative Data Liaison Service (ADLS): Summary of administrative data linkage. (For more information about the ADLS, see the Administrative Data Taskforce report Improving Access for Research and Policy.)
The ADLS itself was created as part of a three phase funding programme by the ESRC, which is currently calling for second phase bids for Business and Local Government Data Research Centres. I wonder if offering data linkage services will play a role in their remit? If so, I wonder if they will offer services along the lines of the ADLS Trusted Third Party Service (TTPS), which “provides researchers and data holding organisations a mechanism to enable the combining and enhancing of data for research to which may not have otherwise been possible because of data privacy and security concerns”? Apparently,
The [ADLS TTPS] facility is housed within a secure room within the Centre for Census and Survey Research (CCSR) at the University of Manchester, and has been audited by the Office for National Statistics. The room is only used to carry out disclosure risk assessment work and other work that requires access to identifiable data.”
Another example of a secure environment for data analysis is provided by the the HMRC Datalab. One thing I notice about that facility is that they don’t appear to allow expect researchers to use R (the software list identifies STATA 9/10/11, SAS 9.3, Microsoft Excel, Microsoft Word, SPSS Clementine 8.1/9.0/10.1/11.1/12)?
Why’s this important? Because little L, little D, linked data can provide a much richer picture than distinct data sets…
PS see also mosaic theory…
PPS reminded by @wilm, here’s a “nice” example of data linkage from the New York Times… N.S.A. Gathers Data on Social Connections of U.S. Citizens.
PPPS and from the midata Innovation Lab, I notice this claim:
On the 4th of July 2013 we opened the midata Innovation Lab (mIL), on what we call “UK Consumer Independence Day”. So what is it? It’s the UK Government, leading UK companies and authoritative bodies collaborating on data services innovation and consumer protection for a data-driven future. We’ve put together the world’s fastest-built data innovation lab, creating the world’s most interesting and varied datasets, for the UK’s best brands and developers to work with.
The mIL is an accelerator for business to use a rich dataset to create new services for consumers. Designed in conjunction with innovative “Founding Partner” businesses, it also has oversight from authoritative bodies so we can create the world’s best consumer protection in the emerging personal data ecosystem.
The unique value of the lab is its ability to offer a unique dataset and consumer insight that it would be difficult for any one organization to collate. With expert input from authoritative consumer protection bodies, we can test and learn how to both empower and protect consumers in the emerging personal data ecosystem.
And this: “The personal data that we have asked for is focused on a few key areas: personal information including vehicle and property data, transactional banking and credit records, mobile, phone, broadband and TV billing information and utility bills.” It seems that data was collected from 50 individuals to start with.
Co-Director Network Data Files in GEXF and JSON from OpenCorporates Data via Scraperwiki and networkx
I’ve been tinkering with OpenCorporates data again, tidying up the co-director scraper described in Corporate Sprawl Sketch Trawls Using OpenCorporates (the new scraper is here: Scraperwiki: opencorporates trawler) and thinking a little about working with the data as a graph/network.
What I’ve been focussing on for now are networks that show connections between directors and companies, something we might imagine as follows:
In this network, the blue circles represent companies and the red circles directors. A line connecting a director with a company says that the director is a director of that company. So for example, company C1 has just one director, specifically D1; and director D2 is director of companies C2 and C3, along with director D3.
It’s easy enough to build up a graph like this from a list of “company-director” pairs (or “relations”). These can be described using a simple two column data format, such as you might find in a simple database table or CSV (comma separated value) text file, where each row defines a separate connection:
This is (sort of) how I’m representing data I’ve pulled from OpenCorporates, data that starts with a company seed, grabs the current directors of that target company, searches for other companies those people are directors of (using an undocumented OpenCorporates search feature – exact string matching on the director search (put the direction name in double quotes…;-), and then captures which of those companies share are least two directors with the original company.
In order to turn the data, which looks like this:
into a map that resembles something like this (this is actually a view over a different dataset):
we need to do a couple of things. Working backwards, these are:
- use some sort of tool to generate a pretty picture from the data;
- get the data out of the table into the tool using an appropriate exchange format.
Note that the positioning of the nodes in the visualisation may be handled in a couple of ways:
- either the visualisation tool uses a layout algorithm to work out the co-ordinates for each of the nodes; or
- the visualisation tool is passed a graph file that contains the co-ordinates saying where each node should be placed; the visualisation tool then simply lays out the graph using those provided co-ordinates.
The dataflow I’m working towards looks something like this:
networkx is a Python library (available on Scraperwiki) that makes it easy to build up representations of graphs simply by adding nodes and edges to a graph data structure. networkx can also publish data in a variety of handy exchange formats, including gexf (as used by Gephi and sigma.js), and a JSON graph representation (as used by d3.js and maybe sigma.js (example plugin?).
As a quick demo, I’ve built a scraperwiki view (opencorporates trawler gexf) that pulls on a directors_ table from my opencorporates trawler and then publishes the information either as gexf file (default) or as a JSON file using URLs of the form:
https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2 (gexf default)
This view can therefore be used to easily export data from my OpenCorporates trawler as a gexf file that can be used to easily import data into the Gephi desktop tool, or provide a URL to some JSON data that can be visualised using a Javscript library within a web page (I started doodling the mechanics of one example here: sigmajs test; better examples of what’s possible can be seen at Exploring Data and on the Oxford Internet Institute – Visualisations blog. If anyone would like to explore building a nice GUI to my OpenCorporates trawl data, feel free:-).
We can also use networks to publish data based on processing the network. The example graph above shows a netwrok with two sorts of nodes, connected by edges: company nodes and director nodes. This is a special sort of graph in that companies are only ever connected to directors, and directors are only ever connected to companies. That is, the nodes fall into one of two sorts – company or director – and they only ever connect “across” node type lines. If you look at this sort of graph (sometimes referred to as a bipartite or bimodal graph) for a while, you might be able to spot how you can fiddle with it (technical term;-) to get a different view over the data, such as those directors connected to other directors by virtue of being directors of the same company:
or those companies that are connected by virtue of sharing common directors:
(Note that the lines/edges can be “weighted” to show the number of connections relating two companies or directors (that is, the number of companies that two directors are connected by, or the number of directors that two companies are connected by). We can then visually depict this weight using line/edge thickness.)
The networkx library conveniently provides functions for generating such views over the data, which can also be accessed via my scraperwiki view:
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=companies gives the view over companies connected by virtue of the fact that they share one or more common director;
- https://views.scraperwiki.com/run/opencorporates_trawler_gexf/?key=compassFood2_2&reduce=officers&output=json gives a view over how directors are connected based on being co-directors of one or more of the same companies.
As the view is paramaterised via a URL, it can be used as a form of “glue logic” to bring data out of a directors table (which itself was populated by mining data from OpenCorporates in a particular way) and express it in a form that can be directly plugged in to a visualisation toolkit. Simples:-)
PS related: a templating system by Craig Russell for generating XML feeds from Google Spreadsheets – EasyOpenData.
A couple of weeks ago, I gave a presentation to the WebScience students at the University of Southampton on the topic of open data, using it as an opportunity to rehearse a view of open data based on the premise that it starts out closed. In much the same way that Darwin’s Theory of Evolution by Natural Selection is based on a major presupposition, specifically a theory of inheritance and the existence of processes that support reproduction with minor variation, so too does much of our thinking about open data derive from the presupposed fact that many of the freedoms we associate with the use of open data in legal terms arise from license conditions that the “owner” of the data awards to us.
Viewing data in this light, we might start by considering what constitutes “closed” data and how it comes to be so, before identifying the means by which freedoms are granted and the data is opened up. (Sometimes it can also be easier to consider what you can’t do than what you can, especially when answers to questions such as “so what can you actually do with open data?” attract the (rather meaningless) response: “anything”. We can then contrast what you can do in terms of freedom complementary to what you can’t…)
So how can data be “closed”?
One lens I particularly like for considering constraints that are placed on actions and actors, particularly in the digital world (although we can apply the model elsewhere) I first saw described by Lawrence Lessig in Code and Other Laws of Cyberspace: What Things Regulate: A Dot’s Life.
Here’s the dot and the forces that constrain its behaviour:
So we see, for example, the force of law, social norms, the market (that is, economic forces) and architecture, that is the “digital physical” way the world is implemented. (Architecture may of course be designed in order to enforce particular laws, but it is likely that other “natural laws” will arise as a result of any particular architecture or system implementation.)
Without too much thought, we might identify some constraints around data and its use under each of these separate lenses. For example:
- Law: copyright and database right grant the creator of a dataset certain protective rights over that data; data protection laws (and other “privacy laws”) limit access to, or disclosure of, data that contains personal information, as well as restricting the use of that data for purposes disclosed at the time it was collected. The UK Data Protection Act also underwrites the right of individuals to claim additional limits on data use, for example the rights “to object to processing that is likely to cause or is causing damage or distress to prevent processing for direct marketing; to object to decisions being taken by automated means” (ICO Guide to the DPA, Principle 6 – The rights of individuals).
- Norms: social mores, behaviour and taboos limit the ways in which we might use data, even if that use is not constrained by legal, economic or technical concerns. For example, applications that invite people to “burgle my house” based on analysing social network data to discover when they are likely to be away from home and what sorts of valuable product might be on the premises are generally not welcomed. Norms of behaviour and everyday workpractice also mean that much data is not published when theere are no real reasons why it couldn’t be.
- Market: in the simplest case, charging for access to data places a constraint on who can gain access to the data even in advance of trying to make use of it. If we extend “market” to cover other financial constraints, there may be a cost associated with preparing data so that it can be openly released.
- Architecture: technical constraints can restrict what you can do with data. Digital rights management (DRM) uses encryption to render data streams unusable to all but the intended client, but more prosaically, document formats such as PDF or the “release” of data charts are flat image files makes it difficult for the end user to manipulate as data any data resources contained in those documents.
Laws can also be used to grant freedoms where freedoms are otherwise restricted. For example:
- the Freedom of Information Act (FOI) provides a mechanism for requesting copies of datasets from public bodies; in addition, the Environmental Information Regulations “provide public access to environmental information held by public authorities”.
- the laws around copyright relax certain copyright constraints for the purposes of criticism and review, reporting, research, teaching (IPO – Permitted uses of copyright works);
- in the UK, the Data Protection Act provides for “a right of access to a copy of the information comprised in their personal data” (ICO Guide to the DPA, Principle 6).
- in the UK, the Data Protection Act regulates what can be done legitimately with “personal” data. However, other pieces of legislation relax confidentiality requirements when it comes to sharing data for research purposes. For example:
- the NHS Act s. 251 Control of patient information; for example, the Secretary of State for Health may “make regulations to set aside the common law duty of confidentiality for medical purposes where it is not possible to use anonymised information and where seeking individual consent is not practicable” (discussion). Note that they are changes afoot regarding s. 251…
- The Secretary of State for Education has specific powers to share pupil data from the National Pupil database (NPD) “with named bodies and third parties who require access to the data to undertake research into the educational achievements of pupils”. The NPD “tracks a pupil’s progress through schools and colleges in the state sector, using pupil census and exam information. Individual pupil level attainment data is also included (where available) for pupils in non-maintained and independent schools” (access arrangements).
- the Enterprise and Regulatory Reform Bill currently making its way through Parliament legislates around the Supply of Customer Data (the “#midata” clauses) which is intended to open up access to customer transaction data from suppliers of energy, financial services and mobile phones “(a) to a customer, at the customer’s request; (b) to a person who is authorised by a customer to receive the data, at the customer’s request or, if the regulations so provide, at the authorised person’s request.” Although proclaimed as a way of opening up individual rights to access this data, the effect will more likely see third parties enticing individuals to authorise the release to the third party of the individual first party’s personal transaction data held by a second party (for example, #Midata Is Intended to Benefit Whom, Exactly?). (So you’ll presumably legally be able to grant Facebook access to your mobile phone records… Or Facebook will find a way of getting you to release that data to them without you realising you granted them that permission;-)
Contracts (which I guess fall somewhere between norms and laws from the dot’s perspective (I need to read that section of Lessig’s book again!) can also be used by rights holders to grant freedoms over the data they hold the rights for. For example, the Creative Commons licensing framework provides a copyright holder with a set of tools for relaxing some of the rights afforded to them by copyright when they license the work accordingly.
Note that “I am not a lawyer”, so my understanding of all this is pretty hazy;-) I also wonder how the various pieces of legislation interact, and whether there are cracks and possible inconsistencies between them? If there are pieces of legislation around the regulation and use of data that I’m missing, please post links in the comments below, and I’ll try and do a more thorough round up in a follow on post.
A handful of open Linked Data have appeared through my feeds in the last couple of days, including (via RBloggers) SPARQL with R in less than 5 minutes, which shows how to query US data.gov Linked Data and then Leigh Dodds’ Brief Review of the Land Registry Linked Data.
I was going to post a couple of of examples merging those two posts – showing how to access Land Registry data via Leigh’s example queries in R, then plotting some of the results using ggplot2, but another post of Leigh’s today – SPARQL-doc – a simple convention for documenting individual SPARQL queries, has sparked another thought…
For some time I’ve been intrigued by the idea of a marketplace in queries over public datasets, as well as the public sharing of generally useful queries. A good query is like a good gold pan, or a good interview question – it can get a dataset to reveal something valuable that may otherwise have laid hidden. Coming up with a good query in part requires having a good understanding of the structure of a dataset, in part having an eye for what sorts of secret the data may contain: the next step is crafting a well phrased query that can tease that secret out. Creating the query might take some time, some effort, and some degree of expertise in query optimisation to make it actually runnable in reasonable time (which is why I figure there may be a market for such things*) but once written, the query is there. And if it can be appropriately parameterised, it may generalise.
(*There are actually a couple of models I can think of: 1) I keep the query secret, but run it and give you the results; 2) I license the “query source code” to you and let you run it yourself. Hmm, I wonder: do folk license queries they share? How, and to what extent, might derived queries/query modifications be accommodated in such a licensing scheme?)
Pondering Leigh’s SPARQL-doc post, another post via R-bloggers, Building a package in RStudio is actually very easy (which describes how to package a set of R files for distribution via github), asdfree (analyze survey data for free), a site that “announces obsessively-detailed instructions to analyze us government survey data with free tools” (and which includes R bundles to get you started quickly…), the resource listing Documentation for package ‘datasets’ version 2.15.2 that describes a bundled package of datasets for R and the Linked Data API, which sought to provide a simple RESTful API over SPARQL endpoints, I wondered the following:
How about developing and sharing commented query libraries around Linked Data endpoints that could be used in arbitrary Linked Data clients?
(By “Linked Data clients”, I mean different user agent contexts. So for example, calling a query from Python, or R, or Google Spreadsheets.) That’s it… Simple.
One approach (the simplest?) might be to put each separate query into a separate file, with a filename that could be used to spawn a function name that could be used to call that query. Putting all the queries into a directory and zipping them up would provide a minimal packaging format. An additional manifest file might minimally document the filename along with the parameters that can be passed into and returned from the query. Helper libraries in arbitrary languages would open the query package and “compile” a programme library/set of “API” calling functions for that language (so for example, in R it would create a set of R functions, in Python a set of Python functions).
(This reminds me of a Twitter exchange with Nick Jackson/@jacksonj04 a couple of days ago around “self-assembling” API programme libraries that could be compiled in an arbitrary language from a JSON API, cf. Swagger (presentation), which I haven’t had time to look at yet.)
The idea, then is this:
- Define a simple file format for declaring documented SPARQL queries
- Define a simple packaging format for bundling separate SPARQL queries
- The simply packaged set of queries define a simple “raw query” API over a Linked Data dataset
- Describe a simple protocol for creating programming language specific library wrappers around API from the query bundle package.
So.. I guess two questions arise: 1) would this be useful? 2) how hard could it be?
[See also: @ldodds again, on Publishing SPARQL queries and-documentation using github]
Following the official opening of the Open Data Institute (ODI) last week, a flurry of data related announcements this week:
- A big one for stats fans with the release of 2011 Census data by the ONS: 2011 Census, Key Statistics for Local Authorities in England and Wales. A few charts appear to have made it into the mix (along with the data to generate them), which I guess sets the baseline for whoever lands the currently advertised Head of Rich Content at the ONS job…
The data files associated with press releases are published as Excel spreadsheets. I guess this reflects, in part, the need to come up with a container that can cope with all the metadata. It’s a bit of a pain, though. One thing I keep meaning to explore further are ways of bundling data in R packages, along with scripts for analysing and visualising the data so bundled (eg US Census Spatial and Demographic Data in R: The UScensus2000 Suite of Packages or US consumer expenditure survey (ce) in R). I probably should also look again at Google’s Dataset Publication Language (DSPL) as well as other packaging formats. I need to check out the latest major release from the W3C Provenance Working Group too…
- Over at BIS, £8 million of investment in open public data is announced, the major chunk of which goes to the Data Strategy Board (#datastrategy) Breakthrough Fund to help public bodies get over short term technical barriers to releasing open public data. I keep wittering on about mapping out data flows that already exist and then finding ways to tap into them directly, so won’t repeat that here;-) A smaller pot, administered by the ODI, will be available to SMEs via the Open Data Immersion Programme. Also announced, the Ordnance Survey will be widening the availability of its range of mapping data.
- Not sure if I missed this when it was presumably announced? The Data Strategy Board’s chair Stephan Shakespeare (CEO of YouGov Plc) is leading an independent review of public sector information (here are the (draft) terms of reference). I’m not sure how this review fits into the reports to the tangle of reporting lines associated with the Data Strategy Board and the Public Data Group (the latter seems to have been very quiet?). I also wonder where the ODI fits into that whole structure?
- The funding around public open data coincided with a written Ministerial statement form the Cabinet Office that provided an Update on Departmental Open Data Commitments and adherence to Public Data Principles (>original link on a gov.uk domain, h/t @owenboswarva). The update is spectacularly lacking in linking to any of the raw data that is summarised in the actual statement, so so much for any actual transparency there… The same minister, Francis Maude, has also been fulfilling his social media obligations with a piece in the Huffington Post on A Practical Vision for Open Government. (In other news, at the micro/pragmatic level of open public data, I’m still finding that week on week releases of NHS sitrep data show minor differences in formatting and occasional errors…)
Things have been moving on the Communications Data front too. Communications Data got a look in as part of the 2011/2012 Security and Intelligence Committee Annual Report with a review of what’s currently possible and “why change may be necessary”. Apparently:
118. The changes in the telecommunications industry, and the methods being used by people to communicate, have resulted in the erosion of the ability of the police and Agencies to access the information they require to conduct their investigations. Historically, prior to the introduction of mobile telephones, the police and Agencies could access (via CSPs, when appropriately authorised) the communications data they required, which was carried exclusively across the fixed-line telephone network. With the move to mobile and now internet-based telephony, this access has declined: the Home Office has estimated that, at present, the police and Agencies can access only 75% of the communications data that they would wish, and it is predicted that this will significantly decline over the next few years if no action is taken. Clearly, this is of concern to the police and intelligence and security Agencies as it could significantly impact their ability to investigate the most serious of criminal offences.
N. The transition to internet-based communication, and the emergence of social networking and instant messaging, have transformed the way people communicate. The current legislative framework – which already allows the police and intelligence and security Agencies to access this material under tightly defined circumstances – does not cover these new forms of communication. [original emphasis]
Elsewhere in Parliament, the Joint Select Committee Report on the Draft Communications Data Bill was published and took a critical tone (Home Secretary should not be given carte blanche to order retention of any type of data under draft communications data bill, says joint committee. “There needs to be some substantial re-writing of the Bill before it is brought before Parliament” adds Lord Blencathra, Chair of the Joint Committee.) Friend and colleague Ray Corrigan links to some of the press reviews of the report here: Joint Committee declare CDB unworkable.
In other news, Prime Minister David Cameron’s announcement of DNA tests to revolutionise fight against cancer and help 100,000 patients was reported via a technology angle – Everybody’s DNA could be on genetic map in ‘very near future’ [Daily Telegraph] – as well as by means of more reactionary headlines: Plans for NHS database of patients’ DNA angers privacy campaigners [Guardian], Privacy fears over DNA database for up to 100,000 patients [Daily Telegraph].
If DNA is your thing, don’t forget that the Home Office already operates a National DNA Database for law enforcement purposes.
And if national databases are your thing, there always the National Pupil Database which was in the news recently with the launch of a consultation on proposed amendments to individual pupil information prescribed persons regulations which seeks to “maximise the value of this rich dataset” by widening access to this data. (Again, Ray provides some context and commentary: Mr Gove touting access to National Pupil Database.)
PS A late inclusion: DECC announcement around smart meter rollout with some potential links to #midata strategy (eg “suppliers will not be able to use energy consumption data for marketing purposes unless they have explicit consent”). A whole raft of consultations were held around smart metering and Govenerment responses are also published today, including Government Response on Data Access and Privacy Framework, the Smart Metering Privacy Impact Assessment and a report on public attitudes research around smart metering. I also spotted an earlier consultation that had passed me by around the Data and Communications Company (DCC) License Conditions; here the response, which opens with: “The communications and data transfer and management required to support smart metering is to be organised by a new central communications body – the Data and Communications Company (“the DCC”). The DCC will be a new licensed entity regulated by the Gas and Electricity Markets Authority (otherwise referred to as “the Authority”, or “Ofgem”). A single organisation will be granted a licence under each of the Electricity and Gas Acts (there will be two licences in a single document, referred to as the “DCC Licence”) to provide these services within the domestic sector throughout Great Britain”. Another one to put on the reading pile…
Putting a big brother watch hat on, the notion of “meter surveillance” brings to mind BBC article about an upcoming (will hopefully thence be persistently available on iPlayer?) radio programme on “Electric Network Frequency (ENF) analysis”, The hum that helps to fight crime. According to Wikipedia, ENF is a forensic science technique for validating audio recordings by comparing frequency changes in background mains hum in the recording with long-term high-precision historical records of mains frequency changes from a database. In turn, this reminds me of appliance signature detection (identifying what appliance is switched on or off from its electrical load curve signature), for example Leveraging smart meter data to recognize home appliances. In context of audio surveillance, how about supplementing surveillance video cameras with microphones? Public Buses Across Country [US] Quietly Adding Microphones to Record Passenger Conversations.
The launch or official opening or whatever it was of the Open Data Institute this week provided another chance to grab a snapshot of notable folk in the community, as for example demonstrated by people commonly followed by users of the #ODIlaunch hashtag on Twitter. The PR campaign also resulted in the appearance of some open data related use cases, such as a report in the Economist about an analysis by MastodonC and Prescribing Analytics mapping prescription charges (R code available), with a view to highlighting where prescriptions for branded, as opposed to the recommended generic, drugs are being issued at wasteful expense to the NHS. (See Exploring GP Practice Level Prescribing Data for some of my entry level doodlings with prescription data.)
Quite by chance, I’ve been looking at some other health data recently, (Quick Shiny Demo – Exploring NHS Winter Sit Rep Data), which has been a real bundle of laughs. Looking at a range of health related datasets, data seems to be published at a variety of aggregation levels – individual practices and hospitals, Primary Care Trusts (PCTs), Strategic Health Authorities (SHAs) and the new Clinical Commissioning Groups (CCGs). Some of these map on to geographical regions, that can then be coloured according to a particular measure value associated with that area.
I’ve previously experimented with rendering shapefiles and choropleth maps (Amateur Mapmaking: Getting Started With Shapefiles) so I know R provides one possible environment for generating these maps, so I thought I’d try to pull together a recipe or two for supporting the creation of thematic maps based on health related geographical regions.
A quick trawl for PCT shapefiles turned up nothing useful. @jenit suggested @mastodonc, and @paulbradshaw pointed me to a dataset on Google Fusion Tables, discovered through the Fusion Tables search engine, that included PCT geometry data. So no shapefiles, but there is exportable KML data from Fusion Tables.
At this point I should have followed Paul Bradshaw’s advice, and just uploaded my own data (I was going to start out with mapping per capita uptake of dental services by PCT) to Fusion Tables, merging with the other data set, and generating my thematic maps that way.
But that wasn’t quite the point, which was actually an exercise in pulling together an R based recipe for generating these maps…
Anyway, I’ve made a start, and here’s the code I have to date:
##Example KML: https://dl.dropbox.com/u/1156404/nhs_pct.kml ##Example data: https://dl.dropbox.com/u/1156404/nhs_dent_stat_pct.csv install.packages("rgdal") library(rgdal) library(ggplot2) #The KML data downloaded from Google Fusion Tables fn='nhs_pct.kml' #Look up the list of layers ogrListLayers(fn) #The KML file was originally grabbed from Google Fusion Tables #There's only one layer...but we still need to identify it kml=readOGR(fn,layer='Fusiontables folder') #This seems to work for plotting boundaries: plot(kml) #And this: kk=fortify(kml) ggplot(kk, aes(x=long, y=lat,group=group))+ geom_polygon() #Add some data into the mix #I had to grab a specific sheet from the original spreadsheet and then tidy the data little... nhs <- read.csv("nhs_dent_stat_pct.csv") kml@data=merge(kml@data,nhs,by.x='Name',by.y='PCT.ONS.CODE') #I think I can plot against this data using plot()? plot(kml,col=gray(kml@data$A.30.Sep.2012/100)) #But is that actually doing what I think it's doing?! #And if so, how can experiment using other colour palettes? #But the real question is: HOW DO I DO COLOUR PLOTS USING gggplot? ggplot(kk, aes(x=long, y=lat,group=group)) #+ ????
Here’s what an example of the raw plot looks like:
And the greyscale plot, using one of the dental services uptake columns:
Here’s the base ggplot() view:
However, I don’t know how to actually now plot the data into the different areas? (Oh – might this help? CRAN Task View: Analysis of Spatial Data.)
If you know how to do the colouring, or ggplotting, please leave a comment, or alternatively, chip in an answer to a related question I posted on StackOverflow: Plotting Thematic Maps from KML Data Using ggplot2
PS The recent Chief Medical Officer’s Report makes widespread use of a whole range of graphical devices and charts, including cartograms:
Is there R support for cartograms yet, I wonder?! (Hmmm… maybe?)
PPS on the public facing national statistics front, I spotted this job ad yesterday – Head of Rich Content Development, ONS:
The postholder is responsible for inspiring and leading development of innovative rich content outputs for the ONS website and other channels, which anticipate and meet user needs and expectations, including those of the Citizen User. The role holder has an important part to play in helping ONS to realise its vision “for official statistics to achieve greater impact on key decisions affecting the UK and to encourage broader use across the country”.
1.Inspires, builds, leads and develops a multi-disciplinary team of designers, developers, data analysts and communications experts to produce innovative new outputs for the ONS website and other channels.
2. Keeps abreast of emerging trends and identifies new opportunities for the use of rich web content with ONS outputs.
3. Identifies new opportunities, proposes new directions and developments and gains buy in and commitment to these from Senior Executives and colleagues in other ONS business areas.
4. Works closely with business areas to identify, assess and commission new rich-content projects.
5. Provides, vision, guidance and editorial approval for new projects based on a continual understanding of user needs and expectations.
6. Develops and manages an ongoing portfolio of innovative content, maximising impact and value for money.
7. Builds effective partnerships with media to increase outreach and engagement with ONS content.
8. Establishes best practice in creation of rich content for the web and other channels, and works to improve practice and capability throughout ONS.