In a comment based conversation with Anne-Marie Cunningham/@amcunningham last night, it seems I’d made a few errors in the post Demographically Classed, mistakenly attributing the use of HES data by actuaries in the Extending the Critical Path report to the SIAS when it should have been a working group of (I think?!) the Institute and Faculty of Actuaries (IFoA). I’d also messed up in assuming that the HES data was annotated with ACORN and MOSAIC data by the researchers, a mistaken assumption that begged the question as to how that linkage was actually done. Anne-Marie did the journalistic thing and called the press office (seems not many hacks did…) and discovered that “The researchers did not do any data linkage. This was all done by NHSIC. They did receive it fully coded. They only received 1st half of postcode and age group. There was no information on which hospitals people had attended.” Thanks, Anne-Marie:-)
Note – that last point could be interesting: it would suggest that in the analysis the health effects were decoupled from the facility where folk were treated?
Here are a few further quick notes adding to the previous post:
– the data that will be shared by GPs will be in coded form. An example of the coding scheme is provided in this post on the A Better NHS blog – Care dot data. The actual coding scheme can be found in this spreadsheet from the HSCIC: Code set – specification for the data to be extracted from GP electronic records and described in Care Episode Statistics: Technical Specification of the GP Extract. The tech spec uses the following diagram to explain the process (p19):
I’m intrigued as to what they man by the ‘non-relational database’…?
As far as the IFoA report goes, an annotated version of this diagram to show how the geodemographic data from Experian and CACI was added, and then how personally identifiable data was stripped before the dataset was handed over to the IFoA ,would have been a useful contribution to the methodology section. I think over the next year or two, folk are going to have to spend some time being clear about the methodology in terms of “transparency” around ethics, anonymisation, privacy etc, whilst the governance issues get clarified and baked into workaday processes and practice.
Getting a more detailed idea of what data will flow and how filters may actually work under various opt-out regimes around various data sharing pathways requires a little more detail. The Falkland Surgery in Newbury provides a useful summary of what data in general GP practices share, including care.data sharing. The site also usefully provides a map of the control-codes that preclude various sharing principles (As simple as I [original site publisher] can make it!):
Returning the to care episode statistics reporting structure, the architecture to support reuse is depicted on p21 of the tech spec as follows:
There also appear to be two main summary pages of resources relating to care data that may be worth exploring further as a starting point: Care.data and Technology, systems and data – Data and information. Further resources are available more generally on Information governance (NHS England).
As I mentioned in my previous post on this topic, I’m not so concerned about personal privacy/personal data leakage as I am about trying to think trough the possible consequences of making bulk data releases available that can be used as the basis for N=All/large scale data modelling (which can suffer from dangerous (non)sampling errors/bias when folk start to opt-out), the results of which are then used to influence the development of and then algorithmic implementation of, policy. This issue is touched on in by blogger and IT, Intellectual Property and Media Law lecturer at the University of East Anglia Law School, Paul Bernal, in his post Care.data and the community…:
The second factor here, and one that seems to be missed (either deliberately or through naïveté) is the number of other, less obvious and potentially far less desirable uses that this kind of data can be put to. Things like raising insurance premiums or health-care costs for those with particular conditions, as demonstrated by the most recent story, are potentially deeply damaging – but they are only the start of the possibilities. Health data can also be used to establish credit ratings, by potential employers, and other related areas – and without any transparency or hope of appeal, as such things may well be calculated by algorithm, with the algorithms protected as trade secrets, and the decisions made automatically. For some particularly vulnerable groups this could be absolutely critical – people with HIV, for example, who might face all kinds of discrimination. Or, to pick a seemingly less extreme and far more numerous group, people with mental health issues. Algorithms could be set up to find anyone with any kind of history of mental health issues – prescriptions for anti-depressants, for example – and filter them out of job applicants, seeing them as potential ‘trouble’. Discriminatory? Absolutely. Illegal? Absolutely. Impossible? Absolutely not – and the experience over recent years of the use of black-lists for people connected with union activity (see for example here) shows that unscrupulous employers might well not just use but encourage the kind of filtering that would ensure that anyone seen as ‘risky’ was avoided. In a climate where there are many more applicants than places for any job, discovering that you have been discriminated against is very, very hard.
This last part is a larger privacy issue – health data is just a part of the equation, and can be added to an already potent mix of data, from the self-profiling of social networks like Facebook to the behavioural targeting of the advertising industry to search-history analytics from Google. Why, then, does care.data matter, if all the rest of it is ‘out there’? Partly because it can confirm and enrich the data gathered in other ways – as the Telegraph story seems to confirm – and partly because it makes it easy for the profilers, and that’s something we really should avoid. They already have too much power over people – we should be reducing that power, not adding to it. [my emphasis]
There are many trivial reasons why large datasets can become biased (for example, see The Hidden Biases in Big Data), but there are also deeper reasons why wee need to start paying more attention to “big” data models and the algorithms that are derived from and applied to them (for example, It’s Not Privacy, and It’s Not Fair [Cynthia Dwork & Deirdre K. Mulligan] and Big Data, Predictive Algorithms and the Virtues of Transparency (Part One) [John Danaher]).
The combined HES’n’insurance report, and the care.data debacle provides an opportunity to start to discuss some of these issues around the use of data, the ways in which data can be combined, the undoubted rise in data brokerage services. So for example, a quick pop over to CCR Dataand they’ll do some data enhancement for you (“We have access to the most accurate and validated sources of information, ensuring the best results for you. There are a host of variables available which provide effective business intelligence [including] [t]elephone number appending, [d]ate of [b]irth, MOSAIC”), [e]nhance your database with email addresses using our email append data enrichment service or wealth profiling. Lovely…