Demographically Classed

So it seems that in a cost-recovered data release that was probably lawful then but possibly wouldn’t be now* – Hospital records of all NHS patients sold to insurers – the Staple Inn Actuarial Society Critical Illness Definitions and Geographical Variations Working Party (of what, I’m not sure? The Institute and Faculty of Actuaries, perhaps?) got some Hospital Episode Statistics data from the precursor to the HSCIC, blended it with some geodemographic data**, and then came to the conclusion “that the use of geodemographic profiling could refine Critical illness pricing bases” (source: Extending the Critical Path), presenting the report to the Staple Inn Actuarial Society who also headline branded the PDF version of the report? Maybe?

* House of Commons Health Committee, 25/2/14: 15.59:32 for a few minutes or so; that data release would not be approved now: 16.01:15 reiterated at 16.03:05 and 16.07:05

** or maybe they didn’t? Maybe the data came pre-blended, as @amcunningham suggests in the comments? I’ve added a couple of further questions into my comment reply… – UPDATE: “HES was linked to CACI and Experian data by the Information Centre using full postcode. The working party did not receive any identifiable data.”

CLARIFICATION ADDED (source )—-

“In a story published by the Daily Telegraph today research by the IFoA was represented as “NHS data sold to insurers”. This is not the case. The research referenced in this story considered critical illness in the UK and was presented to members of the Staple Inn Actuarial Society (SIAS) in December 2013 and was made publically available on our website.

“The IFoA is a not for profit professional body. The research paper – Extending the Critical Path – offered actuaries, working in critical illness pricing, information that would help them to ask the right questions of their own data. The aim of providing context in this way is to help improve the accuracy of pricing. Accurate pricing is considered fairer by many consumers and leads to better reserving by insurance companies.

There was also an event on 17 February 2014.

Via a tweet from @SIAScommittee, since deleted for some reason(?), this is clarified further: “SIAS did not produce the research/report.”

[Image: rebuttal2]

The branding that misled me – I must not be so careless in future…

[Image: misleadingBranding]

——
Many of the current arguments about possible invasions of privacy arising from the planned care.data release relate to the possible reidentification of individuals from their supposedly anonymised or pseudonymised health data (on my to read list: NHS England – Privacy Impact Assessment: care.data) but to my mind the IFoA report presented to the SIAS suggests that we also need to think about the consequences of the ways in which aggregated data is analysed and used (for example, in the construction of predictive models). Where aggregate and summarised data is used as the basis of algorithmic decision making, we need to be mindful that sampling errors, as well as other modelling assumptions, may lead to biases in the algorithms that result. Where algorithmic decisions are applied to people placed into statistical sampling “bins” or categories, errors in the assignment of individuals into a particular bin may result in decisions being made against them on an incorrect basis.

Rather than focussing always on the ‘can I personally be identified from the supposedly anonymised or pseudonymised data’, we also need to be mindful of the extent to, and ways in, which:

1) aggregate and summary data is used to produce models about the behaviour of particular groups;
2) individuals are assigned to groups;
3) attributes identified as a result of statistical modelling of groups are assigned to individuals who are (incorrectly) assigned to particular groups, for example on the basis of estimated geodemographic binning.
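
To make the three points above a little more concrete, here is a minimal sketch in Python. Everything in it is invented for illustration: the districts, the bin labels, the per-bin rates, the `true_bin` field and the `assign_bin` helper are hypothetical and are not taken from the report or from ACORN/MOSAIC.

```python
# Hypothetical illustration only: bins, rates and people are invented.

# 1) A "model" built from aggregate data: one estimated rate per bin.
bin_rate = {"bin_A": 0.02, "bin_B": 0.05}

# 2) Individuals are assigned to bins via a coarse attribute (here a
#    placeholder district code), not via anything known about them personally.
district_to_bin = {"D1": "bin_A", "D2": "bin_B"}

people = [
    {"name": "p1", "district": "D1", "true_bin": "bin_A"},
    {"name": "p2", "district": "D2", "true_bin": "bin_B"},
    # p3 lives in a district mapped to bin_B but "really" resembles bin_A.
    {"name": "p3", "district": "D2", "true_bin": "bin_A"},
]


def assign_bin(person):
    """Assign a person to a bin using only their district."""
    return district_to_bin[person["district"]]


# 3) The bin's modelled attribute is then applied to each individual.
decisions = {}
for person in people:
    assigned = assign_bin(person)
    decisions[person["name"]] = {
        "assigned_bin": assigned,
        "rate_used": bin_rate[assigned],
        "misassigned": assigned != person["true_bin"],
    }

misassigned = [name for name, d in decisions.items() if d["misassigned"]]
print(misassigned)
```

In this toy case the person who is binned by district alone ends up with the rate modelled for a group they do not resemble, without ever having been personally identified.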

What worries me is not so much ‘can I be identified from the data’, but ‘are there data attributes about me that bin me in a particular way, such that statistical models developed around those bins are used to make decisions about me’. (Related to this are notions of algorithmic transparency – though in many cases I think this must surely go hand in hand with ‘binning transparency’!)

That said, the personal-reidentification-privacy lobbyists may want to pick up on the claim in the IFoA report (page 19) that:

In theory, there should be a one to one correspondence between individual patients and HESID. The HESID is derived using a matching algorithm mainly mapped to NHS number, but not all records contain an NHS number, especially in the early years, so full matching is not possible. In those cases HES use other patient identifiable fields (Date of Birth, Sex, Postcode, etc.) so imperfect matching may mean patients have more than one HESID. According to the NHS IC 83% of records had an NHS number in 2000/01 and this had grown to 97% by 2007/08, so the issue is clearly reducing. Indeed, our data contains 47.5m unique HESIDs which when compared to the English population of around 49m in 1997, and allowing for approximately 1m new lives a year due to births and inwards migration would suggest around 75% of people in England were admitted at least once during the 13 year period for which we have data. Our view is that this proportion seems a little high but we have been unable to verify that this proportion is reasonable against an independent source.
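
As a rough check on the arithmetic in that passage, here is a sketch that uses only the figures quoted above. The one assumption is that the 13 years of roughly 1m new lives a year are simply added to the 1997 population to give the number of people who could have been admitted at some point.

```python
# Figures as quoted in the report extract above (millions of people).
unique_hesids = 47.5
population_1997 = 49.0
new_lives_per_year = 1.0
years_of_data = 13

# Assumption: everyone alive in 1997, plus all new lives since, counts as
# someone who could have been admitted at least once in the period.
ever_in_population = population_1997 + new_lives_per_year * years_of_data
proportion_admitted = unique_hesids / ever_in_population

print(f"{ever_in_population:.0f}m people, {proportion_admitted:.0%} with a HESID")
```

This gives a figure of roughly three quarters, in line with the report’s “around 75%”. The sum ignores deaths and outward migration, and any patients with more than one HESID would inflate the count.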

Given two or three data points, if this near 1-1 correspondence exists, you could possibly start guessing at matching HESIDs to individuals, or family units, quite quickly…

To ground the binning idea slightly more, here are the geodemographic bins that the report used. They are taken from two widely used geodemographic segmentation tools, ACORN and MOSAIC.

ACORN (A Classification of Residential Neighbourhoods) is CACI’s geodemographic segmentation system of the UK population. We have used the 2010 version of ACORN which segments postcodes into 5 Categories, 17 Groups and 57 Types.

[Images: demog_segments4, demog_segments3]

Mosaic UK is Experian’s geodemographic segmentation system of the UK population. We have used the 2009 version of Mosaic UK which segments postcodes into 15 Groups and 67 Household Types.

[Images: demog_segments2, demog_segments]

The ACORN and MOSAIC data sets seem to provide data at the postcode level. I’m not sure how this was then combined with the HES data, but it seems the IFoA folk found a way (p 29) [or as Anne-Marie Cunningham suggests in the comments, maybe it wasn’t combined by IFoA – maybe it came that way?]:

The HES data records have been encoded with both an ACORN Type and a Mosaic UK Household Type. This enables hospital admissions to be split by ACORN and Mosaic Type. This covers the “claims” side of an incidence rate calculation. In order to determine the exposure, both CACI and Experian were able to provide us with the population of England, as at 2009 and 2010 respectively, split by gender, age band and profiler.
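
The “claims” over “exposure” calculation described in that extract can be sketched as follows. The counts and type labels are invented for illustration and do not come from the report; the `incidence_per_1000` helper is likewise hypothetical.

```python
# Hypothetical counts, invented for illustration; not from the report.
# "Claims" side: hospital admissions for some condition, by geodemographic type.
admissions = {"type_1": 120, "type_2": 45, "type_3": 300}
# "Exposure" side: population in each type (as supplied by the profiler).
population = {"type_1": 60_000, "type_2": 15_000, "type_3": 100_000}


def incidence_per_1000(admissions, population):
    """Crude incidence rate per 1,000 people for each type."""
    return {t: 1000 * admissions[t] / population[t] for t in admissions}


rates = incidence_per_1000(admissions, population)
for t, r in sorted(rates.items()):
    print(t, round(r, 2))
```

According to the extract, the real exposure data was also split by gender and age band, so an actual calculation would be keyed on more than the type alone.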

This then represents another area of concern – the extent to which even pseudonymised data can be combined with other data sets, for example based on geodemographic data. So for example, how are the datasets actually combined, and what are the possible consequences of such combinations? Does the combination enrich the dataset in a way that makes it easier to deanonymise either of the original datasets (if that is your primary concern)? Or does the combination occur in such a way that it may introduce systematic biases into models that are then produced by running summary statistics over groupings applied to the data – biases that may go unacknowledged (to possibly detrimental effect) when the models are used for predictive modelling, pricing models or as part of policy-making, for example?

Just by the by, I also wonder:

– what data was released lawfully under the old system that wouldn’t be allowed to be released now, and to whom, and for what purpose?
– are the people to whom that data was released allowed to continue using and processing that data?
– if they are allowed to continue using that data, under what conditions and for what purpose?
– if they are not, have they destroyed the data (16.05:44), for example by taking a sledgehammer to the computers the data was held on in the presence of NHS officers, or by whatever other means the state approves of?

See also: Is the UK Government Selling You Off?. For more on data linkage, see Some Random Noticings About Data Linkage.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

38 thoughts on “Demographically Classed”

  1. Some sensible commentary! I wondered about the data linkage too and (maybe wrongly) presumed that it had been done by the HSCIC. I don’t think any NHS data should be released to a 3rd party in a form which would allow it to be linked to another dataset. All the linking should be done by HSCIC or equivalent.

    1. @Anne-Marie Thanks… I took it that the linkage was done by the SIAS, but maybe it came pre-coded… hmmm… Ah: “The HES data we received was coded with the classifications from three geodemographic profilers”. So is that: “the data received with the codes included”, or is the sense of ‘was coded’ to be read as ‘we then further coded the data’?

      If the data was supplied with the demographic codes included, then:

      1) does that additional coding information provide cribs that folk can use to try to deanonymise the data?
      2) if not 1), does it provide any further ways to pivot/JOIN in additional data from other Experian etc products, using those codes and any other cribbable data from the HES data? (What geo data does HES data include at individual record level?)
      3) if the geodem data was supplied mixed in to the HES data, do the commercial geodem licenses allow that reuse? Is NHS paying for commercial geodem data that it can essentially sublicense to third parties (and are those costs covered in the cost recovery?)

      1. The geodem data was given free I think… it’s mentioned in the acknowledgements [TH: “CACI and Experian, who kindly donated the use of their main Geodemographic profiling tools, ACORN and Mosaic respectively, to assist in our quantification of the variation in experience by socio-economic status.”]. To be able to do the linking themselves I *think* that full postcode would be needed and I can’t see how it would be reasonable to release that. It’s also been said by SIAS that they received no identifiable info, and full postcode would count as that I think.
        I couldn’t see any analysis of the data on basis of location at all so I didn’t think that postcode needed to have been released at all.
        If no info on hospital or postcode was given then I think this was pretty non-identifiable.
        AM

        1. Ok – so I think I should have reflected a little more on the actual words in the report and where they appear. So in the HES data section we have:

          “The data is highly detailed and the data fields fall into four main categories:
          • Clinical information about a patient’s diagnoses and treatments;
          • Demographic information about the patient, such as their age and gender
          • Administrative information, for example date of admission and discharge; and
          • Geographical information on the location of treatment and the area in which the patient lives.
          Uniquely for our dataset, the last of these categories has been augmented with the addition of the geodemographic variables from Experian and CACI as discussed in section 4.”

          This suggests (I guess) that the data was enriched before it was passed to SIAS. So presumably the discussion went something like:

          IFoA: We can haz data?
          NHSIC: Okay…
          IFoA: Erm… Experian, CACI, we can haz ur data?? If so, erm, NHSIC, u can haz their data and ur data ina blender an’ gizzit ‘ere?
          Experian, CACI, NHSIC: Okay…
          IFoA: Ta muchly…

          And what do Experian and CACI get out of it?

          1. That’s a good question. This is proof of concept work. If, say, insurers wanted to use CACI/Experian data then they would have to license it. That’s my reading.

            1. Or it acts as a proof of concept that Experian/CACI data can in principle be blended with HES data, *and* made available to third parties, and all that’s required for similar in the future is an appropriate contract, and, erm, exchange of pennies? Which would essentially make the NHS a reseller of Experian/CACI data? Or am I going too far down this rabbit hole?!

              1. No you’re right about that too. But in this case it wouldn’t be HSCIC who don’t make money but CPRD who have a market-based model and who are already working with other insurers as far as I know.
                A quick google pulled up this by Open Rights https://www.openrightsgroup.org/ourwork/reports/open-data,-privacy-and-anonymisation-briefing – it mentions a link up between Experian and public sector data but the source link is dead. Several people from Experian participated in this review https://www.gov.uk/government/uploads/system/uploads/attachment_data/file/198752/13-744-shakespeare-review-of-public-sector-information.pdf

                But I still think that the real money for Experian is in the ongoing licenses to industry. Maybe I’m wrong!

                1. [CPRD – http://www.cprd.com/intro.asp – Clinical Practice Research Datalink ] “CPRD services are designed to maximise the way anonymised NHS clinical data can be linked to enable many types of observational research and deliver research outputs that are beneficial to improving and safeguarding public health.”

                  So CPRD act as trusted party in running queries and blending NHS data, reselling data annotations made up from eg Experian data?

                  1. CPRD has data on about 5 million patients. GPs opt their practice in. It had been that only data from one GP system (INPS Vision) contributed primary care data to CPRD but there are moves afoot to get data from all computer systems…might be related to similar systems being put in place for care.data….???

                    The primary care data is then linked with HES. But the data linking is done by HSCIC, who are the body that receives the data in the 1st place. CPRD gets the data in a pseudonymised form. Could they then do the data linkage to eg Experian? I don’t know. Maybe all of that would be done by HSCIC and CPRD would just manage contract etc.

                    1. So from the General Practice Extraction Service (GPES) overview – http://www.hscic.gov.uk/article/2226/GPES-overview – I guess (which given the errors already picked up on and fixed in this post is probably a dangerous move?!) that there are four main GP practice system suppliers in all, providing similar IT services: EMIS, TPP, Microtest and INPS?

                      From what you suggest, some diagrams with flows and boundary lines would be useful to help explain the info flows and what codes/level of detail actually make it through what gateways…?!

    2. Replying to myself in lieu of writing a blog post just yet. The researchers did not do any data linkage. This was all done by NHSIC. They did receive it fully coded. They only received 1st half of postcode and age group. There was no information on which hospitals people had attended.

      Stunningly I am one of *three* people to contact the press office – the others were another blogger and the BBC.

      What has happened to journalism?? We seriously need data literate journalists.

      1. @Anne-Marie Great work:-) Postcode district is quite a broad area, and depending on the width of the age group bands it would widen things further. It’d be interesting to know whether, if you have your own name/postcode/date-of-birth dataset and then ran that against the geodem datasets, there is a possibility of doing a JOIN between your geodem annotations and the geodem data supplied with the HES data (along with postcode region and age bin, which could be derived from a full name/postcode/date-of-birth dataset)?

        1. Ahh, so that is the process one would go through to re-identify. Why would that *not* be possible? How could it be prevented?

          1. In the absence of seeing what the data actually looks like, I’m not sure if the resolution is there to allow the match; on reflection – and I really should do more of that! – it probably wouldn’t be possible: there aren’t enough levels in the geodem groups, and folk are only put into one geodem group by each scheme. If there were several geodem attributes, combinatorics would make for a much more detailed signature. As I guess the expression goes, this is another: my bad…:-(

          2. To qualify further, there are at most 57 * 67 = 3819 combinations of ACORN and MOSAIC levels, though I guess that some of the levels may be more commonly paired across coding schemes than others. So the addition of these codes is not that discriminatory given the number of people likely to fall in to a given postcode district.

            Estimating numbers, for ~3000 postcode districts and ~30 million household addresses, that would give ~10,000 households per postcode district. IW has 12 postcode districts and population of the order 140,000, which gives ~12,000 people per district on a simple average. For 2 persons per household this would be ~6,000 household addresses, so the orders of magnitude look about right…

            If you did get an even spread across all combinations of ACORN and MOSAIC levels, (call it 3000 combinations), on the Isle of Wight these would give 12,000/3,000 = 4 people per combination per level. But I suspect that the distribution across (ACORN,MOSAIC) pairs is very uneven.

            Of course, if you start to factor in disease, gender and age band levels, you get a better signature. 2 genders, how many age band levels? [5 year bands – so call it ~12 age bands, which gives ~24 (age,gender) pairs] This provides a slightly more detailed signature. Then the actual care codes, which is where you start to get opportunities for a much finer detailed signature.

            /via @markhawker, “a count of usual residents broken down by sex and a count of the number of households with one or more usual residents for each postcode within England and Wales.” https://www.nomisweb.co.uk/census/2011/postcode_headcounts_and_household_estimates

            My back of envelope: ~12k in district, ~24 (age_5,gender) pairs; ~12000/24 = ~500 people per (age_5,gender) band

            Mark Hawker’s estimate done via data, age structure estimates (http://en.wikipedia.org/wiki/Demography_of_the_United_Kingdom#Age_structure) and spreadsheet wrangling: “The ‘average’ I get is 646. Based on ~25k in postcode and 38 age groupings across male/female.”

            UPDATE: it seems (Critical Path Report, Appendix 6) that the age segments were 1 year wide for ages 18-85. If we call this 60 age bands, that’s 120 (age,gender) bands, which gives ~12000/120 = ~100 people per band. If these were split across 10 geodem categories, that would get us 10 people per (age, gender,geodem) category
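
            The back of envelope sums in this comment can be collected into a few lines of Python. The inputs are the rough figures used above (estimates, not measured values), the `people_per_cell` helper is just an illustrative name, and the even-spread assumption is almost certainly wrong in practice.

```python
# Rough figures from the comment above; all are estimates.
acorn_types = 57
mosaic_types = 67
people_per_district = 12_000  # Isle of Wight style simple average

# Upper bound on distinct (ACORN, MOSAIC) pairings.
max_pairs = acorn_types * mosaic_types


def people_per_cell(population, *level_counts):
    """Average people per cell if spread evenly across all level combinations."""
    cells = 1
    for n in level_counts:
        cells *= n
    return population / cells


five_year = people_per_cell(people_per_district, 12, 2)         # ~12 bands x 2 genders
one_year = people_per_cell(people_per_district, 60, 2)          # ~60 bands x 2 genders
with_geodem = people_per_cell(people_per_district, 60, 2, 10)   # plus 10 geodem categories

print(max_pairs, five_year, one_year, with_geodem)
```

            Each extra attribute multiplies the number of cells, which is why the average cell size drops so quickly as age, gender and geodem levels are combined.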

            1. By using the hooks on the HES geodem annotated data, I guess in principle someone could then try to dig a little deeper by trying to JOIN that data with finer grained geodem data.

              For example, if you can get geodem data at finer postcode levels via some other means, you could presumably also start to try breaking the district level postcode groupings down into finer grained postcode areas (eg there may be 1000 people in one particular (ACORN,MOSAIC) pairing, but this may be concentrated across a few specific postcode sectors, maybe with some small outliers in specific unit postcodes [http://www.ons.gov.uk/ons/guide-method/geography/beginner-s-guide/postal/index.html]).

              This is all just thought experiment stuff of course… I’ve not tried it and haven’t seen actual data, so don’t know how feasible it would be. In general though, if you can join one data set with another it sometimes allows you to carve the combined set up into finer segments.
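
              As a purely hypothetical sketch of that kind of JOIN: the record, the sector-level lookup table, the codes and the `candidate_sectors` helper below are all invented. Real geodem data is coded at postcode level, so a sector-level lookup like this is a simplification.

```python
# Hypothetical data throughout; codes and postcodes are placeholders.
# A record as it might appear in a district-level, geodem-annotated extract.
record = {"district": "XX1", "acorn": 12, "mosaic": 40}

# Hypothetical finer-grained lookup obtained elsewhere: geodem codes by sector.
sector_geodem = [
    {"sector": "XX1 1", "acorn": 12, "mosaic": 40},
    {"sector": "XX1 2", "acorn": 12, "mosaic": 7},
    {"sector": "XX1 3", "acorn": 30, "mosaic": 40},
    {"sector": "XX2 1", "acorn": 12, "mosaic": 40},
]


def candidate_sectors(record, lookup):
    """JOIN on (district, acorn, mosaic) to find sectors the record could sit in."""
    return [
        row["sector"]
        for row in lookup
        if row["sector"].startswith(record["district"] + " ")
        and row["acorn"] == record["acorn"]
        and row["mosaic"] == record["mosaic"]
    ]


print(candidate_sectors(record, sector_geodem))
```

              Whether a match like this narrows things down in practice depends on how unevenly the geodem codes are spread within a district, which is exactly the thing I don’t have a feel for.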

              Another way of getting finer slices is to use data collected over time/at different time slices. If someone has a particular care signature and you collect data each year for 6 years, and if the age bands are five years wide, and if the data contains historical care records, you can try to match care signatures across years to tunnel down to age in years rather than 5 year band. Of course, with any two years data collections, you may get lucky on folk with a particular care signature crossing an age boundary.
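
              A minimal sketch of the time-slice idea, assuming (as a simplification) that age is just the extract year minus the birth year, ignoring birthdays within the year. The years, bands and function names are hypothetical.

```python
def birth_year_range(year, band):
    """Birth years consistent with being in age band (lo, hi) in a given year.

    Simplification: age is taken as year - birth_year, ignoring birthdays.
    """
    lo, hi = band
    return set(range(year - hi, year - lo + 1))


def narrow(observations):
    """Intersect the birth-year ranges implied by each (year, band) sighting."""
    candidates = None
    for year, band in observations:
        years = birth_year_range(year, band)
        candidates = years if candidates is None else candidates & years
    return sorted(candidates or [])


# Hypothetical: the same care signature seen in two annual extracts.
same_band = narrow([(2005, (40, 44)), (2006, (40, 44))])
crossed_boundary = narrow([(2005, (40, 44)), (2006, (45, 49))])
print(same_band, crossed_boundary)
```

              Two sightings in the same band only trim one year off the range, but a sighting either side of a band boundary pins the birth year down to a single value under this simplification.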

              I guess this raises the question – should you be allowed to keep copies of data year on year? Or are you ONLY allowed to use a single dump in an analysis?

              The other issue this raises for me is – did the HES + ACORN + MOSAIC data make it too easy to try a deanonymisation attack? I don’t have a feel for the numbers of people in the different MOSAIC levels in a postcode region – that could be quite interesting to know even approximately to support back of the envelope/rule of thumb calculations about whether you may give too much away by releasing data JOINed with it…?

  2. Another bit – Institute and Faculty of Actuaries “Telegraph article rebuttal” http://www.actuaries.org.uk/news/press-releases/articles/telegraph-article-rebuttal /via @siascommittee

    “The research used anonymised data from the NHS that was available to organisations looking to further critical illness research. Individuals cannot be recognised from this data. The source data for this research was not made available by us to our membership or to other organisations, our analysis of this data is.”

    So this is sort of in line with the idea presented in the committee enquiry today that a central body could accept queries and run analyses on behalf of a third party and then just present them with the results, even though in this case the third party ran at least some of their own queries on the data. That said, the rebuttal does not clarify how or where the annotation of the HES data with ACORN/MOSAIC data occurred?

    1. Yes, I read that and I agree that it is still ambiguous about how the linking was done. I think that the IFoA explanation is like that of many researchers… I’ll give you the results but not the data. So the same thing would have applied if say an OU academic department had got this HES dataset.

      1. Yes – standard research practice – and arguably quite right that certain data shouldn’t be freely available to unaccredited or unlicensed parties. Though it does make for harder verification/review…

        It does bring in to focus question of who needs access to the data and who needs access to the results? If, for example, you can see database schemas, database queries, and have a test/dummy dataset to test the queries on, do the researchers actually need access to the database?

        1. Yes some have suggested that… why should any 3rd parties get direct access to the data? No fishing trips! Come up with your protocol!

  3. HES was linked to CACI and Experian data by the Information Centre using full postcode. The working party did not receive any identifiable data.

    Applications for non-identifiable data are not reviewed by a committee but by a person in the information governance team. The people ‘officially’ approving it (signing the contract for the IC) are the IG director on behalf of the Caldicott guardian and the commercial director.

      1. 1st part is deduced because there is no CAG approval for the actuary org to have postcode. 2nd part I know because I have had a number of reuse agreements with the IC and they say that on them. All emails during the process of applying for data are from people in the IG team (I assume (hope?)). On another project we needed DAAG approval, which didn’t happen previously.

        1. This data seems more like pseudonymous than non-identifiable data to me. There are suggestions that it should have had DAAG approval and didn’t. This is the issue… and maybe why there are hints that ‘rules broken’?

  4. Hi Tony,

    Here’s a take on the ‘NHS data release to actuaries’ from a ‘conduct / insurance’ perspective – http://ethicsandinsurance.info/2014/02/27/insurers-privacy-nhs/

    About your ‘bins’ (aka categorisation) – the insurance world has always relied on categorisations, although the Test-Achats case vis-à-vis gender showed that they can be used inappropriately. Indeed, without risk categories, there would be no risk pooling and without risk pooling, there would be no insurance. So bins are not per se unethical, but used lazily, inappropriately or for even worse reasons, they can be. The worst case scenario is red-lining, which the US insurance industry got knocked on a few decades ago. IMHO, some of the ways in which geo-demographic segmentation is constructed are highly questionable and could put users at risk from unethical segmentation.

    Cheers
    Duncan

    1. Hi Duncan –

      Thanks for that link – interesting. As the post suggests, it’s an ethical minefield, and techniques and opportunities for enriching and then deanonymising data get more widespread, and refined, as each day passes…

      In part, I was using ‘binning’ in quite a loose and general sense, to mean partitioning the data into smaller subsets. eg binning folk into age range bands, or geodem segments.

      What also concerns me is the way that groups are constructed, then used as the basis for model building, then used for decisionmaking around people later classed as being a member of such a group.

      I think this area is one often overlooked by the knee-jerk ‘privacy invasion’ responses:

      1) if there are associations made with a group and I am put into a group, it may reveal things about me that I haven’t personally disclosed. One tool I play with looks at folk commonly followed by a sample of followers of a person on Twitter. A look at the resulting map around my Twitter account identifies areas of interest in opendata, journalism, the OU, the Isle of Wight, libraries and F1. If it so happened that my followers also tend to like X, I might be classed as liking X too.

      2) I may be placed into a particular grouping whose associated model means that I am advertised particular goods, offered particular services (or not offered other services) as a result, etc. This frame of reference may limit my choices as a result, which may or may not be a good thing. If I am ‘incorrectly’ assigned to that group, the effect of being placed in that frame of reference may additionally lead to other, inappropriate decisions being made about me.

      Sure I agree about the importance of pooling, but I also think there is a wider issue of just trying to make sense of how pooling happens, how models are developed over pools, how assignment to pools works, and what the consequences are of: a) being pooled; b) being inappropriately pooled. At a consumer level there is then the question of: how do I work out which pool I’m in, how do I check it’s the ‘right’ one and how do I get it changed if I’m incorrectly classified?

      Just in passing, when it comes to the insurance industry, there is also the question of whether pooling is used to spread risk across groups in a way that premiums reflect risk, or whether the premiums spread the burden so higher risk groups are effectively subsidised by lower risk groups. I guess an example here may be car insurance premiums and gender?

      (Btw, I’m on shaky ground throughout the whole of this post, and the comment thread; in part I’m airing my confusion publicly as I try to make sense of all of this stuff!;-)
