Archive for the ‘privacy’ Category
So it seems that in a cost-recovered data release that was probably lawful then but possibly wouldn’t be now* – Hospital records of all NHS patients sold to insurers – the
Staple Inn Actuarial Society Critical Illness Definitions and Geographical Variations Working Party (of what, I’m not sure? The Institute and Faculty of Actuaries, perhaps?) got some Hospital Episode Statistics data from the precursor to the HSCIC, blended it with some geodemographic data**, and then came to the conclusion that “that the use of geodemographic profiling could refine Critical illness pricing bases” (source: Extending the Critical Path), presenting the report to the Staple Inn Actuarial Society who also headline branded the PDF version of the report? Maybe?
* House of Commons Health Committee, 25/2/14: 15.59:32 for a few minutes or so; that data release would not be approved now: 16.01:15 reiterated at 16.03:05 and 16.07:05
** or maybe they didn’t? Maybe the data came pre-blended, as @amcunningham suggests in the comments? I’ve added a couple of further questions into my comment reply… – UPDATE: “HES was linked to CACI and Experian data by the Information Centre using full postcode. The working party did not receive any identifiable data.”
CLARIFICATION ADDED (source )—-
“In a story published by the Daily Telegraph today research by the IFoA was represented as “NHS data sold to insurers”. This is not the case. The research referenced in this story considered critical illness in the UK and was presented to members of the Staple Inn Actuarial Society (SIAS) in December 2013 and was made publically available on our website.
“The IFoA is a not for profit professional body. The research paper – Extending the Critical Path – offered actuaries, working in critical illness pricing, information that would help them to ask the right questions of their own data. The aim of providing context in this way is to help improve the accuracy of pricing. Accurate pricing is considered fairer by many consumers and leads to better reserving by insurance companies.
There was also an event on 17 February 2014.
Via a tweet from @SIAScommittee, since deleted for some reason(?), this is clarified further: “SIAS did not produce the research/report.”
The branding that mislead me – I must not be so careless in future…
Many of the current agreements about possible invasions of privacy arising from the planned care.data release relate to the possible reidentification of individuals from their supposedly anonymised or pseudonymised health data (on my to read list: NHS England – Privacy Impact Assessment: care.data) but to my mind the
SIAS report presented to the SIAS suggests that we also need to think about consequences of the ways in which aggregated data is analysed and used (for example, in the construction of predictive models). Where aggregate and summarised data is used as the basis of algorithmic decision making, we need to be mindful that sampling errors, as well as other modelling assumptions, may lead to biases in the algorithms that result. Where algorithmic decisions are applied to people placed into statistical sampling “bins” or categories, errors in the assignment of individuals into a particular bin may result in decisions being made against them on an incorrect basis.
Rather than focussing always on the ‘can I personally be identified from the supposedly anonymised or pseudonymised data’, we also need to be mindful of the extent to, and ways in, which:
1) aggregate and summary data is used to produce models about the behaviour of particular groups;
2) individuals are assigned to groups;
3) attributes identified as a result of statistical modelling of groups are assigned to individuals who are (incorrectly) assigned to particular groups, for example on the basis of estimated geodemographic binning.
What worries me is not so much ‘can I be identified from the data’, but ‘are there data attributes about me that bin me in a particular way that statistical models developed around those bins are used to make decisions about me’. (Related to this are notions of algorithmic transparency – though in many cases I think this must surely go hand in hand with ‘binning transparency’!)
That said, for the personal-reidentification-privacy-lobbiests, they may want to pick up on the claim in the
SIASIFoA report (page 19) that:
In theory, there should be a one to one correspondence between individual patients and HESID. The HESID is derived using a matching algorithm mainly mapped to NHS number, but not all records contain an NHS number, especially in the early years, so full matching is not possible. In those cases HES use other patient identifiable fields (Date of Birth, Sex, Postcode, etc.) so imperfect matching may mean patients have more than one HESID. According to the NHS IC 83% of records had an NHS number in 2000/01 and this had grown to 97% by 2007/08, so the issue is clearly reducing. Indeed, our data contains 47.5m unique HESIDs which when compared to the English population of around 49m in 1997, and allowing for approximately 1m new lives a year due to births and inwards migration would suggest around 75% of people in England were admitted at least once during the 13 year period for which we have data. Our view is that this proportion seems a little high but we have been unable to verify that this proportion is reasonable against an independent source.
Given two or three data points, if this near 1-1 correspondence exists, you could possibly start guessing at matching HESIDs to individuals, or family units, quite quickly…
– ACORN (A Classification of Residential Neighbourhoods) is CACI’s geodemographic segmentation system of the UK population. We have used the 2010 version of ACORN which segments postcodes into 5 Categories, 17 Groups and 57 Types.
– Mosaic UK is Experian’s geodemographic segmentation system of the UK population. We have used the 2009 version of Mosaic UK which segments postcodes into 15 Groups and 67 Household Types.
The ACORN and MOSAIC data sets seem to provide data at the postcode level. I’m not sure how this was then combined with the HES data, but it seems the
SIASIFoA folk found a way (p 29) [or as Anne-Marie Cunningham suggests in the comments, maybe it wasn’t combined by SIASIFoA – maybe it came that way?]:
The HES data records have been encoded with both an ACORN Type and a Mosaic UK Household Type. This enables hospital admissions to be split by ACORN and Mosaic Type. This covers the “claims” side of an incidence rate calculation. In order to determine the exposure, both CACI and Experian were able to provide us with the population of England, as at 2009 and 2010 respectively, split by gender, age band and profiler.
This then represents another area of concern – the extent to which even pseudonymised data can be combined with other data sets, for example based on geo-demographic data. So for example, how are the datasets actually combined, and what are the possible consequences of such combinations? Does the combination enrich the dataset in such a way that makes it easier for use to deanonymise either of the original datasets (if that is your primary concern); or does the combination occur in such a way that it may introduce systematic biases into models that are then produced by running summary statistics over groupings that are applied over the data, biases that may be unacknowedged (to possibly detrimental effect) when the models are used for predictive modelling, pricing models or as part of policy-making, for example?
Just by the by, I also wonder:
– what data was released lawfully under the old system that wouldn’t be allowed to be released now, and to whom, and for what purpose?
– are the people to whom that data was released allowed to continue using and processing that data?
– if they are allowed to continue using that data, under what conditions and for what purpose?
– if they are not, have they destroyed the data (16.05:44), for example by taking a sledgehammer to the computers the data was held on in the presences of NHS officers, or by whatever other means the state approves of?
Via @wilm, I notice that it’s time again for someone (this time at the Wall Street Journal) to have written about the scariness that is your Google personal web history (the sort of thing you probably have to opt out of if you sign up for a new Google account, if other recent opt-in by defaults are to go by…)
It may not sound like much, but if you do have a Google account, and your web history collection is not disabled, you may find your emotional response to seeing months of years of your web/search history archived in one place surprising… Your Google web history.
Not mentioned in the WSJ article was some of the games that the Chrome browser gets up. @tim_hunt tipped me off to a nice (if technically detailed, in places) review by Ilya Grigorik of some the design features of the Chrome browser, and some of the tools built in to it: High Performance Networking in Chrome. I’ve got various pre-fetching tools switched off in my version of Chrome (tools that allow Chrome to pre-emptively look up web addresses and even download pages pre-emptively*) so those tools didn’t work for me… but looking at chrome://predictors/ was interesting to see what keystrokes I type are good predictors of web pages I visit…
* By the by, I started to wonder whether webstats get messed up to any significant effect by Chrome pre-emptively prefetching pages that folk never actually look at…?
In further relation to the tracking of traffic we generate from our browsing habits, as we access more and more web/internet services through satellite TV boxes, smart TVs, and catchup TV boxes such as Roku or NowTV, have you ever wondered about how that activity is tracked? LG Smart TVs logging USB filenames and viewing info to LG servers describes not only how LG TVs appear to log the things you do view, but also the personal media you might view, and in principle can phone that information home (because the home for your data is a database run by whatever service you happen to be using – your data is midata is their data).
there is an option in the system settings called “Collection of watching info:” which is set ON by default. This setting requires the user to scroll down to see it and, unlike most other settings, contains no “balloon help” to describe what it does.
At this point, I decided to do some traffic analysis to see what was being sent. It turns out that viewing information appears to be being sent regardless of whether this option is set to On or Off.
you can clearly see that a unique device ID is transmitted, along with the Channel name … and a unique device ID.
This information appears to be sent back unencrypted and in the clear to LG every time you change channel, even if you have gone to the trouble of changing the setting above to switch collection of viewing information off.
It was at this point, I made an even more disturbing find within the packet data dumps. I noticed filenames were being posted to LG’s servers and that these filenames were ones stored on my external USB hard drive.
Hmmm… maybe it’s time I switched out my BT homehub for a proper hardware firewalled router with a good set of logging tools…?
PS FWIW, I can’t really get my head round how evil on the one hand, or damp squib on the other, the whole midata thing is turning out to be in the short term, and what sorts of involvement – and data – the partners have with the project. I did notice that a midata innovation lab report has just become available, though to you and me it’ll cost 1500 squidlly diddlies so I haven’t read it: The midata Innovation Opportunity. Note to self: has anyone got any good stories to say about TSB supporting innovation in micro-businesses…?
PPS And finally, something else from the Ilya Grigorik article:
The HTTP Archive project tracks how the web is built, and it can help us answer this question. Instead of crawling the web for the content, it periodically crawls the most popular sites to record and aggregate analytics on the number of used resources, content types, headers, and other metadata for each individual destination. The stats, as of January 2013, may surprise you. An average page, amongst the top 300,000 destinations on the web is:
– 1280 KB in size
– composed of 88 resources
– connects to 15+ distinct hosts
Is it any wonder that pages take so long to load on a mobile phone off the 3G netwrok, and that you can soon eat up your monthly bandwidth allowance!
A BIS Press Release (Next steps making midata a reality) seems to have resulted in folk tweeting today about the #midata consultation that was announced last month. If you haven’t been keeping up, #midata is the policy initiative around getting companies to make “[consumer data] that may be actionable and useful in making a decision or in the course of a specific activity” (whatever that means) available to users in a machine readable form. To try to help clarify matters, several vignettes are described in this July 2012 report – Example applications of the midata programme – which plays the role of a ‘draft for discussion’ at the September midata Strategy Board [link?]. Here’s a quick summary of some of them:
- form filling: a personal datastore will help you pre-populate forms and provide certified evidence of things like: proof of her citizenship, qualified to drive, passed certain exams and achieved certain qualifications, passed a CRB check, and so on. (Note: I’ve previously tried to argue the case for the OU starting to develop a service (OU Qualification Verification Service) around delivering verified tokens relating to the award of OU degrees, and degrees awarded by the polytechnics, as was (courtesy of the OU’s CNAA Aftercare Service), but after an initial flurry of interest, it was passed on. midata could bring it back maybe?
- home moving admin: change your details in a personal “mydata” data store, and let everyone pick up the changes from there. Just think what fun you could have with an attack on this;-)
- contracts and warranties dashboard: did my crApple computer die the week before or after the guarantee ran out?
- keeping track of the housekeeping: bank and financial statement data management and reporting tools. I thought there already was software for doing this? do we use it though? I’d rather my bank improved the tools it provided me with?
- keeping up with the Jones’s: how does my house’s energy consumption compare with that of my neighbours?
- which phone? Pick a tariff automatically based on your actual phone usage. From going through this recently, the problem is not with knowing how I use my phone (easy enough to find out), it’s with navigating the mobile phone sites trying to understand their offers. (And why can’t Vodafone send me an SMS to say I’m 10 minutes away from using up this month’s minutes, rather than letting me go over? The midata answer might be an agent that looks at my usage info and tells me when I’m getting close to my limit, which requires me having access to my contract details in a machine readable form, I guess?
And here’s a BIS blog post summarising them: A midata future: 10 ways it could shape your choices.
(The #midata policy seems based on a belief that users want better access to data so they can do things with it. I’m not convinced – why should I have to export my bank data to another service (increasing the number of services I must trust) rather than my bank providing me with useful tools directly? I guess one way this might play out is that any data that does dribble out may get built around by developers who then sell the tools back to the data providers so they can offer them directly? In this context, I guess I should read the BIS commissioned Jigsaw Research report: Potential consumer demand for midata.)
Today has also seen a minor flurry of chat around the call for evidence on the Communications Data Bill, presumably because the closing date for responses is tomorrow (draft Communications Data Bill). (Related reading: latest Annual Report of the Interception of Communications Commissioner.) Again, if you haven’t been keeping up, the draft Communications Data Bill describes communications data in the following terms:
- Communications data is information about a communication; it can include the details of the time, duration, originator and recipient of a communication; but not the content of the communication itself
- Communications data falls into three categories: subscriber data; use data; and traffic data.
The categories are further defined in an annex:
- Subscriber Data – Subscriber data is information held or obtained by a provider in relation to persons to whom the service is provided by that provider. Those persons will include people who are subscribers to a communications service without necessarily using that service and persons who use a communications service without necessarily subscribing to it. Examples of subscriber information include:
– ‘Subscriber checks’ (also known as ‘reverse look ups’) such as “who is the subscriber of phone number 012 345 6789?”, “who is the account holder of e-mail account email@example.com?” or “who is entitled to post to web space http://www.xyz.anyisp.co.uk?”;
– Subscribers’ or account holders’ account information, including names and addresses for installation, and billing including payment method(s), details of payments;
– information about the connection, disconnection and reconnection of services which the subscriber or account holder is allocated or has subscribed to (or may have subscribed to) including conference calling, call messaging, call waiting and call barring telecommunications services;
– information about the provision to a subscriber or account holder of forwarding/redirection services;
– information about apparatus used by, or made available to, the subscriber or account holder, including the manufacturer, model, serial numbers and apparatus codes.
– information provided by a subscriber or account holder to a provider, such as demographic information or sign-up data (to the extent that information, such as a password, giving access to the content of any stored communications is not disclosed).
- Use data – Use data is information about the use made by any person of a postal or telecommunications service. Examples of use data may include:
– itemised telephone call records (numbers called);
– itemised records of connections to internet services;
– itemised timing and duration of service usage (calls and/or connections);
– information about amounts of data downloaded and/or uploaded;
– information about the use made of services which the user is allocated or has subscribed to (or may have subscribed to) including conference calling, call messaging, call waiting and call barring telecommunications services;
– information about the use of forwarding/redirection services;
– information about selection of preferential numbers or discount calls;
- Traffic Data – Traffic data is data that is comprised in or attached to a communication for the purpose of transmitting the communication. Examples of traffic data may include:
– information tracing the origin or destination of a communication that is in transmission;
– information identifying the location of equipment when a communication is or has been made or received (such as the location of a mobile phone);
– information identifying the sender and recipient (including copy recipients) of a communication from data comprised in or attached to the communication;
– routing information identifying equipment through which a communication is or has been transmitted (for example, dynamic IP address allocation, file transfer logs and e-mail headers – to the extent that content of a communication, such as the subject line of an e-mail, is not disclosed);
– anything, such as addresses or markings, written on the outside of a postal item (such as a letter, packet or parcel) that is in transmission;
– online tracking of communications (including postal items and parcels).
To put the communications data thing into context, here’s something you could try for yourself if you have a smartphone. Using something like the SMS to Text app (if you trust it!), grab your txt data from your phone and try charting it: SMS analysis (coming from an Android smartphone or an IPhone). And now ask yourself: what if I also mapped my location data, as collected by my phone? And will this sort of thing be available as midata, or will I have to collect it myself using a location tracking app if I want access to it? (There’s an asymmetry here: the company potentially collecting the data, or me collecting the data…)
It’s also worth bearing in mind that even if access to your data is locked down, access to the data of people associated with you might reveal quite a lot of information about you, including your location, as Adam Sadilek et al. describe: Finding Your Friends and Following Them to Where You Are (see also Far Out: Predicting Long-Term Human Mobility). My own tinkerings with emergent social positioning (looking at who the followers of particular twitter users also follow en masse) also suggest we can generate indicators about potential interests of a user by looking at the interests of their followers… Even if you’re careful about who your friends are, your followers might still reveal something about you you have tried not to disclose yourself (such as your birthday…). (That’s one of the problems with asymmetric trust models! Hmmm… could be interesting to start trying to model some of this… )
Both of these consultations provide a context for reflecting on the extent to which companies use data for their own processing purposes (for a recent review, see What happens to my data? A novel approach to informing users of data processing practices), the extent to which they share this data in raw and processed form with other companies or law enforcement agencies, the extent to which they may use it to underwrite value-added/data-powered services to users directly or when combined with data from other sources, the extent to which they may be willing to share it in raw or processed form back with users, and the extent to which users may then be willing (or licensed) to share that data with other providers, and/or combine it with data from other providers.
One of the biggest risks from a “what might they learn about me” point of view – as well as some of the biggest potential benefits – comes from the reconciliation of data from multiple different sources. Mosaic theory is an idea taken from the intelligence community that captures the idea that when data from multiple sources is combined, the value of the whole view may be greater than the sum of the parts. When privacy concerns are idly raised as a reason against the release of data, it is often suspicion and fears around what a data mosaic picture might reveal that act as drivers of these concerns. (Similar fears are also used as a reason against the release of data, for example under Freedom of Information requests, in case a mosaic results in a picture that can be used against national interests: eg D.E. Pozen, The Mosaic Theory, National Security, and the Freedom of Information Act and MP Goodwin, A National Security Puzzle: Mosaic Theory and the First Amendment Right of Access in the Federal Courts).
Note that within a particular dataset, we might also appeal to mosaic theory thinking; for example, might we learn different things when we observe individual data records as singletons, as opposed to a set of data (and the structures and patterns it contains) as a single thing: GPS Tracking and a ‘Mosaic Theory’ of Government Searches. And as a consequence, might we want to treat individual data records, and complete datasets, differently?
PS via this ORG post – Consulympics: opportunities to have your say on tech policies – which details a whole raft of currently open ICT related consultations in the UK, I am reminded of this ICO Consultation on the draft Anonymisation code of practice along with a draft of the anaoymisation code itself.
I rarely link social apps to other social apps, but sometimes I click through on the first through stages of the linking process to see what happens. Here’s an example I just tried using Klout, which wants me to link in to my account on Facebook. The screenshot is taken from Facebook… but what does it mean?
Does that horizontal arrow aligned with the first element mean permission is only being requested for my personal information? Or is that thin vertical line an “AND” that says persmission is being requested to access my personal information AND post to my wall AND etc etc…
I have no idea….?
[A story a few days ago (March 2012) brought this post to mind… Here’s the recent story – Walmart buys a Facebook-based calendar app to get a look at customers’ dates: “The Social Calendar app and its file of 110 million birthdays and other events, acquired from Newput Corp., will give Walmart the ability to expand its efforts to dig deeper into the lives of customers—allowing customers to make purchases on Walmart.com directly from event reminders from the Web or their mobile device.” It’s time I started brushing up on my legal understanding, I think: in the UK, would data protection legislation prevent one company from buying another for its data, and then using that data for a different reason to the reason for which it was collected? And if so, how is different defined? Could the data be used to annotate/be annotated by other data to create a derived product? Hmm… And how will #midata fit in with all this? eg We Can Haz Our Personal Data Back from Corporates?]
A long time ago, I wrote:
A couple of weeks ago [err, that’ll be years now;-)], I was telling a colleague about a podcast I’d heard earlier that day: Future Proofing Your Privacy. At the start of the talk, the speaker, Mark Hedland, tells of how he posted to an online group a post that said…
For those of you who haven’t followed the links, here’s a recap. Something that was posted over 10 years ago to a part of the web that wasn’t supposed to be being archived, was – and now Mark Hedland can show how foolish he was then in thinking that [what] he was saying then would disappear.
As we talked, my colleague [“Sam Smith”] mentioned how 5 or so years ago they had posted a request to a news group asking for a translation of a traditional, Canadian French folk song, a translation they have since lost, along with the name of the song. (Actually, it wasn’t a song, French or Canadian, but it was to do with translation; I have changed the specific details to protect my colleague’s privacy!)
Two minutes after leaving their office (or maybe it was three, certainly no longer than that) I mailed my colleague a link to a Google Groups search page containing their long lost post. The query used the equivalent of these search terms: translation song “sam smith”. The post being searched for was the third item in the list of search results.
And so, as Google continues to roll out its social circle search facilities and use the people you know (and the people they know) to inform what search results you see, [and as Google buys up other social search companies, such as Aardvark (e.g. Google Buys Human-driven Search Engine Aardvark: Will It Make It to the Main SERPS?)], it’s worth bearing in mind a few things:
1) Just because you haven’t given Google your Twitter details, Google may know you’re my friend becuase I have given Google my twitter details and my friends and followers lists are public (an ‘asymmetric disclosure’? So for example, for a symmetric disclosure, Google might only use the belief that we’re friends if I follow you AND you have given Google your Twitter credentials AND you follow me. But if it you uses you to inform my results simply because I follow you, that would be asymmetric?)
2) Just because you haven’t given Google any personal info, Google might buy a company you have disclosed personal information to and then assimilate it into their growing total information awareness… (You do know Google owns Youtube, don’t you, and so has a pretty good idea of everything you’ve watched on it?;-)
3) Your mum may be influencing your search results… And you might be influencing your kids’ results… ;-)
PS a not evil thing to do would be to give users of an acquired service a guaranteed period of grace between the announcement that company has been acquired and the time when Google first has access to personal data, with the guarantee that users can withdraw from the service within that period and have their records permanently deleted.
PPS what does Google know about you? Here are two things to try: if you have a Google account, see who’s in your social circle; and whether or not you have a Google account, see what Google’s social graph API can turn up about you… .
PPPS if you’re on Facebook, Twitter and LinkedIn, Mashed In provides a widget based tool for letting other people on those networks see how closely linked they are to you… The asymmetries might arise here from all over the place, depending on what Mashed In is actually doing (I’ll try to do some digging…). For example, you might log on to my site and see that you are connected to someone on Facebook who is connected to someone on Twitter who I’m connect to on Linked In. Those intermediaries, who maybe are trying to maintain privacy of a sort by having separate social circles on different networks, are suddenly exposed. Like weddings where guests from different parts of the happy couple’s life collide, your connections may b your undoing. (Hmmm, so I wonder, are all these social tools going to start being deployed on prospective MPs I wonder? Prospctiv Parliamentary Candidate X is only two steps away from both a member of an dodgy looking group on Facebook and an ex porn star, for example… MPs expenses could be as if nothing compared to the sorts of selective storytelling you might be abl to turn up as a result of friend of a friend connections. Think Twiangulate, but working over multiple servics (as Mashed In might do?), court records, local news searches, gossip sites, company directorships, etc etc… Nightmare…
PPPPS Not to self – do a post on this… Reidentification Using Social Networks (i.e. deanonymisation); for sample History attack code, see SocialHistory.js: See Which Sites Your Users Visit]