Reflections on (Government) (Big) Data Use…

Some thoughts scribbled down on my way home from a Policy Exchange workshop on “Big Data in Gov” earlier today, in which I start trying to unpack some of the confusion I have about what the open data and data driven government thing is all about…

When asked about the challenges around the use of personal data for government or commercial purposes, it’s easy to fall into the trap of putting privacy concerns at the top of the list and leaving it at that. So here are some of the assumptions and beliefs I tend to bundle into the “privacy concerns and fears” bucket:

confidentiality: when folk talk about breaches of online privacy, I suspect they’re actually concerned about a loss of confidentiality;

– associated with confidentiality is selective revelation, or the belief that we should not have to divulge certain sorts of information to anyone who asks for it, or that if we do, it will be in confidence and subject to informed consent about how that data will be used.

Relating to these on social networks in particular are notions surrounding recovery from inappropriate disclosure (such as deletion of content), whether by the first party (someone posts something they want to retract), the second party (a “friend” makes a disclosure the first party would prefer had not been made, such as wishing them a happy birthday and revealing their birthdate) or a third party (where someone who isn’t a “friend” of the first party makes a disclosure about the first party) (see for example Personal Declarations on Your Behalf – Why Visiting One Website Might Tell Another You Were There).

In part, I suspect there is often a tacit assumption that there are safeguards on how data is collected and shared (e.g. as regulated by the Data Protection Act; for a quick overview, see ICO guidance on DPA) but that the majority of folk (myself included!) are more than a little hazy about what the law actually stipulates… I also suspect that folk do not generally know what data large companies have collected about them, or the purposes to which that data is put. Add to this concerns about the buying, selling, aggregation and disaggregation of personal data as part of the business of going concerns or, for example, as companies themselves are bought and sold, maybe even for the data they hold.

loss of anonymity: privacy as anonymity, or at least, the right to limit knowledge of your actions to a specific, limited public, or to act with confidence that you will not be recognised outside that limited public. When different data sets can be cross-referenced or reconciled with each other, that data graph can become an unexpected public witness (“with the graph as my witness”!).
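To make the cross-referencing point a little more concrete, here is a minimal Python sketch of how two datasets might be reconciled. Everything in it is made up for illustration: the records, the field names (`postcode_district`, `birth_year`, `sex`) and the `reidentify` function are my own assumptions, not a description of any real dataset or tool. The idea is simply that a nameless record stops being anonymous once its remaining attributes match exactly one person in a second, named dataset.

```python
from collections import defaultdict

# Hypothetical "anonymised" records: names removed, other attributes kept.
anonymised_visits = [
    {"postcode_district": "MK7", "birth_year": 1975, "sex": "F", "clinic": "cardiology"},
    {"postcode_district": "MK7", "birth_year": 1982, "sex": "M", "clinic": "dermatology"},
    {"postcode_district": "PO30", "birth_year": 1975, "sex": "F", "clinic": "oncology"},
]

# Hypothetical named dataset that shares the same attributes.
public_register = [
    {"name": "Alice Example", "postcode_district": "MK7", "birth_year": 1975, "sex": "F"},
    {"name": "Bob Example", "postcode_district": "MK7", "birth_year": 1982, "sex": "M"},
    {"name": "Carol Example", "postcode_district": "PO30", "birth_year": 1975, "sex": "F"},
    {"name": "Dora Example", "postcode_district": "PO30", "birth_year": 1975, "sex": "F"},
]

# Fields assumed to be present in both datasets.
SHARED_FIELDS = ("postcode_district", "birth_year", "sex")


def shared_key(record):
    """Build a join key from the fields the two datasets have in common."""
    return tuple(record[field] for field in SHARED_FIELDS)


def reidentify(anonymised, named):
    """Return (name, clinic) pairs where exactly one named person matches."""
    names_by_key = defaultdict(list)
    for person in named:
        names_by_key[shared_key(person)].append(person["name"])

    matches = []
    for record in anonymised:
        candidates = names_by_key.get(shared_key(record), [])
        if len(candidates) == 1:  # a unique match means anonymity is lost
            matches.append((candidates[0], record["clinic"]))
    return matches


reidentified = reidentify(anonymised_visits, public_register)
for name, clinic in reidentified:
    print(name, "->", clinic)
```

In this toy example the third record stays anonymous only because two people in the named dataset share the same attributes; the more fields two datasets have in common, the less often that kind of cover is likely to exist.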

“creepiness” (or, “how did they know that….?”): this may be thought of as a form of invasion of privacy, in which your personal data may be processed in such a way that it triggers an action that appears to you to breach a confidence you did not knowingly or intentionally share.

“potential for evil”: to what extent might your data be used against you (i.e. to your detriment rather than your benefit)? In part, this may relate to uninformed consent, or use of data without, or against, consent, but it also admits of the ways in which data released for one purpose may come to be cross-referenced with other datasets which in turn and as a result may then be used against you.

equitability: if, as in Norway, your tax affairs are made public (via the skatteliste (“tax list”)) but the tax affairs of your neighbour aren’t, you might feel as if that were rather unfair. And if personal tax affairs are public, should corporate tax dealings be public to the same extent too (for example, if we were to compare the dealings of a sole trader operating under a personal tax regime with those of their neighbour who set up a limited company or similar to operate a similar business under a “private” corporate tax regime)?

One of the other things we discussed was the extent to which personalisation might feature in the way government deals with its citizens. Part of the brief was to try to pay heed to how waste could be reduced and fraud either detected or prevented. Though I don’t have much evidence to hand to base this on, it seems to me that part of the role of personalisation might be to identify services and benefits that meet the needs of a particular user profile, and to try to allocate services or resources more efficiently to people who are eligible for them (at least in the sense of making the citizen aware of their entitlements). In the case of tax, it struck me that a good accountant essentially personalises the tax affairs of their client to maximise the benefit to the client, which got me wondering about the extent to which a personalised HMRC dashboard might tell me how to fill in my tax return for maximum efficiency…! And that if such a dashboard were pointing out to citizens the various loopholes and workarounds they could employ to minimise their tax spend, those loopholes would presumably get fixed pretty quickly…

As far as waste goes, making sure people claim things to which they are entitled, rather than things to which they are not entitled, presumably saves time and cost in processing those ineligible requests and minimises opportunities for misallocation through that route. Reducing friction (such as reducing the number of times or number of places in which a user needs to enter the same personal data), and increasing fluidity (for example, by allowing government services to share data elements, such as the DVLA “borrowing” (with your permission) your photograph from the Passport Office for your driving licence) can also serve to reduce duplicated processes and the potential for error (and hence cost, as well as the opportunity for fraud) that occurs in such cases.

In terms of fraud, this may in part be seen as a deliberate attempt to create a profile that is not a true one but that is eligible for benefits or services that are not directed at the true profile. One way of mitigating such attempts at fraud might then be to find means by which creating a false profile for the purposes of fraud triggers graph conflicts that can be used to signal the fraud or deception.
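As a rough illustration of what a “graph conflict” might look like, here is a minimal Python sketch. It assumes, purely hypothetically, that different services each hold statements of the form (source, person identifier, field, value), and that some fields, such as date of birth, should only ever take one value per person. The sources, identifiers, field names and the `find_conflicts` function are all invented for the example and do not describe how any government system actually works.

```python
from collections import defaultdict

# Hypothetical statements held by different services about the same identifiers.
records = [
    ("benefits_claim", "ID-001", "date_of_birth", "1980-02-14"),
    ("driving_licence", "ID-001", "date_of_birth", "1980-02-14"),
    ("benefits_claim", "ID-001", "address", "1 High Street"),
    ("driving_licence", "ID-001", "address", "9 Low Road"),
    ("benefits_claim", "ID-002", "date_of_birth", "1991-07-01"),
    ("passport", "ID-002", "date_of_birth", "1971-07-01"),
]


def find_conflicts(records, single_valued=("date_of_birth",)):
    """Flag identifiers whose single-valued fields disagree across sources.

    Returns {(person_id, field): {value: [sources asserting that value]}}.
    """
    seen = defaultdict(dict)
    for source, person_id, field, value in records:
        seen[(person_id, field)].setdefault(value, []).append(source)

    conflicts = {}
    for (person_id, field), values in seen.items():
        # Only fields assumed to have one true value can be in conflict.
        if field in single_valued and len(values) > 1:
            conflicts[(person_id, field)] = values
    return conflicts


conflicts = find_conflicts(records)
for (person_id, field), values in conflicts.items():
    print(person_id, field, values)
```

Here the differing addresses for ID-001 are not flagged, because people legitimately move house, whereas the two dates of birth for ID-002 are. A conflict like this is only a signal that something needs a closer look; it could just as easily be a typing error as an attempt at deception.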

As far as big data in government goes, I’m not sure we touched on it much at all. I do wonder, though, about the extent to which government could – or should – buy big data from corporates. Census data may be plugged in to databases held by companies such as Experian, but how much does it add? And how much richer, and more current, are Experian’s datasets than the census reports? Google famously used search behaviours to identify flu trends, and I suspect that the supermarkets have a pretty good model of how calories, food groups and medicines are purchased, if not consumed, at a local level, which presumably feeds (no pun intended) into local health trends over a variety of timescales. And as far as traffic monitoring goes? I suspect that the mobile phone network operators have access to far more comprehensive, up-to-date and even realtime models of pedestrian, as well as traffic, flows than the government does…

One of the things that big data, wherever it’s produced, can benefit from is some form of scaffolding that provides a basis around which normalisation can occur. In the education sector, getting a normalised catalogue of course codes is one example, although this is something that UCAS still appears unwilling to release as open data, let alone free open data.

And as far as open data goes, I think a good reason for opening up data is that it allows innovation from the inside to happen on the outside. Which is to say, developers working inside government may be tied to using legacy systems and processes, but if the data is open and public, there is nothing to stop them building more efficient implementations outside government, demonstrating their benefits, then bringing them back within government…