Whilst preparing for what turned out to be a very enjoyable data at the BBC Data Data in Birmingham on Tuesday, where I ran a session on Open Refine [slides] I’d noticed that one of the transformations Open Refine supports is hashing using either MD5 or SHA-1 algorithms. What these functions essentially do is map a value, such as a name or personal identifier, on to what looks like a random number. The mapping is one way, so give the hash value of a name or personal identifier, you can’t go back to the original. (The way the algorithms work means that there is also a very slight possibility that two different original values will map on to the same hashed value which may in turn cause errors when analysing the data.)
We can generate the hash of values in a column by transforming the column using the formula md5(value) or sha1(value).
If I now save the data using the transformed (hashed) vendor name (either the SHA-1 hash or the MD5 hash), I can release the data without giving away the original vendor name, but whilst retaining the ability to identify all the rows associated with a particular vendor name.
One of the problems with MD5 and SHA-1 algorithms from a security point of view is that they run quickly. This means that a brute force attack can take a list of identifiers (or generate a list of all possible identifiers), run them through the hashing algorithm to get a set of hashed values, and then look up a hashed value to see what original identifier generated it. If the identifier is a fixed length and made from a fixed alphabet, the attacker can easily generate each possible identifier.
One way of addressing this problem is to just add salt… In cryptography, a salt (sic) is a random term that you add to a value before generating the hash value. This has the advantage that it makes life harder for an attacker trying a brute force search but is easy to implement. If we are anonymising a dataset, there are a couple of ways we can quickly generate a salt term. The strongest way to do this is to generate a column containing unique random numbers or terms as the salt column, and then hash on the original value plus the salt. A weaker way would be to use the values of one of the other columns in the dataset to generate the hash (ideally this should be a column that doesn’t get shared). Even weaker would be to use the same salt value for each hash; this is more akin to adding a password term to the original value before hashing it.
Unfortunately, in the first two approaches, if we create a unique salt for each row, this will break any requirement that a particular identifier, for example, is always encoded as the same hashed value (we need to guarantee this if we want to do analysis on all the rows associated with it, albeit with those rows identified using the hashed identifier). So when we generate the salt, we ideally want a unique random salt for each identifier, and that salt to remain consistent for any given identifier.
If you look at the list of available GREL string functions you will see a variety of methods for processing string values that we might be able to combine to generate some unique salt values, trusting that an attacker is unlikely to guess the particular combination we have used to create the salt values. In the following example, I generate a salt that is a combination of a “fingerprint” of the vendor name (which will probably, though not necessarily, be different for each vendor name, and add to it a secret fixed “password” term). This generates a consistent salt for each vendor name that is (probably) different from the salt of every other vendor name. We could add further complexity by adding a number to the salt, such as the length of the vendor name (value.length()) or the remainder of the length of the vendor name divided by some number (value.length()%7, for example, in this case using modulo 7).
Having generated a salt column (“Salt”), we can then create hash values of the original identifier and the salt value. The following shows both the value to be hashed (as a combination of the original value and the salt) and the resulting hash.
As well as masking identifiers, anonymisation strategies also typically require that items that can be uniquely identified because of their low occurrence in a dataset. For example, in an educational dataset, a particular combination of subjects or subject results might uniquely identify an individual. Imagine a case in which each student is given a unique ID, the IDs are hashed, and a set of assessment results is published containing (hashed_ID, subject, grade) data. Now suppose that only one person is taking a particular combination of subjects; that fact might then be used to identify their hashed ID from the supposedly anonymised data and associate it with that particular student.
OpenRefine may be able to help us identify possible problems in this respect by means of the faceted search tools. Whilst not a very rigorous approach, you could for example trying to query the dataset with particular combinations of facet values to see how easily you might be able to identify unique individuals. In the above example of (hashed_ID, subject, grade) data, suppose I know there is only one person taking the combination of Further Maths and Ancient Greek, perhaps because there was an article somewhere about them, although I don’t know what other subjects they are taking. If I do a text facet on the subject column and select the Further Maths and Ancient Greek values, filtering results to students taking either of those subjects, and I then create a facet on the hashed ID column, showing results by count, there would only be one hashed ID value with a count of 2 rows (one row corresponding to their Further Maths participation, the other to their participation in Ancient Greek. I can then invade that person’s privacy by searching on this hashed ID value to find out what other subjects they are taking.
Note that I am not a cryptographer or a researcher into data anonymisation techniques. To do this stuff properly, you need to talk to someone who knows how to do it properly. The technique described here may be okay if you just want to obscure names/identifiers in a dataset you’re sharing with work colleagues without passing on personal information, but it really doesn’t do much more than that.
PS A few starting points for further reading: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, Privacy, Anonymity, and Big Data in the Social Sciences and The Algorithmic Foundations of Differential Privacy.