OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Archive for the ‘Anything you want’ Category

First Baby Steps in Anonymising Data With Open Refine


Whilst preparing for what turned out to be a very enjoyable day at the BBC Data Day in Birmingham on Tuesday, where I ran a session on Open Refine [slides], I noticed that one of the transformations Open Refine supports is hashing using either the MD5 or SHA-1 algorithms. What these functions essentially do is map a value, such as a name or personal identifier, on to what looks like a random number. The mapping is one way, so given the hash value of a name or personal identifier, you can’t go back to the original. (The way the algorithms work means that there is also a very slight possibility that two different original values will map on to the same hashed value, which may in turn cause errors when analysing the data.)

We can generate the hash of values in a column by transforming the column using the formula md5(value) or sha1(value).

[OpenRefine screenshot: transforming a column using md5(value)]

If I now save the data using the transformed (hashed) vendor name (either the SHA-1 hash or the MD5 hash), I can release the data without giving away the original vendor name, but whilst retaining the ability to identify all the rows associated with a particular vendor name.
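
By way of illustration outside of OpenRefine, here’s a minimal Python sketch of the same sort of one-way mapping using the standard library hashlib module (the vendor name is made up):

import hashlib

vendor = 'Acme Office Supplies Ltd'  # a made-up vendor name

# One-way mapping of the value to a hex digest, as md5()/sha1() do in OpenRefine
print(hashlib.md5(vendor.encode('utf-8')).hexdigest())   # 32 hex characters
print(hashlib.sha1(vendor.encode('utf-8')).hexdigest())  # 40 hex characters

Every row containing the same vendor name maps to the same digest, which is what lets us keep grouping rows by (hashed) vendor.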

One of the problems with MD5 and SHA-1 algorithms from a security point of view is that they run quickly. This means that a brute force attack can take a list of identifiers (or generate a list of all possible identifiers), run them through the hashing algorithm to get a set of hashed values, and then look up a hashed value to see what original identifier generated it. If the identifier is a fixed length and made from a fixed alphabet, the attacker can easily generate each possible identifier.
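
To see why speed matters, here’s a rough Python sketch of the brute force idea for a hypothetical identifier space of four digit numeric IDs – precompute the hash of every possible identifier, then simply look hashed values back up:

import hashlib

# Hypothetical identifier space: four digit numeric IDs, '0000' to '9999'
lookup = {hashlib.md5('{:04d}'.format(i).encode('utf-8')).hexdigest(): '{:04d}'.format(i)
          for i in range(10000)}

# Given a hash from the "anonymised" data, just look the original identifier back up
leaked = hashlib.md5('1234'.encode('utf-8')).hexdigest()
print(lookup[leaked])  # 1234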

One way of addressing this problem is to just add salt… In cryptography, a salt (sic) is a random term that you add to a value before generating the hash value. This has the advantage that it makes life harder for an attacker trying a brute force search but is easy to implement. If we are anonymising a dataset, there are a couple of ways we can quickly generate a salt term. The strongest way to do this is to generate a column containing unique random numbers or terms as the salt column, and then hash on the original value plus the salt. A weaker way would be to use the values of one of the other columns in the dataset to generate the hash (ideally this should be a column that doesn’t get shared). Even weaker would be to use the same salt value for each hash; this is more akin to adding a password term to the original value before hashing it.

Unfortunately, in the first two approaches, if we create a unique salt for each row, this will break any requirement that a particular identifier, for example, is always encoded as the same hashed value (we need to guarantee this if we want to do analysis on all the rows associated with it, albeit with those rows identified using the hashed identifier). So when we generate the salt, we ideally want a unique random salt for each identifier, and that salt to remain consistent for any given identifier.

If you look at the list of available GREL string functions you will see a variety of methods for processing string values that we might be able to combine to generate some unique salt values, trusting that an attacker is unlikely to guess the particular combination we have used to create the salt values. In the following example, I generate a salt that is a combination of a “fingerprint” of the vendor name (which will probably, though not necessarily, be different for each vendor name) and a secret fixed “password” term. This generates a consistent salt for each vendor name that is (probably) different from the salt of every other vendor name. We could add further complexity by adding a number to the salt, such as the length of the vendor name (value.length()) or the remainder of the length of the vendor name divided by some number (value.length()%7, for example, in this case using modulo 7).

[OpenRefine screenshot: generating the salt column]

Having generated a salt column (“Salt”), we can then create hash values of the original identifier and the salt value. The following shows both the value to be hashed (as a combination of the original value and the salt) and the resulting hash.

[OpenRefine screenshot: the combined value-plus-salt string and the resulting hash]
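
As a cross-check on the idea (rather than the OpenRefine recipe itself), here’s a rough Python sketch of the same approach – the fingerprint function below is just a crude normalisation standing in for OpenRefine’s fingerprint(), and the secret term is made up:

import hashlib

SECRET = 's0me-f1xed-term'  # made-up secret "password" component of the salt

def fingerprint(value):
    # Crude stand-in for OpenRefine's fingerprint(): lowercase, drop punctuation,
    # de-duplicate and sort the remaining tokens
    cleaned = ''.join(c if c.isalnum() or c.isspace() else ' ' for c in value.lower())
    return ' '.join(sorted(set(cleaned.split())))

def salted_hash(value):
    salt = fingerprint(value) + SECRET        # consistent salt for a given value
    return hashlib.sha1((value + salt).encode('utf-8')).hexdigest()

# The same vendor name always maps to the same hash...
print(salted_hash('Acme Office Supplies Ltd.'))
print(salted_hash('Acme Office Supplies Ltd.'))
# ...but an attacker brute forcing names alone, without the salt recipe, gets a different hash
print(hashlib.sha1('Acme Office Supplies Ltd.'.encode('utf-8')).hexdigest())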

As well as masking identifiers, anonymisation strategies also typically need to deal with items that can be uniquely identified because of their low occurrence in a dataset. For example, in an educational dataset, a particular combination of subjects or subject results might uniquely identify an individual. Imagine a case in which each student is given a unique ID, the IDs are hashed, and a set of assessment results is published containing (hashed_ID, subject, grade) data. Now suppose that only one person is taking a particular combination of subjects; that fact might then be used to identify their hashed ID from the supposedly anonymised data and associate it with that particular student.

OpenRefine may be able to help us identify possible problems in this respect by means of the faceted search tools. Whilst not a very rigorous approach, you could, for example, try querying the dataset with particular combinations of facet values to see how easily you might be able to identify unique individuals. In the above example of (hashed_ID, subject, grade) data, suppose I know there is only one person taking the combination of Further Maths and Ancient Greek, perhaps because there was an article somewhere about them, although I don’t know what other subjects they are taking. If I do a text facet on the subject column and select the Further Maths and Ancient Greek values, filtering results to students taking either of those subjects, and I then create a facet on the hashed ID column, showing results by count, there would only be one hashed ID value with a count of 2 rows (one row corresponding to their Further Maths participation, the other to their participation in Ancient Greek). I can then invade that person’s privacy by searching on this hashed ID value to find out what other subjects they are taking.
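
The same sort of check can be run outside OpenRefine, of course. Here’s a rough pandas sketch of the Further Maths/Ancient Greek example, using made-up column names and data:

import pandas as pd

# Made-up (hashed_ID, subject, grade) rows
df = pd.DataFrame([
    ('a1b2c3', 'Further Maths', 'A'),
    ('a1b2c3', 'Ancient Greek', 'B'),
    ('a1b2c3', 'Physics', 'A'),
    ('d4e5f6', 'Further Maths', 'B'),
    ('d4e5f6', 'History', 'A'),
], columns=['hashed_ID', 'subject', 'grade'])

# "Facet" on the two known subjects, then count rows per hashed ID
rare = df[df['subject'].isin(['Further Maths', 'Ancient Greek'])]
counts = rare.groupby('hashed_ID').size()

# Any ID appearing twice is taking both subjects - and is re-identifiable
suspects = counts[counts == 2].index
print(df[df['hashed_ID'].isin(suspects)])  # reveals their other subjects too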

Note that I am not a cryptographer or a researcher into data anonymisation techniques. To do this stuff properly, you need to talk to someone who knows how to do it properly. The technique described here may be okay if you just want to obscure names/identifiers in a dataset you’re sharing with work colleagues without passing on personal information, but it really doesn’t do much more than that.

PS A few starting points for further reading: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, Privacy, Anonymity, and Big Data in the Social Sciences and The Algorithmic Foundations of Differential Privacy.

Written by Tony Hirst

January 23, 2015 at 7:18 pm

Grayson Perry, Reith Lectures/Playing to the Gallery


In 2013, Grayson Perry delivered the BBC Reith Lectures as a series of talks entitled Playing to the Gallery, since made available in book form.

Here are some quotes (taken from the book) that made sense to me in the context of process…

To begin with:

[Art’s] most important role is to make meaning. [p111]

So is what you’re doing meaningful to you?

As an artist, the ability to resist peer pressure, to trust one’s own judgement, is vital but it can be a lonely and anxiety-inducing procedure. [p18]

You can work ideas…

It’s … a great joy to learn a technique, because as soon as you learn it, you start thinking in it. When I learn a new technique my imaginative possibilities have expanded. Skills are really important to learn; the better you get at a skill, the more you have confidence and fluency. [p122]

Artists have historically worked technology…

In the past artists were the real innovators of technology. [p100]

…but can we work technology, as art?

The metaphor that best describes what it’s like for my practice as an artist is that of a refuge, a place inside my head where I can go on my own and process the world and its complexities. It’s an inner shed in which I can lose myself. [p131]

Last week I said a final farewell to a good friend and great social technology innovator, Ches Lincoln, who died on Christmas Eve. In one project we worked on together, an online learning space, we created a forum called “The Shed” for those learners who wanted to get really geeky and techie in their questions and discussions, and whose conversations risked scaring off the majority.

And finally…

The sound a box of Lego makes is the noise of a child’s mind working, looking for the right piece. [p116]

Perfect…

Written by Tony Hirst

January 18, 2015 at 7:27 pm

Posted in Anything you want


The Camera Only Lies


For some reason, when I first saw this:

Google Word Lens iPhone GIF

it reminded me of Hockney. The most recent thing I’ve read, or heard, about Hockney was a recent radio interview (BBC: At Home With David Hockney), in the opening few minutes of which we hear Hockney talking about photographs: how the chemical period of photography, and its portrayal of perspective, is over, and how a new age of digital photography allows us to reimagine this sort of image making.

Hockney also famously claimed that lenses have played an important part in the craft of image making (Observer: What the eye didn’t see…). Whether or not people have used lenses to create particular images, the things seen through lenses, and the idea of how lenses may help us see the world differently, may in turn influence the way we see other things, either in the mind’s eye, in where we focus attention, or in how we construct or compose an image of our own making.

As the above animated image shows, when a lens is combined with a logical transformation (that may be influenced by a socio-cultural and/or language context acting as a filter) of the machine interpretation of a visual scene represented as bits, it’s not exactly clear what we do see…!

As Hockney supposedly once said, “[t]he photograph isn’t good enough. It’s not real enough”.

Written by Tony Hirst

January 14, 2015 at 10:54 am

Posted in Anything you want

Fragments – Wikipedia to Markdown


I’ve been sketching some ideas, pondering the ethics of doing an F1 review style book that blends (openly licensed) content from Wikipedia race reports with some of my own f1datajunkie charts, and also wondering about the extent to which I could automatically generate Wikipedia style race report sentences from the data. I think the sentence generation should, in general, be quite easy – the harder part would be identifying the “interesting” sentences (that is, the ones that make it into the report, rather than the totality of ones that could be generated).

So far, my sketches have been based around just grabbing the content from Wikipedia and transforming it to markdown, the markup language used in the Leanpub workflow:

In Python 3.x at least, I came across some encoding issues, and couldn’t seem to identify Wikipedia page sections. For what it’s worth, a minimal scribble looks something like this:

!pip3 install wikipedia
import wikipedia

#Search for page titles on Wikipedia
wikipedia.search('2014 Australian grand prix')

#Load a page
f1=wikipedia.page('2014 Australian Grand Prix')

#Preview page content
f1.content

#Preview a section's content by section header
f1.section('Qualifying')
##For some reason, f1.sections shows an empty list for me?


#pandoc supports Wikimedia to markdown conversion
!apt-get -y install pandoc
!pip3 install pypandoc
import pypandoc

#To work round encoding issues, write the content to a file and then convert it...
f = open('delme1.txt', 'w', encoding='utf8')
f.write(f1.content)
f.close()

md=pypandoc.convert('delme1.txt', 'md', format='mediawiki')

If the Formula One race report pages follow similar templates and use similar headings, then it should be straightforward enough to pull down sections of the reports and interleave them with charts and tables. (As well as the issue of parsing out section headers to fill the sections list, the tables on the page don’t appear to be grabbed into the .content field – assuming the API wrapper does manage to grab that content down at all? However, I can easily recreate those from things like the ergast API.)

Looking at the construction of sentences in the race reports, many of them are formulaic. However, as noted above, generating sentences is one thing, but generating interesting sentences is another. For that, I think we need to identify sets of rules that mark data features out as interesting or not before generating sentences from them.
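
As a very rough sketch of what I mean (the data structure and the “interestingness” rule here are both made up), the formulaic sentences are easy enough to template – the rule layer is where the real work would be:

# Made-up result records for a single race
results = [
    {'driver': 'Driver A', 'grid': 1, 'position': 1},
    {'driver': 'Driver B', 'grid': 5, 'position': 2},
    {'driver': 'Driver C', 'grid': 2, 'position': 10},
]

def race_sentence(r):
    # A formulaic, Wikipedia-ish report sentence
    return '{driver} started {grid} on the grid and finished in position {position}.'.format(**r)

def is_interesting(r):
    # A made-up rule: a big gain or loss relative to grid position is report-worthy
    return abs(r['grid'] - r['position']) >= 3

for r in results:
    if is_interesting(r):
        print(race_sentence(r))

In practice the rules would presumably need to draw on rather more context – gaps, retirements, championship position and so on – but the template/rule split seems like the right shape.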

Written by Tony Hirst

December 30, 2014 at 11:46 pm

Posted in Anything you want

Intellectual Talisman, Baudelaire


During tumultuous times there is often an individual, an intellectual talisman if you like, who watches events unfold and extracts the essence of what is happening into a text, which then provides a handbook for the oppressed. For the frustrated Paris-based artists battling with the Academy during the second half of the nineteenth century, Baudelaire was that individual, his essay, The Painter of Modern Life, the text.

… He claimed that ‘for the sketch of manners, the depiction of bourgeois life … there is a rapidity of movement which calls for an equal speed of execution of the artist’. Sound familiar? The essay goes on to feature several references to the word ‘flâneur‘, the concept of a man-about-town, which Baudelaire was responsible for bringing to the public’s attention, describing the role thus: ‘Observer, philosopher, flâneur – call him what you will… the crowd is his element, as the air is that of the birds and water of fishes. His passion and his profession are to become one flesh with the crowd. For the perfect flâneur, for the passionate spectator, it is an immense joy to set up house in the heart of the multitude, amid the ebb and flow of movement, in the midst of the fugitive and the infinite.’

There was no better provocation for the Impressionists to go out and paint en plein air. Baudelaire passionately believed that it was incumbent upon living artists to document their time…

And the way to do that was by immersing oneself in the day-to-day of metropolitan living: watching, thinking, feeling and finally recording. …

Will Gompertz, What Are You Looking At? 150 Years of Modern Art in the Blink of an Eye pp. 28-29

Written by Tony Hirst

December 19, 2014 at 11:01 am

Posted in Anything you want

A Briefest of Looks at the REF 2014 Results Data – Reflection on a Data Exercise


At a course module team meeting for the new OU data course […REDACTED…], which will involve students exploring data sets that we’ve given them, as well as ones that they will hopefully find for themselves, it was mentioned that we really should get an idea of how long the exercises we’ve written, are writing, and have yet to write will take students to do.

In that context, I noticed that the UK Higher Education Research Excellence Framework, 2014 results were out today, so I gave myself an hour to explore the data, see what’s there, and get an idea for some of the more obvious stories we might try to pull out.

Here’s as far as I got: an hour long conversation from a standing start with the REF 2014 data.

Although I did have a couple of very minor interruptions, I didn’t get as far as I’d expected/hoped.

So here are a few reflections, as well as some comments on the constraints I put myself under:

  • the students will be working in a headless virtual machine we have provided them with; we don’t make any requirement that students have access to a spreadsheet application; OpenRefine runs on the VM, so that could be used to get a preview of the spreadsheet (I’m not sure how well OpenRefine copes with multiple sheets, if a spreadsheet does contain multiple sheets?); given all that, I thought I’d try to explore the data purely within an IPython notebook, without (as @andysc put it) eyeballing the data in a spreadsheet first;
  • I didn’t really read any of the REF docs, so I wasn’t really sure how the data would be reported. I’m not sure how much time it would have taken to read up on the reporting data used, or what sort of explanatory notes and/or external metadata are provided?
  • I had no real idea what questions to ask or reports to generate. “League tables” was an obvious one, but calculated how? You need to know what numbers are available in the data set and how they may (or may not) relate to each other to start down that track. I guess I could have looked at distributions down a column, and then grouped in different ways, and then started to look for and identify the outliers, at least as visually revealed.
  • I didn’t do any charts at all. I had it half in mind to do some dodged bar charts, eg within an institution to show how the different profiles were scored within each unit of assessment for a given institution, but ran out of time before I tried that. (I couldn’t remember offhand what sort of shape the data needs to be in to plot that sort of chart, and then wasted a minute or two, gardener’s foot on fork, staring into the distance, pondering what we could do if I cast (unmelted) the separate profile data into different columns for each return, but then decided it’d use up too much of my hour trying to remember/look up how to do that, let alone then trying to make up stuff to do with the data once it was in that form.)
  • The exploration/conversation demonstrated grouping, sorting and filtering, though I didn’t limit the range of columns displayed. I did use a few cribs, both from the pandas online documentation and from other notebooks we have drafted for student use (eg on how to generate sorted group/aggregate reports on a dataframe).
  • our assessment will probably mark folk down for not doing graphical stuff… so I’d have lost marks for not putting even a quick chart in, such as a bar chart counting numbers of institutions by unit of assessment;
  • I didn’t generate any derived data – again, this is something we’d maybe mark students down on; an example I saw just now in the OU internal report on the REF results is GPA – grade point average. I’m not sure what it means, but while in the data I was wondering whether I should explore some function of the points (eg 4 x (num x 4*) + 3 x (num x 3*) … etc) or some function of the number of FTEs and the star rating results. (There’s a quick pandas sketch of this sort of grouping/derived-measure step after this list.)
  • Mid-way through my hour, I noticed that Chris Gutteridge had posted the data as Linked Data; Linked Data and SPARQL querying is another part of the course, so maybe I should spend an hour seeing what I can do with that endpoint from a standing start? (Hmm.. I wonder – does the Gateway to Research also have a SPARQL endpoint?)
  • The course is about database management systems, in part, but I didn’t put the data into either PostgreSQL or MongoDB, which are the two systems we introduce students to, or discuss the rationale for which db may have been a useful store for the data, the extent to which normalisation was required (eg taking the data to third normal form or wherever, and perhaps actually demonstrating that), etc. (In the course, we’ll probably also show students how to generate RDF triples they can run their own SPARQL queries against.) Nor did I throw the dataframe into SQLite using pandasql, which would have perhaps made it easier (and quicker?) to write some of the queries using SQL rather than the pandas syntax.
  • I didn’t link in to any other datasets, which again is something we’d like students to be able to do. At the more elaborate end might have been pulling in data from something like Gateway to Research? A quicker hack might have been to try annotating the data with administrative information, which I guess can be pulled from one of the datasets on data.ac.uk?
  • I didn’t do any data joining or merging; again, I expect we’ll want students to be able to demonstrate this sort of thing in an appropriate way, eg as a result of merging data in from another source.
  • writing filler text (setting the context, explaining what you’re going to do, commenting on results etc) in the notebook takes time… (That is, the hour is not just taken up by writing the code/queries; there is also time spent, but not seen, in coming up with questions to ask, as well as then converting them to queries and then reading, checking and mentally interpreting the results.)
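
By way of example of the grouping and derived-data steps mentioned above, here’s a quick pandas sketch; the column names, the data and the GPA reading are all assumptions rather than anything checked against the actual REF results file:

import pandas as pd

# Made-up rows in roughly the shape I recall the results taking: one row per
# institution / unit of assessment / profile, with percentages at each star level
df = pd.DataFrame([
    ('Univ A', 'Computer Science', 'Overall', 30.0, 45.0, 20.0, 5.0),
    ('Univ B', 'Computer Science', 'Overall', 22.0, 40.0, 28.0, 10.0),
    ('Univ A', 'History',          'Overall', 35.0, 40.0, 20.0, 5.0),
], columns=['Institution', 'Unit of assessment', 'Profile', '4*', '3*', '2*', '1*'])

# One possible reading of "GPA": star levels weighted by the percentage of activity at each
df['GPA'] = (4*df['4*'] + 3*df['3*'] + 2*df['2*'] + 1*df['1*']) / 100

# A sorted group report: institutions ranked by GPA within each unit of assessment
ranked = df.sort_values(['Unit of assessment', 'GPA'], ascending=[True, False])
print(ranked[['Unit of assessment', 'Institution', 'GPA']])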

One thing I suggested to the course team was that we all spend an hour on the data and see what we come up with. Another thing that comes to mind is what might I now be able to achieve in a second hour; and then a third hour. (This post has taken maybe half an hour?)

Another approach might have been to hand off notebooks to each other, the second person building on the first’s notebook etc. (We’d need to think how that would work for time: would the second person’s hour start before – or after – reading the first person’s notebook?) This would in some way model providing the student with an overview of a dataset and then getting them to explore it further, giving us an estimate of timings based on how well we can build on work started by someone else, but still getting us to work under a time limit.

Hmmm.. does this also raise the possibility of some group exercises? Eg one person has to normalise the data and get it into PostgreSQL, someone else gets some additional linkable data into the mix, someone starts generating summary textual reports and derived data elements, someone generates charts/graphical reports, and someone explores the Linked Data approach?

PS One other thing I didn’t look at, but that is a good candidate for all sorts of activity, would be to try to make comparisons to previous years. Requires finding and ordering previous results, comparing rankings, deciding whether rankings actually refer to similar things (and extent to which we can compare them at all). Also, data protection issues: could we identify folk likely to have been included in a return just from the results data, annotated with data from Gateway to Research, perhaps, or institutional repositories?

Written by Tony Hirst

December 18, 2014 at 1:59 pm

Sketching Scatterplots to Demonstrate Different Correlations


Looking just now for an openly licensed graphic showing a set of scatterplots that demonstrate different correlations between X and Y values, I couldn’t find one.

[UPDATE: following a comment, Rich Seiter has posted a much cleaner – and general – method here: NORTA Algorithm Examples; refer to that post – rather than this – for the method…(my archival copy of rseiter’s algorithm)]

So here’s a quick R script for constructing one, based on a Cross Validated question/answer (Generate two variables with precise pre-specified correlation):

library(MASS)

corrdata=function(samples=200,r=0){
  data = mvrnorm(n=samples, mu=c(0, 0), Sigma=matrix(c(1, r, r, 1), nrow=2), empirical=TRUE)
  X = data[, 1]  # standard normal (mu=0, sd=1)
  Y = data[, 2]  # standard normal (mu=0, sd=1)
  data.frame(x=X,y=Y)
}

df=data.frame()
for (i in c(1,0.8,0.5,0.2,0,-0.2,-0.5,-0.8,-1)){
  tmp=corrdata(200,i)
  tmp['corr']=i
  df=rbind(df,tmp)
}

library(ggplot2)

g=ggplot(df,aes(x=x,y=y))+geom_point(size=1)
g+facet_wrap(~corr)+ stat_smooth(method='lm',se=FALSE,color='red')

And here’s an example of the result:

[Faceted scatterplots for the different correlation values]

It’s actually a little tidier if we also add in + coord_fixed() to fix up the geometry/aspect ratio of the chart so the axes are of the same length:

[The same faceted scatterplots with coord_fixed() applied]

So what sort of OER does that make this post?!;-)

PS methinks it would be nice to be able to use different distributions, such as a uniform distribution across x. Is there a similarly straightforward way of doing that?

UPDATE: via comments, rseiter (maybe Rich Seiter?) suggests the NORmal-To-Anything (NORTA) algorithm (about, also here). I have no idea what it does, but here’s what it looks like!;-)

#Based on http://blog.ouseful.info/2014/12/17/sketching-scatterplots-to-demonstrate-different-correlations/#comment-69184
#The NORmal-To-Anything (NORTA) algorithm
library(MASS)
library(ggplot2)

#NORTA - h/t rseiter
corrdata2=function(samples, r){
  mu <- rep(0,4)
  Sigma <- matrix(r, nrow=4, ncol=4) + diag(4)*(1-r)
  rawvars <- mvrnorm(n=samples, mu=mu, Sigma=Sigma)
  #unifvars <- pnorm(rawvars)
  unifvars <- qunif(pnorm(rawvars)) # qunif not needed, but shows how to convert to other distributions
  print(cor(unifvars))
  unifvars
}

df2=data.frame()
for (i in c(1,0.9,0.6,0.4,0)){
  tmp=data.frame(corrdata2(200,i)[,1:2])
  tmp['corr']=i
  df2=rbind(df2,tmp)
}
g=ggplot(df2,aes(x=X1,y=X2))+geom_point(size=1)+facet_wrap(~corr)
g+ stat_smooth(method='lm',se=FALSE,color='red')+ coord_fixed()

Here’s what it looks like with 1000 points:

[Faceted scatterplots of the NORTA-generated uniform variables, 1000 points]

Note that with smaller samples, for the correlation at zero, the best fit line may wobble and may not have zero gradient, though in the following case, with 200 points, it looks okay…

[Faceted scatterplots of the NORTA-generated uniform variables, 200 points]

The method breaks if I set the correlation (r parameter) values to less than zero – Error in mvrnorm(n = samples, mu = mu, Sigma = Sigma) : ‘Sigma’ is not positive definite – but we can just negate the y-values (unifvars[,2]=-unifvars[,2]) and it seems to work…

If in the corrdata2 function we stick with the pnorm(rawvars) distribution rather than the uniform (qunif(pnorm(rawvars))) one, we get something that looks like this:

[Faceted scatterplots using the pnorm(rawvars) values rather than the uniform transform]

Hmmm. Not sure about that…?

Written by Tony Hirst

December 17, 2014 at 1:24 pm

Posted in Anything you want, Rstats

