Having got my promotion case through the sub-Faculty level committee (with support and encouragement from senior departmental colleagues), it’s time for another complete rewrite to try to get it through the Faculty committee. Guidance suggests that it is not inappropriate – and may even be encouraged – for a candidate to include something about their academic philosophy, so here are some scribbled thoughts on mine…
One of the declared Charter objects (sic) of the Open University is "to promote the educational well-being of the community generally", as well as "the advancement and dissemination of learning and knowledge". Both as a full-time PhD student with the OU (1993-1997), and then as an academic (1999-), I have pursued a model of open practice, driven by the idea of learning in public, with the aim of communicating academic knowledge into, and as part of, wider communities of practice, modelling learning behaviour through demonstrating my own learning processes, and originating new ideas in a challengeable and open way as part of my own learning journey.
My interest in open educational resources is in part a subterfuge, driven by a desire that educators be more open in demonstrating their own learning and critical practices, including the confusion and misconceptions they grapple with along the way, rather than being seen simply as professors of some sort of inalienable academic truth.
My interest in short course development is based on the belief that for the University to contribute effectively to continued lifelong education and professional development, we need to have offerings that are at an appropriate level of granularity as well as academic level. Degrees represent only one – early – part of that journey. Learners are unlikely to take more than one undergraduate degree in their lifetime, but there is no reason why they should not continue to engage in learning throughout their life. Evidence from the first wave of MOOCs suggests that many participants in those courses were already graduates, with an appreciation of the values of learning and the skills to enable them to engage with those offerings. The characterisation of MOOCs as
xMOOCs (traditional course style offerings) or the looser networked model of "connectivist MOOCs", cMOOCs, [H/T @r3becca in the comments;-)] represents different educational philosophies: the former may cruelly be described as being based on a model in which the learner expects to be taught (and the instructors expect to profess), whereas the latter requires that participants are engaged in a more personal, yet still collaborative, learning journey, where it is up to each participant to make sense of the world in an open and public way, informed and aided, but also challenged, by other participants. That's how I work every day. I try to make sense of the world to myself, often for a purpose, in public.
Much of my own learning is the direct result of applied problem solving. I try to learn something every day, often as the result of trying to do something each day that I haven't been able to do before. The OUseful.info blog is my own learning diary and a place I can look to refer to things I have previously learned. The posts are written in a way that reinforces my own learning, as a learning resource. The posts often take longer to write than the time taken to discover or originate the thing learned, because in them I try to represent a reflection and retelling of the rationale for the learning event and the context in which it arose: a problem to be solved, my state of knowledge at the time, the means by which I came to make sense of the situation in order to proceed, and the learning nugget that resulted. The thing I can see or do now but couldn't before. Capturing the "I couldn't do X because of Y but now I can, by doing Z" supports a similar form of discovery as the one supported by question and answer sites: the content is auto-optimised to include both naive and expert information, which aids discovery. (It has often amused me that course descriptions tend to be phrased in the terms and language you might expect to know having completed the course. Which doesn't help the novice discover it a priori, before they have learned those keywords, concepts or phrases that the course will introduce them to...). The posts also try to model my own learning process, demonstrating the confusion, showing where I had a misapprehension or just plain got it wrong. The blog also represents a telling of my own learning journey over an extended period of time, and as such may be thought of as an uncourse: something that could perhaps be looked at post hoc as a course, but that was originated as my own personal learning journey unfolded.
Hmmm… 1500 words for the whole begging letter, so I need to cut the above down to a sentence…
It’s been some time now since I drafted most of my early unit contributions to the TM351 Data management and analysis course. Part of the point (for me) in drafting that material was to find out what sorts of thing we actually wanted to say and help identify the sorts of abstractions we wanted to then build a narrative around. Another part of this (for me) means exploring new ways of putting powerful “academic” ideas and concepts into meaningful contexts; finding new ways to describe them; finding ways of using them in conjunction with other ideas; or finding new ways of using – or appropriating – them in general (which in turn may lead to new ways of thinking about them). These contexts are often process based, demonstrating how we can apply the ideas or put them to use (make them useful…) or use the ideas to support problem identification, problem decomposition and problem solving. At heart, I’m more of a creative technologist than a scientist or an engineer. (I aspire to being an artist…;-)
Someone who I think has a great take on conceptualising the data wrangling process – in part arising from his prolific tool building approach in the R language – is Hadley Wickham. His recent work for RStudio is built around an approach to working with data that he’s captured as follows (e.g. the “dplyr” tutorial at useR 2014, Pipelines for Data Analysis):
Following an often painful and laborious process of getting data into a state where you can actually start to work with it, you can then enter into an iterative process of transforming the data into various shapes and representations (often in the sense of re-presentations) that you can easily visualise or build models from. (In practice, you may have to keep redoing elements of the tidy step and then re-feed the increasingly cleaned data back into the sensemaking loop.)
Hadley’s take on this is that the visualisation phase can spring surprises on you but doesn’t scale very well, whilst the modeling phase scales but doesn’t surprise you.
To support the different phases of activity, Hadley has been instrumental in developing several software libraries for the R programming language that are particularly suited to the different steps. (For the modeling, there are hundreds of community developed and often very specialised R libraries for doing all manner of weird and wonderful statistics…)
In many respects, I’ve generally found the way Hadley has presented his software libraries to be deeply pragmatic – the tools he’s developed are useful and in many senses naturalistic; they help you do the things you need to do in a way that makes practical sense. The steps they encourage you to take are natural ones, and useful ones. They are the sorts of tools that implement the sorts of ideas that come to mind when you’re faced with a problem and you think: this is the sort of thing I need (to be able) to do. (I can’t comment on how well implemented they are; I suspect: pretty well…)
Just as the data wrangling process diagram helps frame the sorts of things you’re likely to do into steps that make sense in a “folk computational” way (in the sense of folk computing or folk IT (also here), a computational correlate to notions of folk physics, for example), Hadley also has a handy diagram for helping us think about the process of solving problems computationally in a more general, problem solving sense:
A cognitive think it step, identifying a problem, and starting to think about what sort of answer you want from it, as well as how you might start to approach it; a describe it step, where you describe precisely what it is you want to do (the sort of step where you might start scribbling pseudo-code, for example); and the computational do it step where the computational grunt work is encoded in a way that allows it to actually get done by machine.
I’ve been pondering my own stance towards computing lately, particularly from my own context as someone who sees computery stuff from a more technology-, tool-building- and tool-using-oriented perspective (that is, using computery things to help you do useful stuff), rather than framing it as a purer computer science or even “trad computing” take on operationalised logic, where the practical why is often ignored.
So I think this is how I read Hadley’s diagram…
Figuring out what the hell it is you want to do (imagining, the what for a particular why), figuring out how to do it (precisely; the programming step; the how); hacking that idea into a form that lets a machine actually do it for you (the coding step; the step where you express the idea in a weird incantation where every syllable has to be the right syllable; and from which the magic happens).
One of the nice things about Hadley’s approach to supporting practical spell casting (?!) is that the transformation or operational steps his libraries implement are often based around naturalistic verbs. They sort of do what they say on the tin. For example, the dplyr toolkit is built around a small set of verbs such as filter(), select(), arrange(), mutate(), summarise() and group_by().
These sort of map onto elements (often similarly named) familiar to anyone who has used SQL, but in a friendlier way. (They don’t SHOUT AT YOU for a start.) It almost feels as if they have been designed as articulations of the ideas that come to mind when you are trying to describe (precisely) what it is you actually want to do to a dataset when working on a particular problem.
In a similar way, the ggvis library (the interactive chart reinvention of Hadley’s ggplot2 library) builds on the idea of Leland Wilkinson’s “The Grammar of Graphics” and provides a way of summoning charts from data in an incremental way, as well as a functionally and grammatically coherent way. The words the libraries use encourage you to articulate the steps you think you need to take to solve a problem – and then, as if by magic, they take those steps for you.
If programming is the meditative state you need to get into to cast a computery-thing spell, and coding is the language of magic, things like dplyr help us cast spells in the vernacular.
I came across Apache Tika a few weeks ago, a service that will tell you what pretty much any document type is based on its metadata, and will have a good go at extracting text from it.
With a prompt and a 101 from @IgorBrigadir, it was pretty easy to get started with it – sort of…
First up, I needed to get the Apache Tika server running. As there’s a containerised version available on dockerhub (logicalspark/docker-tikaserver), it was simple enough for me to fire up a server in a click using tutum (as described in this post on how to run OpenRefine in the cloud in just a couple of clicks and for a few pennies an hour; pretty much all you need to do is fire up a server, start a container based on logicalspark/docker-tikaserver, and tick to make the port public…)
His suggested recipe using the python requests library borked for me – I couldn’t get python to open the file to get the data bits to send to the server (file encoding issues; one reason for using Tika is that it’ll try to accept pretty much anything you throw at it…).
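With hindsight, something like the following minimal sketch – opening the file in binary mode and letting requests PUT the raw bytes – might have sidestepped the encoding issue (untested; the URL and file path are just placeholders):

import requests

# Open the file in binary mode so Python doesn't try to decode it as text
with open('Text/foo.doc', 'rb') as f:
    r = requests.put('http://example.com:9998/rmeta', data=f,
                     headers={'Accept': 'application/json'})

print(r.status_code)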
I had a look at pycurl:
!apt-get install -y libcurl4-openssl-dev
!pip3 install pycurl
but couldn’t make head or tail of how to use it: the pycurl equivalent of curl -T foo.doc http://example.com:9998/rmeta can’t be that hard to write, can it? (Translations appreciated via the comments…;-)
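For what it’s worth, my best guess at such a translation looks something like this (an untested sketch – curl -T does an HTTP PUT, so the upload option is set and the file is fed in as a binary file object):

import os
import pycurl
from io import BytesIO

buf = BytesIO()
c = pycurl.Curl()
c.setopt(pycurl.URL, 'http://example.com:9998/rmeta')
c.setopt(pycurl.UPLOAD, 1)  # curl -T uploads the file with an HTTP PUT

with open('Text/foo.doc', 'rb') as f:
    c.setopt(pycurl.READDATA, f)  # file must be opened in binary mode
    c.setopt(pycurl.INFILESIZE, os.path.getsize('Text/foo.doc'))
    c.setopt(pycurl.WRITEDATA, buf)  # collect the response body
    c.perform()

c.close()
print(buf.getvalue().decode('utf-8'))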
Instead I took the approach of dumping the result of a curl request on the command line into a file:
!curl -T Text/foo.doc http://example.com:9998/rmeta > tikatest.json
and then grabbing the response out of that:
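For example, something like the following minimal sketch pulls the dumped response back into Python (the structure of the /rmeta output is an assumption on my part – it appears to be a JSON list of metadata records, with the extracted text in an “X-TIKA:content” field):

import json

# Load the JSON dumped by the curl command above
with open('tikatest.json') as f:
    tikaresponse = json.load(f)

# Assumed structure: a list of metadata dicts, one per document
print(tikaresponse[0].keys())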
Not elegant, and far from ideal, but a stop gap for now.
Part of the response from the Tika server is the text extracted from the document, which can then provide the basis for some free text analysis…
I haven’t tried with any document types other than crappy old MS Word .doc formats, but this looks like it could be a really easy tool to use.
And with the containerised version available, and tutum and Digital Ocean to hand, it’s easy enough to fire up a version in the cloud, let alone my desktop, whenever I need it:-)
I’m not sure how many Chrome users follow any of the Google blogs that occasionally describe forthcoming updates to Google warez, but if you don’t you perhaps don’t realise quite how frequently things change. My browser, for example, is at something like version 40, even though I never consciously update it.
One thing I only noticed recently is that a tab has appeared in the top right hand corner of the browser showing that I’m logged in (to the browser) with a particular Google account. There doesn’t actually appear to be an option to log out – I can switch user or go incognito – and I’m not sure I remember even consciously logging in to it (actually, maybe a hazy memory, when I wanted to install a particular extension), and I have no idea what it actually means for me to be logged in.
Via the Google Apps Update blog, I learned today that being logged in to the browser will soon support seamless synching of my Google docs into my Chrome browser environment (Offline access to Google Docs editors auto-enabled when signing into Chrome browser on the web). Following a pattern popularised by Apple, Google are innovating on our behalf and automatically opting us in to behaviours it thinks make sense for us. So just bear that in mind when you write a ranty resignation letter in Google docs and wonder why it’s synched to your work computer on your office desk:
Note that Google Apps users should not sign into a Chrome browser on public/non-work computers with their Google Apps accounts to avoid unintended file syncing.
If you actually have several Google apps accounts (for example, I have a personal one, and a couple of organisational ones: an OU one, an OKF one), I assume that the only docs that are synched are the ones on an account that matches the account I have signed in to in the browser. That said, synch permissions may be managed centrally for organisational apps accounts:
Google Apps admins can still centrally enable or disable offline access for their domain in the Admin console… Existing settings for domain-level offline access will not be altered by this launch.
I can’t help but admit that even though I won’t have consciously opted in to this feature – just like I don’t really remember logging in to Chrome on my desktop (how do I log out???), and I presumably agreed to something when I installed Chrome to let it keep updating itself without prompting me – I will undoubtedly find it useful one day: on a train, perhaps, when trying to update a document I’d forgotten to synch. It will be so convenient I will find it unremarkable, not noticing I can now do something I couldn’t do as easily before. Or I might notice, with a “darn, I wish I’d…” followed by an “oh, cool [kewel…], I can…”.
“‘Oceania has always been at war with Eastasia.'” [George Orwell, 1984]
Just like when – after being sure I’d disabled, or explicitly not opted in to, any sort of geo-locating or geo-tracking behaviour on my Android phone – I found I must have left a door open somewhere (or been automatically opted in to something I hadn’t appreciated when agreeing to a particular update, or, by proxy, agreed to allow something to update itself automatically and without prompting, with implied or explicit permission to automatically opt me in to new features…), and found I could locate my misplaced phone using the Android Device Manager (Where’s My Phone?).
This idea of allowing applications to update themselves in the background and without prompting is something we have become familiar with in many web apps, and in desktop apps such as Google Chrome, though many apps do still require the user to either accept the update or take an even more positive action to install an update when notified that one is available. (It seems that ever fewer apps require you to specifically search for updates…)
In the software world, we have gone from a world where the things we buy were immutable, to one where we could search for and install updates (e.g. to operating systems or software applications), then accept updates when alerted to the fact, to automatically (and invisibly) accepting updates.
In turn, many physical devices have gone from being purely mechanical affairs, to electro-mechanical ones, to logical-electro-mechanical devices (for example, that include logic elements hardwired into silicon), to ones containing factory programmable hardware devices (PROMs, programmable Read Only Memories), to devices that run programmable, and then reprogrammable, firmware (that is to say, software).
If you have a games console, a Roku or MyTV box, or Smart TV, you’ve probably already been prompted to get a (free) online update. I don’t know, but could imagine, new top end cars having engine management system updates at regular service events.
However, one thing perhaps we don’t fully appreciate is that these updates can also be used to limit functionality that our devices previously had. If the updates are done seamlessly (without permission, in the background) this may come as something of a surprise. [Cf. the complementary issue of vendors having access to “their” content on “your” machine, as described here by the Guardian: Amazon wipes customer’s Kindle and deletes account with no explanation]
A good example of loss of functionality arising from an (enforced, though self-applied) firmware update was reported recently in the context of hobbyist drones:
On Wednesday, SZ DJI Technology, the Chinese company responsible for the popular DJI Phantom drones that online retailers sell for less than $500, announced that it had prepared a downloadable firmware update for next week that will prevent drones from taking off in restricted zones and prevent flight into those zones.
Michael Perry, a spokesman for DJI, told the Guardian that GPS locating made such an update possible: “We have been restricting flight near airports for almost a year.”
“The compass can tell when it is near a no-fly zone,” Perry said. “If, for some reason, a pilot is able to fly into a restricted zone and then the GPS senses it’s in a no-fly zone, the system will automatically land itself.”
DJI’s new Phantom drones will ship with the update installed, and owners of older devices will have to download it in order to receive future updates.
What correlates might be applied to increasingly intelligent cars, I wonder?! Or at the other extreme, phones..?
PS How to log out of Chrome You need to administer yourself… From the Chrome Preferences Settings (sic), Disconnect your Google account.
Note that you have to take additional action to make sure that you remove all those synched presentations you’d prepared for job interviews at other companies from the actual computer…
Take care out there…!;-)
A couple of years ago or so, Dropbox ran a promotion for academic users granting 15GB of space. Yesterday, I got an email:
As part of your school’s participation in Space Race, you received 15 GB of additional Dropbox space. The Space Race promotional period expires on March 4, 2015, at which point your Dropbox limit will automatically return to 5 GB.
As a friendly reminder, you’re currently using 14.6 GB of Dropbox space. If you’re over your 5 GB limit after March 4, you’ll no longer be able to save new photos, videos, and docs to Dropbox.
Need more space? Dropbox Pro gives you 1 TB of space to keep everything safe, plus advanced sharing controls, remote wipe for lost devices, and priority support. Upgrade before March 4 and we’ll give you 30% off your first year.
My initial thought was to tweet:
but then I thought again… The discounted price on a monthly payment plan is £5.59/month, which on PayPal converted this month to $8.71. I use Dropbox all the time, and it forms part of my workflow for using Leanpub. As it’s the start of the month, I received a small royalty payment for the Wrangling F1 Data With R book. The Dropbox fee is about the amount I’m getting per book sold, so it seems churlish not to subscribe to Dropbox – it is part of the cost of doing business, as it were.
The Dropbox subscription gets me 1TB, so this also got me thinking:
- space is not now an issue, so I can move the majority of my files to Dropbox, not just a selection of folders;
- space is not now an issue, so I can put all my github clones into Dropbox;
- space is not now an issue, so though it probably goes against terms of service, I guess I could set up toplevel “family member” folders and we could all share the one subscription account, just selectively synching our own folders?
In essence, I can pretty much move to Dropbox (save for those files I don’t want to share/expose to US servers etc etc; just in passing, one thing Dropbox doesn’t seem to want to let me do is change the account email to another email address that I have another Dropbox account associated with. So I have a bit of an issue with juggling accounts…)
When I started my Wrangling F1 Data With R experiment, the intention was always to make use of any royalties to cover the costs associated with that activity. Leanpub pays out if you are owed more than $40 collected in the run up to 45 days ahead of a payment date (so the Feb 1st payout was any monies collected up to mid-December and not refunded since). If I reckon on selling 10 books a month, that gives me about $75 at current running. Selling 5 a month (so one a week) means it could be hit or miss whether I make the minimum amount to receive a payment for that month. (I could of course put the price up. Leanpub lets you set a minimum price but allows purchasers to pay what they want. I think $20 is the highest amount paid for a copy I’ve had to date, which generated a royalty of $17.50 (whoever that was – thank you :-)) You can also give free or discounted promo coupons away.) As part of the project is to explore ways of identifying and communicating motorsport stories, I’ve spent royalties so far on:
- a subscription to GP+ (not least because I aspire to getting a chart in there!;-);
- a subscription to the Autosport online content, in part to gain access to forix, which I’d forgotten is rubbish;
- a small donation to sidepodcast, because it’s been my favourite F1 podcast for a long time.
Any books I buy in future relating to sports stats or motorsport will be covered henceforth from this pot. Any tickets I buy for motorsport events, and programmes at such events, will also be covered from this pot. Unfortunately, the price of an F1 ticket/weekend is just too much. A Sky F1 Channel subscription or day passes are also ruled out because I can’t for the life of me work out how much it’ll cost or how to subscribe; but I suspect it’ll be more than the £10 or so I’d be willing to pay per race (where race means all sessions in a race weekend). If my F1 iOS app subscription needs updating that’ll also count. Domain name registration (for example, I recently bought f1datajunkie.com) is about £15/$25 a year from my current provider. (Hmm, that seems a bit steep?) I subscribe to Racecar Engineering (£45/$70 or so per year), the cost of which will get added to the mix. A “big ticket” item I’m saving for (my royalties aren’t that much) on the wants list is a radio scanner to listen in to driver comms at race events (I assume it’d work?). I’d like to be able to make a small regular donation to help keep the ergast site on, but can’t see how to… I need to bear in mind tax payments, but also consider the above as legitimate costs of a self-employed business experiment.
I also figure that as an online publishing venture, any royalties should also go to supporting other digital tools I make use of as part of it. Some time ago, I bought in to the pinboard.in social bookmarking service, I used to have a flickr pro subscription (hmm, I possibly still do? Is there any point…?!) and I spend $13 a year with WordPress.com on domain mapping. In the past I have also gone ad-free ($30 per year). I am considering moving to another host such as Squarespace ($8 per month), because WordPress is too constraining, but am wary of what the migration will involve and how much will break. Whilst self-hosting appeals, I don’t want the grief of doing my own admin if things go pear shaped.
I’m a heavy user of RStudio, and have posted a couple of Shiny apps. I can probably get by on the shinyapps.io free plan for a bit (10 apps) – just – but the step up to the basic plan at $39 a month is too steep.
I used to use Scraperwiki a lot, but have moved away from running any persistent scrapers for some time now. morph.io (which is essentially Scraperwiki classic) is currently free – though looks like a subscription will appear at some point – so I may try to get back into scraping in the background using that service. The Scraperwiki commercial plan is $9/month for 10 scrapers, $29 per month for 100. I have tended in the past to run very small scrapers, which means the number of scrapers can explode quickly, but $29/month is too much.
I also make use of github on a free/open plan, and while I don’t currently have any need for private repos, the entry level micro-plan ($7/month) offers 5. I guess I could use a (private?) github rather than Dropbox for feeding Leanpub, so this might make sense. Of course, I could just treat such a subscription as a regular donation.
It would be quite nice to have access to IPython notebooks online. The easiest solution to this is probably something like wakari.io, which comes in at $25/month, which again is a little bit steep for me at the moment.
In my head, I figure £5/$8/month is about one book per month, £10/$15 is two, £15/$20 is three, £25/$40 is 5. I figure I use these services and I’m making a small amount of pin money from things associated with that use. To help guarantee continuity in provision and maintenance of these services, I can use the first step of a bucket brigade style credit apportionment mechanism to redistribute some of the financial benefits these services have helped me realise.
Ideally, what I’d like to do is spend royalties from 1 book per service per month, perhaps even via sponsored links… (Hmm, there’s a thought – “support coupons” with minimum prices set at the level to cover the costs of running a particular service for one month, with batches of 12 coupons published per service per year… Transparent pricing, hypothecated to specific costs!)
Of course, I could also start looking at running my own services in the cloud, but the additional time cost of getting up and running, as well as hassle of administration, and the stress related to the fear of coping in the face of attack or things properly breaking, means I prefer managed online services where I use them.
A couple of days ago, I came across a dataset on figshare (a data sharing site) detailing the article processing charges (APCs) paid by the University of Portsmouth to publishers in 2014. After I casually (lazily…;-) remarked on the existence of this dataset via Twitter, Owen Stephens/@ostephens referred me to a JISC project that is looking at APCs in more detail, with prototype data explorer here: All APC demonstrator [Github repository].
The project looks as if it is part of Jisc Collections’ look at the Total Cost of Ownership in the context of academic publishing, summing things like journal subscription fees alongside “article processing charges” (which I’d hope include page charges?).
If you aren’t in academia, you may not realise that what used to be referred to as ‘vanity publishing’ (paying to get your first novel or poetry collection published) is part of the everyday practice of academic publishing. But it isn’t called that, obviously, because your work also has to be peer reviewed by other academics… So it’s different. It’s “quality publishing”.
Peer review is, in part, where academics take on the ownership of the quality aspects of academic publishing, so if the Total Cost of Ownership project is trying to be relevant to institutions and not just to JISC, I wonder if there should also be columns in the costing spreadsheet relating to the work time academics spend reviewing other people’s articles, editing journals, and so on. This is different to the presentational costs, obviously, because you can’t just write a paper and submit it: you have to submit it in an appropriately formatted document and “camera ready” layout, which can also add a significant amount of time to preparing a paper for publication. So you do the copy editing and layout too. And so any total costing to an academic institution of the research publishing racket should probably include this time too. But that’s by the by.
The data that underpins the demonstrator application was sourced from a variety of universities and submitted in spreadsheet form. A useful description (again via @ostephens) of the data model can be found here: APC Aggregation: Data Model and Analytical Usage. Looking at it, it just seems to cover APCs.
APC data relating to the project can be found on figshare. I haven’t poked around in the demonstrator code or watched its http traffic to see if there are API calls onto the aggregated data that provide another way in to it.
As well as page charges, there are charges associated with subscription fees to publishers. Publishers don’t like this information getting out on grounds of commercial sensitivity, and universities don’t like publishing it presumably on grounds of bringing themselves into disrepute (you spend how much?!), but there is some information out there. Data from a set of FOI requests about journal subscriptions (summarised here), for example. If you want to wade through some of the raw FOI responses yourself, have a look on WhatDoTheyKnow: FOI requests: “journal costs”.
Tim Gowers also wrote compellingly about his FOI escapades trying to track down journal subscription costs data: Elsevier journals – some facts.
This is all very well, but is it in any way useful? I have no idea. One thing I imagined might be quite amusing to explore was the extent to which journal subscriptions paid their way (or were “cost effective”). For example, looking at institutional logs, how often are (articles from) particular journals being accessed or downloaded either for teaching or research purposes? (Crudely: teaching – access comes from a student account; research – access from a research account.) On the other hand, for the research outputs of the institution, how many things are being published into a particular journal, and how many citations appear in those outputs to other publications?
If we take the line that use demonstrates value, then use might be captured as downloads from a journal, publications into it, or references made to it. (That’s very crude, but then I’m approaching this as a possible recreational data exercise, not a piece of formal research. And yes – I know, journals are often bundled up in subscription packages together, and just like Sky blends dross with desirable channels in its subscription deals, I suspect academic publishers do too… But then, we could start to check these based on whether particular journals in a bundle are ever accessed, ever referenced, ever published into within a particular organisation, etc. Citation analysis can also help here – for example, if 5 journals all heavily cite each other, and one publisher publishes 3 of those, it could make sense for them to bundle two of the journals into one package and the third into another, so if a topic is reported by heavily linked articles across those journals, you can essentially force people researching that topic into subscribing to both packages. Without having a look at citation network analyses and subscription bundles, I can’t check that outlandish claim of course;-)
Erm… that’s it…
PS see also Evaluating big deal journal bundles (via @kpfssport)
PPS for a view from the publishers’ side on the very real costs associated with publishing, as well as a view on how academia and business treat employment costs and “real” costs in rather contrasting ways, see Time is Money: Why Scholarly Communication Can’t Be Free.
Whilst preparing for what turned out to be a very enjoyable day at the BBC Data Day in Birmingham on Tuesday, where I ran a session on Open Refine [slides], I’d noticed that one of the transformations Open Refine supports is hashing, using either the MD5 or SHA-1 algorithms. What these functions essentially do is map a value, such as a name or personal identifier, on to what looks like a random number. The mapping is one way, so given the hash value of a name or personal identifier, you can’t go back to the original. (The way the algorithms work means that there is also a very slight possibility that two different original values will map on to the same hashed value, which may in turn cause errors when analysing the data.)
We can generate the hash of values in a column by transforming the column using the formula md5(value) or sha1(value).
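Outside of OpenRefine, the same sort of mapping is easy to check with a couple of lines of Python (a minimal sketch; the vendor name here is just a made-up example):

import hashlib

vendor = 'ACME Supplies Ltd'  # hypothetical vendor name

# Equivalent of GREL's md5(value) and sha1(value): a fixed-length hex digest
print(hashlib.md5(vendor.encode('utf-8')).hexdigest())
print(hashlib.sha1(vendor.encode('utf-8')).hexdigest())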
If I now save the data using the transformed (hashed) vendor name (either the SHA-1 hash or the MD5 hash), I can release the data without giving away the original vendor name, but whilst retaining the ability to identify all the rows associated with a particular vendor name.
One of the problems with MD5 and SHA-1 algorithms from a security point of view is that they run quickly. This means that a brute force attack can take a list of identifiers (or generate a list of all possible identifiers), run them through the hashing algorithm to get a set of hashed values, and then look up a hashed value to see what original identifier generated it. If the identifier is a fixed length and made from a fixed alphabet, the attacker can easily generate each possible identifier.
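As a toy illustration of why that matters, suppose (hypothetically) that identifiers are known to be two upper-case letters followed by two digits; the whole identifier space can be enumerated and reversed more or less instantly:

import hashlib
import string
from itertools import product

# Build a reverse lookup table over every possible identifier of the assumed form
lookup = {}
for letters in product(string.ascii_uppercase, repeat=2):
    for digits in product(string.digits, repeat=2):
        ident = ''.join(letters) + ''.join(digits)
        lookup[hashlib.md5(ident.encode('utf-8')).hexdigest()] = ident

# Any unsalted MD5 hash of such an identifier is now trivially reversible
target = hashlib.md5('QT42'.encode('utf-8')).hexdigest()
print(lookup[target])  # QT42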
One way of addressing this problem is to just add salt… In cryptography, a salt (sic) is a random term that you add to a value before generating the hash value. This has the advantage that it makes life harder for an attacker trying a brute force search but is easy to implement. If we are anonymising a dataset, there are a couple of ways we can quickly generate a salt term. The strongest way to do this is to generate a column containing unique random numbers or terms as the salt column, and then hash on the original value plus the salt. A weaker way would be to use the values of one of the other columns in the dataset to generate the hash (ideally this should be a column that doesn’t get shared). Even weaker would be to use the same salt value for each hash; this is more akin to adding a password term to the original value before hashing it.
Unfortunately, in the first two approaches, if we create a unique salt for each row, this will break any requirement that a particular identifier, for example, is always encoded as the same hashed value (we need to guarantee this if we want to do analysis on all the rows associated with it, albeit with those rows identified using the hashed identifier). So when we generate the salt, we ideally want a unique random salt for each identifier, and that salt to remain consistent for any given identifier.
If you look at the list of available GREL string functions you will see a variety of methods for processing string values that we might be able to combine to generate some unique salt values, trusting that an attacker is unlikely to guess the particular combination we have used to create the salt values. In the following example, I generate a salt that is a combination of a “fingerprint” of the vendor name (which will probably, though not necessarily, be different for each vendor name) and a secret fixed “password” term added to it. This generates a consistent salt for each vendor name that is (probably) different from the salt of every other vendor name. We could add further complexity by adding a number to the salt, such as the length of the vendor name (value.length()) or the remainder of the length of the vendor name divided by some number (value.length()%7, for example, in this case using modulo 7).
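The same idea, sketched in Python rather than GREL (the secret string is obviously made up, and a keyed hash such as HMAC would be the more conventional construction): derive the salt deterministically from the identifier plus a secret, so each identifier always gets the same salt – and hence the same hash – while an attacker without the secret can no longer precompute a lookup table.

import hashlib

SECRET = 'some-secret-passphrase'  # hypothetical; keep this out of the shared dataset

def salted_hash(identifier):
    # Salt is derived from the identifier plus the secret, so it is consistent
    # for a given identifier but different across identifiers
    salt = hashlib.sha1((SECRET + identifier).encode('utf-8')).hexdigest()
    return hashlib.sha1((identifier + salt).encode('utf-8')).hexdigest()

print(salted_hash('ACME Supplies Ltd'))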
Having generated a salt column (“Salt”), we can then create hash values of the original identifier and the salt value. The following shows both the value to be hashed (as a combination of the original value and the salt) and the resulting hash.
As well as masking identifiers, anonymisation strategies also typically require attention to items that can be uniquely identified because of their low occurrence in a dataset. For example, in an educational dataset, a particular combination of subjects or subject results might uniquely identify an individual. Imagine a case in which each student is given a unique ID, the IDs are hashed, and a set of assessment results is published containing (hashed_ID, subject, grade) data. Now suppose that only one person is taking a particular combination of subjects; that fact might then be used to identify their hashed ID from the supposedly anonymised data and associate it with that particular student.
OpenRefine may be able to help us identify possible problems in this respect by means of the faceted search tools. Whilst not a very rigorous approach, you could, for example, try querying the dataset with particular combinations of facet values to see how easily you might be able to identify unique individuals. In the above example of (hashed_ID, subject, grade) data, suppose I know there is only one person taking the combination of Further Maths and Ancient Greek, perhaps because there was an article somewhere about them, although I don’t know what other subjects they are taking. If I do a text facet on the subject column and select the Further Maths and Ancient Greek values, filtering results to students taking either of those subjects, and I then create a facet on the hashed ID column, showing results by count, there would only be one hashed ID value with a count of 2 rows (one row corresponding to their Further Maths participation, the other to their participation in Ancient Greek). I can then invade that person’s privacy by searching on this hashed ID value to find out what other subjects they are taking.
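A scripted version of that sort of check is also easy to sketch, for example with pandas (the file and column names are hypothetical): group the rows by hashed ID, treat each student’s set of subjects as their “combination”, and flag any combination taken by only one person.

import pandas as pd

# Hypothetical (hashed_ID, subject, grade) dataset
df = pd.read_csv('results.csv')

# Each student's combination of subjects, as a sorted tuple
combos = df.groupby('hashed_ID')['subject'].apply(lambda s: tuple(sorted(s)))

# Combinations taken by exactly one student are potential re-identification risks
combo_counts = combos.value_counts()
print(combo_counts[combo_counts == 1])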
Note that I am not a cryptographer or a researcher into data anonymisation techniques. To do this stuff properly, you need to talk to someone who knows how to do it properly. The technique described here may be okay if you just want to obscure names/identifiers in a dataset you’re sharing with work colleagues without passing on personal information, but it really doesn’t do much more than that.
PS A few starting points for further reading: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, Privacy, Anonymity, and Big Data in the Social Sciences and The Algorithmic Foundations of Differential Privacy.