One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, Jupyter notebook server, Shiny server, OpenRefine, PostgreSQL and MongDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.
(Actually, I think I had a script that was more or less there for chunks of that when I was looking at a docker solution for the databases courses, but that fell by the way side and I suspect the the Jupyter container (IPython notebook server, as was), probably needs a fair bit of updating by now. And I’ve no time or mental energy to look at it right now…:-(
Anyway, the IBM Data Scientist Workbench now sits alongside things like KMis longstanding KMi Crunch Learning Analytics Environment (RStudio + MySQL), and the Australian ResBaz Cloud – Containerised Research Apps Service in my list of why the heck can’t we get our act together to offer this sort of SaaS thing to learners? And yes I know there are cost applications…. but, erm, sponsorship, cough… get-started tokens then PAYG, cough…
It currently offers access to personal persistent storage and the ability to launch OpenRefine, RStudio and Jupyter notebooks:
The toolbar also suggest that the ability to “discover” pre-identified data sources and run pre-configured modeling tools is also on the cards.
The applications themselves run off a subdomain tied to your account – and of course, they’re all available through the browser…
So what’s next? I’d quite like to see ‘data import packs’ that would allow me to easily pull in data from particular sources, such as the CDRC, and quickly get started working with the data. (And again: yes, I know, I could start doing that anyway… maybe when I get round to actually doing something with isleofdata.com ?!;-)
See also these recipes for running app containers on Digital Ocean via Tutum: RStudio, Shiny server, OpenRefine and OpenRefine reconciliation services, and these Seven Ways of Running IPython / Jupyter Notebooks.
Via a tweet, and then an email, to myself and fellow OpenRefine evengelist, Owen Stephens (if you haven’t already done so, check out Owen’s wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was having to contend with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report. So something rather like this:
TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project
The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?
The fields that appeared in this combined column were variable, (not all of them appeared in each row) but always in the same order. So for example, a total collapsed record might look like:
Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:
The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project
The pattern Appeared to be Column Name: value exept for the Title where there was no colon.
On occasion, a row would contain an exceptional item that did not conform to the pattern:
One way to split out the columns is to use a regular expression. We can parse a column using the “Add column based on this column” action:
If all the columns always appeared in the same order, we could write something like the following GREL regular expression to match each column and it’s associated value:
value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)
To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.
value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)
However, as the above example shows, using the greedy .* operator means we match everything in the first group. So instead, we need to use a lazy evaluation to match items within a group: .+?
value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)
So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we need to look for not the expected first pattern in our list:
value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)
Having matched groups, how do we split the relevant items into news columns. One way is to introduce a column separator character sequence (such as ::) that we can split on:
forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')
This generates rows of the form:
We can now split these cells into several columns:
We use the :: sequence as the separator:
Once split, the columns should be regularly arranged. For “rogue” items, they should appear in the first new column – any values appearing in the column might be used to help us identify any further tweaks required to our regular expression.
We now need to do a little more cleaning. For example, tidying up column names:
And then cleaning down each new column to remove the column heading.
As a general pattern, use the column name and an optional colon (NOTE: expression should be :? rather than :+):
To reuse this pattern of operations on future datasets, we can export a description of the transformations applied. Future datasets can then be loaded in to OpenRefine, the operation history pasted in, and the same steps applied. (The following screenshot does not show the operation defined for renaming the new columns or cleaning down them.)
As ever, writing up this post took as long as working out the recipe…
PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp foo is up to that just at the moment. Any offers?!;-)
Over the weekend, I rediscovered Michael Bauer/@mihi_tr’s Reconcile CSV [code] service that builds an OpenRefine reconciliation service on top of a CSV file. One column in the CSV file contains a list of values that you want to reconcile (that is, fuzzy match) against, the other is a set of key identifier values associated with the matched against value.
The default container uses a CSV file of UK MP names (current and previous) and returns their full title and an identifier used in the UK Parliament Members’ names data platform.
To run service in boot2docker:
- docker run -p 8002:8000 --name mprecon -d psychemedia/docker-reconciliation
- boot2docker ip to get the IP address the service is running on, eg 192.168.59.103
- Test the service in your browser: http://192.168.59.103:8002/reconcile?query=David Cameroon
In OpenRefine, set the reconciliation service URL to http://192.168.59.103:8002/reconcile.
NOTE: I had thought I should be able to fire up linked OpenRefine and ReconcileCSV containers and address more conveniently, for example:
docker run --name openrefiner -p 3335:3333 --link mprecon:mprecon -d psychemedia/openrefine
and then setting something like http://mprecon:8000/reconcile as the reconciliation service endpoint, but that didn’t seem to work? Instead I had to use the endpoint routed to host (http://192.168.59.103:8002/reconcile).
I also added some command line parameters to the script so that you can fire up the container and reconcile against your own CSV file:
docker run -p 8003:8000 -v /path/to/:/tmp/import -e RECONFILE=myfile.csv -e SEARCHCOL=mysearchcol -e IDCOL=myidcol --name recon_mycsv -d psychemedia/docker-reconciliation
This loads in the file on your host computer at /path/to/myfule.csv using the column named mysearchcol for the search/fuzzy match values and the column named myidcol for the identifiers.
It struck me that I could then commit this customised container as a docker image, and push it to dockerhub as a tagged image. Permissions mean I can’t push to the original trusted/managed repository that builds containers from my github repo, but I can create a new dockerhub repository containing tagged images. For example:
docker commit recon_mycsv psychemedia/docker-reconciler:recon_mycsv
docker push psychemedia/docker-reconciler:recon_mycsv
This means I can collect a whole range of reconciliation services, each independently tagged, at psychemedia/docker-reconciler – tags.
So for example:
- docker run --name reconcile_ukmps -p 8008:8000 -d psychemedia/docker-reconciler:ukmps_pastpresent runs a reconciliation service agains UK past and present MPs on port 8008;
- docker run --name reconcile_westminster -p 8009:8000 -d psychemedia/docker-reconciler:westminster_constituency runs a reconciliation service against Westminster constituencies on port 8009.
In practice the current reconciliation service only seems to work well on small datasets, up to a few thousand lines, but nonetheless it can still be useful to be able to reconcile against such datasets. For larger files – such as the UK Companies House register, where we might use registered name for the search column and company number for the ID – it seems to take a while…! (For this latter example, a reconciliation service already exists at OpenCorporates.)
One problem with the approach I have taken is that the data file is mounted within the reconciliation server container. It would probably make more to sense have the RefineCSV container mount a data volume containing the CSV file, so that we can then upgrade the reconciliation server container once and then just link it to data containers. As it is, with the current model, we would have to rebuild each tagged image separately to update the reconciliation server they use.
Unfortunately, I don’t know of an easy way to package up data volume containers (an issue I’ve also come up against with database data stores). What I’d like to be able to do is have a simple “docker datahub” that I could push data volumes to, and then be able to say something like docker run -d --volumes-from psychemedia/reconciliation-data:westminster_constituency --name recon_constituencies psychemedia/reconciliation. Here, --volumes-from would look up data volume containers on something like registry.datahub.docker.com and psychemedia/reconciliation from registry.hub.docker.com.
So where’s all this going, and what next? First up, it would be useful to have an official Dockerfile that builds Michael’s Reconcile CSV server. (It would also be nice to see an example of a Python based reconciliation server – because I think I might be able to hack around with that! [UPDATE – there is one here that I forked here and dockerised here]) Secondly, I really need to find a way through the portable data volume container mess. Am I missing something obvious? Thirdly, the reconciliation server needs a bit of optimisation so it can work with larger files, a fast fuzzy match of some sort. (I also wonder whether a lite reconciliation wrapper for PostgreSQL would be useful that can leverage the PostgreSQL backend and fuzzy search plugin to publish a reconciliation service?)
And what’s the payoff? The ability to quickly fire up multiple reconciliation services against reference CSV documents.
Whilst preparing for what turned out to be a very enjoyable data at the BBC Data Data in Birmingham on Tuesday, where I ran a session on Open Refine [slides] I’d noticed that one of the transformations Open Refine supports is hashing using either MD5 or SHA-1 algorithms. What these functions essentially do is map a value, such as a name or personal identifier, on to what looks like a random number. The mapping is one way, so give the hash value of a name or personal identifier, you can’t go back to the original. (The way the algorithms work means that there is also a very slight possibility that two different original values will map on to the same hashed value which may in turn cause errors when analysing the data.)
We can generate the hash of values in a column by transforming the column using the formula md5(value) or sha1(value).
If I now save the data using the transformed (hashed) vendor name (either the SHA-1 hash or the MD5 hash), I can release the data without giving away the original vendor name, but whilst retaining the ability to identify all the rows associated with a particular vendor name.
One of the problems with MD5 and SHA-1 algorithms from a security point of view is that they run quickly. This means that a brute force attack can take a list of identifiers (or generate a list of all possible identifiers), run them through the hashing algorithm to get a set of hashed values, and then look up a hashed value to see what original identifier generated it. If the identifier is a fixed length and made from a fixed alphabet, the attacker can easily generate each possible identifier.
One way of addressing this problem is to just add salt… In cryptography, a salt (sic) is a random term that you add to a value before generating the hash value. This has the advantage that it makes life harder for an attacker trying a brute force search but is easy to implement. If we are anonymising a dataset, there are a couple of ways we can quickly generate a salt term. The strongest way to do this is to generate a column containing unique random numbers or terms as the salt column, and then hash on the original value plus the salt. A weaker way would be to use the values of one of the other columns in the dataset to generate the hash (ideally this should be a column that doesn’t get shared). Even weaker would be to use the same salt value for each hash; this is more akin to adding a password term to the original value before hashing it.
Unfortunately, in the first two approaches, if we create a unique salt for each row, this will break any requirement that a particular identifier, for example, is always encoded as the same hashed value (we need to guarantee this if we want to do analysis on all the rows associated with it, albeit with those rows identified using the hashed identifier). So when we generate the salt, we ideally want a unique random salt for each identifier, and that salt to remain consistent for any given identifier.
If you look at the list of available GREL string functions you will see a variety of methods for processing string values that we might be able to combine to generate some unique salt values, trusting that an attacker is unlikely to guess the particular combination we have used to create the salt values. In the following example, I generate a salt that is a combination of a “fingerprint” of the vendor name (which will probably, though not necessarily, be different for each vendor name, and add to it a secret fixed “password” term). This generates a consistent salt for each vendor name that is (probably) different from the salt of every other vendor name. We could add further complexity by adding a number to the salt, such as the length of the vendor name (value.length()) or the remainder of the length of the vendor name divided by some number (value.length()%7, for example, in this case using modulo 7).
Having generated a salt column (“Salt”), we can then create hash values of the original identifier and the salt value. The following shows both the value to be hashed (as a combination of the original value and the salt) and the resulting hash.
As well as masking identifiers, anonymisation strategies also typically require that items that can be uniquely identified because of their low occurrence in a dataset. For example, in an educational dataset, a particular combination of subjects or subject results might uniquely identify an individual. Imagine a case in which each student is given a unique ID, the IDs are hashed, and a set of assessment results is published containing (hashed_ID, subject, grade) data. Now suppose that only one person is taking a particular combination of subjects; that fact might then be used to identify their hashed ID from the supposedly anonymised data and associate it with that particular student.
OpenRefine may be able to help us identify possible problems in this respect by means of the faceted search tools. Whilst not a very rigorous approach, you could for example trying to query the dataset with particular combinations of facet values to see how easily you might be able to identify unique individuals. In the above example of (hashed_ID, subject, grade) data, suppose I know there is only one person taking the combination of Further Maths and Ancient Greek, perhaps because there was an article somewhere about them, although I don’t know what other subjects they are taking. If I do a text facet on the subject column and select the Further Maths and Ancient Greek values, filtering results to students taking either of those subjects, and I then create a facet on the hashed ID column, showing results by count, there would only be one hashed ID value with a count of 2 rows (one row corresponding to their Further Maths participation, the other to their participation in Ancient Greek. I can then invade that person’s privacy by searching on this hashed ID value to find out what other subjects they are taking.
Note that I am not a cryptographer or a researcher into data anonymisation techniques. To do this stuff properly, you need to talk to someone who knows how to do it properly. The technique described here may be okay if you just want to obscure names/identifiers in a dataset you’re sharing with work colleagues without passing on personal information, but it really doesn’t do much more than that.
PS A few starting points for further reading: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, Privacy, Anonymity, and Big Data in the Social Sciences and The Algorithmic Foundations of Differential Privacy.
Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).
Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.
More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)
The VivaF1 site also publishes a visualisation summarising penalty outcomes incurred by each driver:
The recreational data puzzle I set myself was this: how can we get the data contained in the descriptive sentences about the penalties into a data table that could be used to ask questions about different rule infractions, and the penalty outcomes applied, and allow for the ready generation of visualisations around that data?
The tool I opted to use was OpenRefine; and the predominant technique for getting the data out of the sentences and in to data columns? Regular expressions. (If regular expressions are new to you, think: search and replace on steroids. There’s a great tutorial that introduces the basics here: Everday text patterns.)
What follows is a worked example that shows how to get the “data” from the VivaF1 site into a form that looks more like this:
Not every row is a tidy as it could be, but there is often a trade off in tidying data between trying to automate every step, and automating steps that clean the majority of the data, leaving some rows to tidy by hand…
So where to start? The first step is getting the “data” into OpenRefine. To do this we can just select the text on the VivaF1 penatlies-by-race page, copy it an paste it in the Clipboard import area of a new project in OpenRefine:
We can then import the data as line items, ignoring blank lines:
The first step I’m going to take tidying up the data is to generate a column that contains the race name:
The expression if(value.contains('Prix'),value,'') finds the rows that have the title of the race (they all include “Grand Prix” in their name) and creates a new column containing matches. (The expression reads as follows: if the original cell value contains ‘Prix’ , copy the cell value into the corresponding cell in the new column, else copy across an empty string/nothing that is, ”) We can then Fill Down on the race column to associate each row with particular race.
We can also create another new column containing the description of each penalty notice with a quick tweak of the original expression: if(value.contains('Prix'),'',value). (If we match “Prix”, copy an empty string, else copy the penalty notice.)
One of the things that we notice is there are some notices that “Overflow” on to multiple lines:
We can filter using a regular expression that finds Penalty Notice lines that start (^) with a character that does not match a – ([^-]):
Looking at the row number, we see serval of the rows are xsecutive – we can edit thesse cells to move all the text into a single cell:
Cut and paste as required…
Looking down the row number column (left hand most column) we see that rows 19, 57 and 174 are now the overflow lines. Remove the filter an in the whole listing, scroll to the appropriate part of the data table and cut the data out of the overflow cell and paste it into the line above.
By chance, I also notice that using “Prix” to grab just race names was overly optimistic!
Here’s how we could have checked – used the facet as text option on the race column…
Phew – that was the only rogue! In line 56, cut the rogue text from the Race column and paste it into the penalty notice column. Then also paste in the additional content from the overflow lines.
Remove any filters and fill down again on the Race column to fix the earlier error…
The PEnalty Noptice column should now contain blank lines corresponding to rows that originally described the Grand Prix and overflow rows – facet the Penalty Notice column by text and highlight the blank rows so we can then delete them…
So where are we now? We have a data file with one row per penalty and columns corresponding to the Grand Prix and the penalty notice. We can now start work on pulling data out of the penalty notice sentences.
If you inspect the sentences, you will see they start with a dash, then have the driver name and the team in brackets. Let’s use a regular expression to grab that data:
Here’s the actual regular expression: - ([^\(]*)\s\(([^\)]*)\).* It reads as follows: match a – followed by a space, then grab any characters that don’t contain an open bracket ([^\(]*) and that precede a space \s followed by an open bracket \): all together ([^\(]*)\s\( That gives the driver name into the first matched pattern. Then grab the team – this is whatever appears before the first close bracket: ([^\)]*)\) Finally, match out all characters to the end of the string .*
The two matches are then joined using ::
We can then split these values to give driver and team columns:
Learning from out previous error, we can use the text facet tool on the drive and team columns just to check the values are in order – it seems like there is one oops in the driver column, so we should probably edit that cell and remove the contents.
We can also check the blank lines to see what’s happening there – in this case no driver is mentioned but a team is, but that hasn’t been grabbed into the team column, so we can edit it here:
We can also check the text facet view of the team column to make sure there are no gotchas, and pick up/correct any that did slip through.
So now we have a driver column and a team column too (it’s probably worth changing the column names to match…)
Let’s look at the data again – what else can we pull out? How about the value of any fine? We notice that fine amounts seem to appear at the end of the sentence and be preceded by the word fined, so we gan grab data on that basis, then replace the euro symbol, strip out any commas, and cast the result to a number type: value.match(/.* fined (.*)/).replace(/[€,]/,'').toNumber()
We can check for other fines by filtering the the Penalty Notice column on the word fine (or the Euro symbol), applying a number facet to the Fine column and looking for blank rows in that column.
Add in fine information by hand as required:
So now we have a column that has the value of fines – which means if we export this data we could do plots that show fines per race, or fines per driver, or fines per team, or calculate the average size of fines, or the total number of fines, for example.
What other data columns might we pull out? How about the session? Let’s look for phrases that identify free practice sessions or qualifying:
Here’s the regular expression: value.match(/.*((FP[\d]+)|(Q[\d]+)|(qualifying)).*/i).toUppercase() Note how we use the pipe symbol | to say ‘look for one pattern OR another’. We can cast everything to uppercase just to help normalise the values that appear. And once again, we can use the Text Facet to check that things worked as we expected:
So that’s a column with the session the infringement occurred in (I think! We’d need to read all the descriptions to make absolutely sure!)
What else? There’s another set of numbers appear in some of the notices – speeds. Let’s grab those into a new column – look for a space, followed by numbers or decimal points, and then a sapce and km/h, grabbing the numbers of interest and casting them to a number type:
So now we have a speed column. Which means we could start to look at speed vs fine scatterplots, perhaps, to see if there is a relationship. (Note, different pit lanes may have different speed limits.)
What else? It may be worth trying to identify the outcome of each infringement investigation?
value.match(/.*((fine)|(no further action)|([^\d]\d+.place grid.*)|(reprimand)|(drive.thr.*)|(drop of.*)|\s([^\s]+.second stop and go.*)|(start .*from .*)).*/i).toUppercase()
Here’s where we’re at now:
If we do a text facet on the outcome column, we see there are several opportunities for clustering the data:
We can try other cluster types too:
If we look at the metaphone (soundalike) clusters:
we notice a couple of other things – an opportunity to normalise 5 PLACE GRID DROP as DROP OF 5 GRID PLACES for example:
value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS')
Or we might further standardise the outcome of that by fixing on GRID POSITIONS rather than GRID PLACES:
value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS')
And we might further normalise on numbers rather than number words:
value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS').replace('TWO','2').replace('THREE','3').replace('FIVE','5').replace('TEN','10')
it might make sense to tidy out the (IN THIS CASE… statements:
value.replace(/ \(IN THIS.*/,'')
Depending on the questions we want to ask, it may be worth splitting out whether or not penalties like GRID DROPS are are this event of the next event, as well as generic penalty types (Drive through, stop and go, grid drop, etc)
Finally, let’s consider what sort of infringement has occurred:
If we create a new column from the Infraction column, we can then cluster items into the core infraction type:
After a bit of tidying, we can start to narrow down on a key set of facet levels:
Viewing down the list further there may be additional core infringements we might be able to pull out.
So here’s where we are now:
And here’s where we came from:
Having got the data into a data from, we can now start to ask questions of it (whilst possible using those conversations to retrun to the data ans tidy it more as we work with it). But that will have to be the subject of another post…
A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:
With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such view may soon feature on that site? An example of how it might work is described in James Smith’s ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.
* Though not column sortable? I guess that would detract from Github’s view of showing files as is…? For more discussion on on the rationale for a “Github for data”, see for example Rufus Pollock’s posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.
So far, so esoteric, perhaps. Because you may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare “original” data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.
One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:
Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in the case might be the diff view – that is, a view that highlights the differences between the version of the data file as it is at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff’ed data file at the current step compared to the original datafile) using a simple toggle selector.
A more elaborate approach may allow use to view diffs between the data file at the current step and the previous step, or the current data file and an arbitrary previous step.
Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:
This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?
For example, OpenRefine already integrates with Google Docs to allow users to import and export file from that service.
So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file?). Note that we might also have to extend the JSON representation, or add another file fragment to the checking, that associates a particular chops file with a particular checkout version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files their associated chops files?)
For OpenRefine to identify which file or files are the actual data files to be pulled from a particular Github repository may at first sight appear problematic, but again the ecosytem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don’t change, the datapackage file might be annotated with a reference to the appropriate chops file?)
And as far as ecosystems go, there are already other pieces of the jigsaw already in place, such as James Smith’s Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.
When working with data that is published on a monthly basis according to the same template, it is often the case that we need to apply the same data cleaning rules to the data each time a new file is released. This recipe shows to use OpenRefine to create a reusable script for cleaning data files that get published month on month according to the same schedule.
To make things a little more concrete, consider this example. Under UK transparency regulations, local councils publish spending data for amounts over £500 on a monthly basis:
To get to the actual download link for the CSV data files requires another click… (example CSV file, if you want to play along…;-)
If we want to get this data into a site such as OpenSpending.org, we need to do a little bit of tidying of the data to get it into the format that OpenSpending expects (see for example The OpenSpending data format). As each new file is released, we need to to clean it as we have cleaned the files before. If you are a confident programmer, you could write a script to handle this process. But what can the rest of us to to try to automate this process and make life a little easier.
One way is to use OpenRefine’s ability to “replay” cleaning scripts that you have generated before. Here’s an example of how to do just that…
Let’s start by loading the data in from a CSV file on the Isle of Wight local council website – we need to copy the download link URL ourselves:
The recent spending files are in a common format (that is, they have been published according to the same template), which is something we’ll be relying on, so we could load multiple files in at once into one big data file, but in this case I’m going to take each file separately.
OpenRefine makes a good guess at the file format. One minor tweak we might make is to ignore any blank lines (OpenSpending doesn’t like blank lines!).
Here’s what the data looks like once we import it:
OpenSpending expects the data to be presented in a particular way, which is why we need to clean the data a little before we can upload it. For example, OpenSpending likes column names that are free of spaces and punctuation; it requires an amount column that just contains numbers (so no commas in the number to make it more readable!); it requires dates in the format 2013-07-15 (that is, the yyyy-mm-dd format) (and I think it needs this column to be called time?).
Here’s how we can rename the columns:
Rename each column in turn, as required – for example, remove any punctuation, use camelCase (removeing spaces and using capital letters to make work boundaries), or replace spaces with underscores (_).
Let’s look at the amount column – if we select the numeric facet we can see there are lots of things not identified as numbers:
We can preview what the non-numeric values are so we can set about tidying them up…
So commas appear to be the major issue – let’s remove them by transforming cells in that column that contain a comma by removing the comma.
We can do this by replacing a comma whenever we see one with nothing (that is, an empty character string) – value.replace(',','')
Let’s see what effect that has on the numeric facet view:
Well that seems to have worked – no non-numerics any more… (We might also use the sliders in the numeric facet to check the outlying values to see if they are plausible, or look like they may be errors.)
As far as the dates go, we have dates in the form 17.04.2013 rather than 2013-04-17 so let’s transform them into the required format. However, because there is more scope for things going wrong with this transformation, let’s put the transformed data into a new column:
Here’s how we define the contents of that column:
That is: value.toDate('dd.mm.yy').toString('yyyy-mm-dd')
The first part – value.toDate('dd.mm.yy') interprets the string as a date presented in a given format, and then transforms that data into the rewquired date format: .toString('yyyy-mm-dd')
We can quickly preview that this step has worked by transforming the column to a date type:
and then preview it using a timeline facet to check that all seems in order.
(If there were non-date or error elements, we could select them and look at the original date and transformed date to see where the problem may lie.)
We don’t want the date formatting for out OpenSpending data file, so we can undo the step that transformed the data into the timeline viewable date format:
So now we have our cleaned data file. How can we apply the same steps to another month? If you look at the history tab, you will see it offers an “Extract” option – this provides a history of the change operations we applied to the dataset.
If we copy this history to a text file somewhere, we can then make use of it again and again.
To see how, open another OpenRefine project and import the data for another month (such as this one). When you have created the project, click on the Undo/Redo history tab and select Apply:
Paste in the transformation script we grabbed from the previous project:
Here’s the script I used – https://gist.github.com/psychemedia/6087946
When we apply the script, the data is cleaned using the same operations as previously:
That is, as long as the new data file is in the same format as the previous one, and only requires the same cleaning operations, we don’t really have much to do – we can just reuse the script we generated the first time we cleaned a file of this type. And each time a new file is published, we have a quick recipe for cleaning it up so we can get it into OpenSpending:-)
Finally, we can export the data for use as required…