Using OpenRefine to Clean Multiple Documents in the Same Way

When working with data that is published on a monthly basis according to the same template, it is often the case that we need to apply the same data cleaning rules to the data each time a new file is released. This recipe shows to use OpenRefine to create a reusable script for cleaning data files that get published month on month according to the same schedule.

To make things a little more concrete, consider this example. Under UK transparency regulations, local councils publish spending data for amounts over £500 on a monthly basis:

IW council transparency data

To get to the actual download link for the CSV data files requires another click… (example CSV file, if you want to play along…;-)

IW council transparency data download link

If we want to get this data into a site such as OpenSpending.org, we need to do a little bit of tidying of the data to get it into the format that OpenSpending expects (see for example The OpenSpending data format). As each new file is released, we need to to clean it as we have cleaned the files before. If you are a confident programmer, you could write a script to handle this process. But what can the rest of us to to try to automate this process and make life a little easier.

One way is to use OpenRefine’s ability to “replay” cleaning scripts that you have generated before. Here’s an example of how to do just that…

Let’s start by loading the data in from a CSV file on the Isle of Wight local council website – we need to copy the download link URL ourselves:

OPenRefine - import from CSV

The recent spending files are in a common format (that is, they have been published according to the same template), which is something we’ll be relying on, so we could load multiple files in at once into one big data file, but in this case I’m going to take each file separately.

OpenRefine makes a good guess at the file format. One minor tweak we might make is to ignore any blank lines (OpenSpending doesn’t like blank lines!).

openrefien import csv settings

Here’s what the data looks like once we import it:

preview the datai in openrefine

OpenSpending expects the data to be presented in a particular way, which is why we need to clean the data a little before we can upload it. For example, OpenSpending likes column names that are free of spaces and punctuation; it requires an amount column that just contains numbers (so no commas in the number to make it more readable!); it requires dates in the format 2013-07-15 (that is, the yyyy-mm-dd format) (and I think it needs this column to be called time?).

Here’s how we can rename the columns:

openrefien - rename column

Rename each column in turn, as required – for example, remove any punctuation, use camelCase (removeing spaces and using capital letters to make work boundaries), or replace spaces with underscores (_).

Let’s look at the amount column – if we select the numeric facet we can see there are lots of things not identified as numbers:

open refine - check amount as numeric

We can preview what the non-numeric values are so we can set about tidying them up…

Openrefine look for non-numerics

So commas appear to be the major issue – let’s remove them by transforming cells in that column that contain a comma by removing the comma.

openrefine transform

We can do this by replacing a comma whenever we see one with nothing (that is, an empty character string) – value.replace(',','')

OpenRefine  - remove comma

Let’s see what effect that has on the numeric facet view:

Open Refine amount after cleaning

Well that seems to have worked – no non-numerics any more… (We might also use the sliders in the numeric facet to check the outlying values to see if they are plausible, or look like they may be errors.)

As far as the dates go, we have dates in the form 17.04.2013 rather than 2013-04-17 so let’s transform them into the required format. However, because there is more scope for things going wrong with this transformation, let’s put the transformed data into a new column:

open refine  - add column

Here’s how we define the contents of that column:

oen refine date transfrom

That is: value.toDate('dd.mm.yy').toString('yyyy-mm-dd')

The first part – value.toDate('dd.mm.yy') interprets the string as a date presented in a given format, and then transforms that data into the rewquired date format: .toString('yyyy-mm-dd')

We can quickly preview that this step has worked by transforming the column to a date type:

open refine trasnform to date

and then preview it using a timeline facet to check that all seems in order.

OPenrefien timeline facet

(If there were non-date or error elements, we could select them and look at the original date and transformed date to see where the problem may lie.)

We don’t want the date formatting for out OpenSpending data file, so we can undo the step that transformed the data into the timeline viewable date format:

openrefine history

So now we have our cleaned data file. How can we apply the same steps to another month? If you look at the history tab, you will see it offers an “Extract” option – this provides a history of the change operations we applied to the dataset.

If we copy this history to a text file somewhere, we can then make use of it again and again.

open refine extract

To see how, open another OpenRefine project and import the data for another month (such as this one). When you have created the project, click on the Undo/Redo history tab and select Apply:

open refine  - appply previous transformations

Paste in the transformation script we grabbed from the previous project:

open refne - extract reuse

Here’s the script I used – https://gist.github.com/psychemedia/6087946

When we apply the script, the data is cleaned using the same operations as previously:

open refine cleanded as if by magic

That is, as long as the new data file is in the same format as the previous one, and only requires the same cleaning operations, we don’t really have much to do – we can just reuse the script we generated the first time we cleaned a file of this type. And each time a new file is published, we have a quick recipe for cleaning it up so we can get it into OpenSpending:-)

Finally, we can export the data for use as required…

open refine export data

Magic:-)

Diff or Chop? Github, CSV data files and OpenRefine

A recent post on the OKFNLabs blog – Diffing and patching tabular data – proposes a visualisation scheme (and some associated tooling) for comparing the differences between two tabular data/CSV files:

csv diff

With Github recently announcing that tabular CSV and TSV files are now previewable as such via a searchable* rendering of the data, I wonder if such view may soon feature on that site? An example of how it might work is described in James Smith’s ODI blogpost Adapting Git for simple data, which also has a recipe for diffing CSV data files in Github as it currently stands.

* Though not column sortable? I guess that would detract from Github’s view of showing files as is…? For more discussion on on the rationale for a “Github for data”, see for example Rufus Pollock’s posts Git and Github for Data and We Need Distributed Revision/Version Control for Data.

So far, so esoteric, perhaps. Because you may be wondering why exactly anyone would want to look at the differences between two data files? One reason may be to compare “original” data sets with data tables that are ostensibly copies of them, such as republications of open datasets held as local copies to support data journalism or watchdog activities. Another reason may be as a tool to support data cleaning activities.

One of my preferred tools for cleaning tabular datasets is OpenRefine. One of the nice features of OpenRefine is that it keeps a history of the changes you have made to a file:

openrefine history

Selecting any one of these steps allows you to view the datafile as it stands at that step. Another way of looking at the data file in the case might be the diff view – that is, a view that highlights the differences between the version of the data file as it is at the current step compared to the original datafile. We might be able to flip between these two views (data file as it is at the current step, versus diff’ed data file at the current step compared to the original datafile) using a simple toggle selector.

A more elaborate approach may allow use to view diffs between the data file at the current step and the previous step, or the current data file and an arbitrary previous step.

Another nice feature of OpenRefine is that it allows you to export a JSON description of the change operations (“chops”?;-) applied to the file:

open refine extract

This is a different way of thinking about changes. Rather than identifying differences between two data files by comparing their contents, all we need is a single data file and the change operation history. Then we can create the diff-ed file from the original by applying the specified changes to the original datafile. We may be some way away from an ecosystem that allows us to post datafiles and change operation histories to a repository and then use those as a basis for comparing versions of a datafile, but maybe there are a few steps we can take towards making better use of OpenRefine in a Github context?

For example, OpenRefine already integrates with Google Docs to allow users to import and export file from that service.

OPen Refine export to google

So how about if OpenRefine were able to check out a CSV file from Github (or use gists) and then check it back in, with differences, along with a chops file (that is, the JSON representation of the change operations applied to the original data file?). Note that we might also have to extend the JSON representation, or add another file fragment to the checking, that associates a particular chops file with a particular checkout version of the data file it was applied to. (How is an OpenRefine project file structured, I wonder? Might this provide some clues about ways of managing versions of data files their associated chops files?)

For OpenRefine to identify which file or files are the actual data files to be pulled from a particular Github repository may at first sight appear problematic, but again the ecosytem approach may be able to help us. If data files that are available in a particular Github repository are identified via a data package description file, an application such as OpenRefine could access this metadata file and then allow users to decide which file it is they want to pull into OpenRefine. Pushing a changed file should also check in the associated chops history file. If the changed file is pushed back with the same filename, all well and good. If the changed file is pushed back with a different name then OpenRefine could also push back a modified data package file. (I guess even if filenames don’t change, the datapackage file might be annotated with a reference to the appropriate chops file?)

And as far as ecosystems go, there are already other pieces of the jigsaw already in place, such as James Smith’s Git Data Viewer (about), which allows you to view data files described via a datapackage descriptor file.

Creating Data from Text – Regular Expressions in OpenRefine

Although data can take many forms, when generating visualisations, running statistical analyses, or simply querying the data so we can have a conversation with it, life is often made much easier by representing the data in a simple tabular form. A typical format would have one row per item and particular columns containing information or values about one specific attribute of the data item. Where column values are text based, rather than numerical items or dates, it can also help if text strings are ‘normalised’, coming from a fixed, controlled vocabulary (such as items selected from a drop down list) or fixed pattern (for example, a UK postcode in its ‘standard’ form with a space separating the two parts of the postcode).

Tables are also quick to spot as data, of course, even if they appear in a web page or PDF document, where we may have to do a little work to get the data as displayed into a table we can actually work with in a spreadsheet or analysis package.

More often than not, however, we come across situations where a data set is effectively encoded into a more rambling piece of text. One of the testbeds I used to use a lot for practising my data skills was Formula One motor sport, and though I’ve largely had a year away from that during 2013, it’s something I hope to return to in 2014. So here’s an example from F1 of recreational data activity that provided a bit of entertainment for me earlier this week. It comes from the VivaF1 blog in the form of a collation of sentences, by Grand Prix, about the penalties issued over the course of each race weekend. (The original data is published via PDF based press releases on the FIA website.)

Viva F1 - penalties - messy data

The VivaF1 site also publishes a visualisation summarising penalty outcomes incurred by each driver:

VIVA F1 - DISPLAY OF PENALTIES

The recreational data puzzle I set myself was this: how can we get the data contained in the descriptive sentences about the penalties into a data table that could be used to ask questions about different rule infractions, and the penalty outcomes applied, and allow for the ready generation of visualisations around that data?

The tool I opted to use was OpenRefine; and the predominant technique for getting the data out of the sentences and in to data columns? Regular expressions. (If regular expressions are new to you, think: search and replace on steroids. There’s a great tutorial that introduces the basics here: Everday text patterns.)

What follows is a worked example that shows how to get the “data” from the VivaF1 site into a form that looks more like this:

data types

Not every row is a tidy as it could be, but there is often a trade off in tidying data between trying to automate every step, and automating steps that clean the majority of the data, leaving some rows to tidy by hand…

So where to start? The first step is getting the “data” into OpenRefine. To do this we can just select the text on the VivaF1 penatlies-by-race page, copy it an paste it in the Clipboard import area of a new project in OpenRefine:

Paste in the data

We can then import the data as line items, ignoring blank lines:

import as line based files

The first step I’m going to take tidying up the data is to generate a column that contains the race name:

Pull out race name

The expression if(value.contains('Prix'),value,'') finds the rows that have the title of the race (they all include “Grand Prix” in their name) and creates a new column containing matches. (The expression reads as follows: if the original cell value contains ‘Prix’ , copy the cell value into the corresponding cell in the new column, else copy across an empty string/nothing that is, ”) We can then Fill Down on the race column to associate each row with particular race.

We can also create another new column containing the description of each penalty notice with a quick tweak of the original expression: if(value.contains('Prix'),'',value). (If we match “Prix”, copy an empty string, else copy the penalty notice.)

Copy the penalty notice

One of the things that we notice is there are some notices that “Overflow” on to multiple lines:

Missed lines...

We can filter using a regular expression that finds Penalty Notice lines that start (^) with a character that does not match a – ([^-]):

Find rows that donl;t start with a -

Looking at the row number, we see serval of the rows are xsecutive – we can edit thesse cells to move all the text into a single cell:

OpenRefine edit

Cut and paste as required…

edit a cell

Looking down the row number column (left hand most column) we see that rows 19, 57 and 174 are now the overflow lines. Remove the filter an in the whole listing, scroll to the appropriate part of the data table and cut the data out of the overflow cell and paste it into the line above.

Cut and append

By chance, I also notice that using “Prix” to grab just race names was overly optimistic!

overly agressive

Here’s how we could have checked – used the facet as text option on the race column…

Facet on race name

Phew – that was the only rogue! In line 56, cut the rogue text from the Race column and paste it into the penalty notice column. Then also paste in the additional content from the overflow lines.

Tidy up the big overflow.

Remove any filters and fill down again on the Race column to fix the earlier error…

The PEnalty Noptice column should now contain blank lines corresponding to rows that originally described the Grand Prix and overflow rows – facet the Penalty Notice column by text and highlight the blank rows so we can then delete them…

prune blank rows

So where are we now? We have a data file with one row per penalty and columns corresponding to the Grand Prix and the penalty notice. We can now start work on pulling data out of the penalty notice sentences.

If you inspect the sentences, you will see they start with a dash, then have the driver name and the team in brackets. Let’s use a regular expression to grab that data:

value.match(/- ([^\(]*)\s\(([^\)]*)\).*/).join('::')

Drivername grab

Here’s the actual regular expression: - ([^\(]*)\s\(([^\)]*)\).* It reads as follows: match a – followed by a space, then grab any characters that don’t contain an open bracket ([^\(]*) and that precede a space \s followed by an open bracket \): all together ([^\(]*)\s\( That gives the driver name into the first matched pattern. Then grab the team – this is whatever appears before the first close bracket: ([^\)]*)\) Finally, match out all characters to the end of the string .*

The two matches are then joined using ::

drivenae captured

We can then split these values to give driver and team columns:

driveteam split

Learning from out previous error, we can use the text facet tool on the drive and team columns just to check the values are in order – it seems like there is one oops in the driver column, so we should probably edit that cell and remove the contents.

facet on name

We can also check the blank lines to see what’s happening there – in this case no driver is mentioned but a team is, but that hasn’t been grabbed into the team column, so we can edit it here:

tweak team

We can also check the text facet view of the team column to make sure there are no gotchas, and pick up/correct any that did slip through.

So now we have a driver column and a team column too (it’s probably worth changing the column names to match…)

team and driver

Let’s look at the data again – what else can we pull out? How about the value of any fine? We notice that fine amounts seem to appear at the end of the sentence and be preceded by the word fined, so we gan grab data on that basis, then replace the euro symbol, strip out any commas, and cast the result to a number type: value.match(/.* fined (.*)/)[0].replace(/[€,]/,'').toNumber()

pull out fines

We can check for other fines by filtering the the Penalty Notice column on the word fine (or the Euro symbol), applying a number facet to the Fine column and looking for blank rows in that column.

tidy up the fines

Add in fine information by hand as required:

edit number

So now we have a column that has the value of fines – which means if we export this data we could do plots that show fines per race, or fines per driver, or fines per team, or calculate the average size of fines, or the total number of fines, for example.

What other data columns might we pull out? How about the session? Let’s look for phrases that identify free practice sessions or qualifying:

Session extraction

Here’s the regular expression: value.match(/.*((FP[\d]+)|(Q[\d]+)|(qualifying)).*/i)[0].toUppercase() Note how we use the pipe symbol | to say ‘look for one pattern OR another’. We can cast everything to uppercase just to help normalise the values that appear. And once again, we can use the Text Facet to check that things worked as we expected:

facet session

So that’s a column with the session the infringement occurred in (I think! We’d need to read all the descriptions to make absolutely sure!)

What else? There’s another set of numbers appear in some of the notices – speeds. Let’s grab those into a new column – look for a space, followed by numbers or decimal points, and then a sapce and km/h, grabbing the numbers of interest and casting them to a number type:

value.match(/.*\s([\d\.]+) km\/h.*/)[0].toNumber()

Speeding

So now we have a speed column. Which means we could start to look at speed vs fine scatterplots, perhaps, to see if there is a relationship. (Note, different pit lanes may have different speed limits.)

What else? It may be worth trying to identify the outcome of each infringement investigation?

value.match(/.*((fine)|(no further action)|([^\d]\d+.place grid.*)|(reprimand)|(drive.thr.*)|(drop of.*)|\s([^\s]+.second stop and go.*)|(start .*from .*)).*/i)[0].toUppercase()

Outcome grab

Here’s where we’re at now:

useful data so far

If we do a text facet on the outcome column, we see there are several opportunities for clustering the data:

Facet on outcome and cluster

We can try other cluster types too:

other clusters

If we look at the metaphone (soundalike) clusters:

other opportunities for clustering

we notice a couple of other things – an opportunity to normalise 5 PLACE GRID DROP as DROP OF 5 GRID PLACES for example:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS')

Or we might further standardise the outcome of that by fixing on GRID POSITIONS rather than GRID PLACES:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS')

And we might further normalise on numbers rather than number words:

value.replace('-',' ').replace(/(\d+) PLACE GRID DROP/,'DROP OF $1 GRID POSITIONS').replace('GRID PLACES','GRID POSITIONS').replace('TWO','2').replace('THREE','3').replace('FIVE','5').replace('TEN','10')

Clustering again:

further cleaning - in this case

it might make sense to tidy out the (IN THIS CASE… statements:

value.replace(/ \(IN THIS.*/,'')

Depending on the questions we want to ask, it may be worth splitting out whether or not penalties like GRID DROPS are are this event of the next event, as well as generic penalty types (Drive through, stop and go, grid drop, etc)

exaple of outcome penalty types

Finally, let’s consider what sort of infringement has occurred:

infraction type

If we create a new column from the Infraction column, we can then cluster items into the core infraction type:

cluster core infraction

After a bit of tidying, we can start to narrow down on a key set of facet levels:

example core infringement

Viewing down the list further there may be additional core infringements we might be able to pull out.

So here’s where we are now:

data types

And here’s where we came from:

Viva F1 - penalties - messy data

Having got the data into a data from, we can now start to ask questions of it (whilst possible using those conversations to retrun to the data ans tidy it more as we work with it). But that will have to be the subject of another post…

First Baby Steps in Anonymising Data With Open Refine

Whilst preparing for what turned out to be a very enjoyable data at the BBC Data Data in Birmingham on Tuesday, where I ran a session on Open Refine [slides] I’d noticed that one of the transformations Open Refine supports is hashing using either MD5 or SHA-1 algorithms. What these functions essentially do is map a value, such as a name or personal identifier, on to what looks like a random number. The mapping is one way, so give the hash value of a name or personal identifier, you can’t go back to the original. (The way the algorithms work means that there is also a very slight possibility that two different original values will map on to the same hashed value which may in turn cause errors when analysing the data.)

We can generate the hash of values in a column by transforming the column using the formula md5(value) or sha1(value).

openrefineHash

If I now save the data using the transformed (hashed) vendor name (either the SHA-1 hash or the MD5 hash), I can release the data without giving away the original vendor name, but whilst retaining the ability to identify all the rows associated with a particular vendor name.

One of the problems with MD5 and SHA-1 algorithms from a security point of view is that they run quickly. This means that a brute force attack can take a list of identifiers (or generate a list of all possible identifiers), run them through the hashing algorithm to get a set of hashed values, and then look up a hashed value to see what original identifier generated it. If the identifier is a fixed length and made from a fixed alphabet, the attacker can easily generate each possible identifier.

One way of addressing this problem is to just add salt… In cryptography, a salt (sic) is a random term that you add to a value before generating the hash value. This has the advantage that it makes life harder for an attacker trying a brute force search but is easy to implement. If we are anonymising a dataset, there are a couple of ways we can quickly generate a salt term. The strongest way to do this is to generate a column containing unique random numbers or terms as the salt column, and then hash on the original value plus the salt. A weaker way would be to use the values of one of the other columns in the dataset to generate the hash (ideally this should be a column that doesn’t get shared). Even weaker would be to use the same salt value for each hash; this is more akin to adding a password term to the original value before hashing it.

Unfortunately, in the first two approaches, if we create a unique salt for each row, this will break any requirement that a particular identifier, for example, is always encoded as the same hashed value (we need to guarantee this if we want to do analysis on all the rows associated with it, albeit with those rows identified using the hashed identifier). So when we generate the salt, we ideally want a unique random salt for each identifier, and that salt to remain consistent for any given identifier.

If you look at the list of available GREL string functions you will see a variety of methods for processing string values that we might be able to combine to generate some unique salt values, trusting that an attacker is unlikely to guess the particular combination we have used to create the salt values. In the following example, I generate a salt that is a combination of a “fingerprint” of the vendor name (which will probably, though not necessarily, be different for each vendor name, and add to it a secret fixed “password” term). This generates a consistent salt for each vendor name that is (probably) different from the salt of every other vendor name. We could add further complexity by adding a number to the salt, such as the length of the vendor name (value.length()) or the remainder of the length of the vendor name divided by some number (value.length()%7, for example, in this case using modulo 7).

operefinesalt

Having generated a salt column (“Salt”), we can then create hash values of the original identifier and the salt value. The following shows both the value to be hashed (as a combination of the original value and the salt) and the resulting hash.

openrefinesalthash

As well as masking identifiers, anonymisation strategies also typically require that items that can be uniquely identified because of their low occurrence in a dataset. For example, in an educational dataset, a particular combination of subjects or subject results might uniquely identify an individual. Imagine a case in which each student is given a unique ID, the IDs are hashed, and a set of assessment results is published containing (hashed_ID, subject, grade) data. Now suppose that only one person is taking a particular combination of subjects; that fact might then be used to identify their hashed ID from the supposedly anonymised data and associate it with that particular student.

OpenRefine may be able to help us identify possible problems in this respect by means of the faceted search tools. Whilst not a very rigorous approach, you could for example trying to query the dataset with particular combinations of facet values to see how easily you might be able to identify unique individuals. In the above example of (hashed_ID, subject, grade) data, suppose I know there is only one person taking the combination of Further Maths and Ancient Greek, perhaps because there was an article somewhere about them, although I don’t know what other subjects they are taking. If I do a text facet on the subject column and select the Further Maths and Ancient Greek values, filtering results to students taking either of those subjects, and I then create a facet on the hashed ID column, showing results by count, there would only be one hashed ID value with a count of 2 rows (one row corresponding to their Further Maths participation, the other to their participation in Ancient Greek. I can then invade that person’s privacy by searching on this hashed ID value to find out what other subjects they are taking.

Note that I am not a cryptographer or a researcher into data anonymisation techniques. To do this stuff properly, you need to talk to someone who knows how to do it properly. The technique described here may be okay if you just want to obscure names/identifiers in a dataset you’re sharing with work colleagues without passing on personal information, but it really doesn’t do much more than that.

PS A few starting points for further reading: Broken Promises of Privacy: Responding to the Surprising Failure of Anonymization, Privacy, Anonymity, and Big Data in the Social Sciences and The Algorithmic Foundations of Differential Privacy.

OpenRefine Style Reconciliation Containers

Over the weekend, I rediscovered Michael Bauer/@mihi_tr’s (RIP) Reconcile CSV [code] service that builds an OpenRefine reconciliation service on top of a CSV file. One column in the CSV file contains a list of values that you want to reconcile (that is, fuzzy match) against, the other is a set of key identifier values associated with the matched against value.

Having already popped OpenRefine into a docker container, I thought I’d also explore dockerising Michael’s service: docker-reconciliation.

The default container uses a CSV file of UK MP names (current and previous) and returns their full title and an identifier used in the UK Parliament Members’ names data platform.

To run service in boot2docker:

  • docker run -p 8002:8000 --name mprecon -d psychemedia/docker-reconciliation
  • boot2docker ip to get the IP address the service is running on, eg 192.168.59.103
  • Test the service in your browser: http://192.168.59.103:8002/reconcile?query=David Cameroon

In OpenRefine, set the reconciliation service URL to http://192.168.59.103:8002/reconcile.

NOTE: I had thought I should be able to fire up linked OpenRefine and ReconcileCSV containers and address more conveniently, for example:

docker run --name openrefiner -p 3335:3333 --link mprecon:mprecon -d psychemedia/openrefine

and then setting something like http://mprecon:8000/reconcile as the reconciliation service endpoint, but that didn’t seem to work? Instead I had to use the endpoint routed to host (http://192.168.59.103:8002/reconcile).

I also added some command line parameters to the script so that you can fire up the container and reconcile against your own CSV file:

docker run -p 8003:8000 -v /path/to/:/tmp/import -e RECONFILE=myfile.csv -e SEARCHCOL=mysearchcol -e IDCOL=myidcol --name recon_mycsv -d psychemedia/docker-reconciliation

This loads in the file on your host computer at /path/to/myfule.csv using the column named mysearchcol for the search/fuzzy match values and the column named myidcol for the identifiers.

It struck me that I could then commit this customised container as a docker image, and push it to dockerhub as a tagged image. Permissions mean I can’t push to the original trusted/managed repository that builds containers from my github repo, but I can create a new dockerhub repository containing tagged images. For example:

docker commit recon_mycsv psychemedia/docker-reconciler:recon_mycsv
docker push psychemedia/docker-reconciler:recon_mycsv

This means I can collect a whole range of reconciliation services, each independently tagged, at psychemedia/docker-reconciler – tags.

So for example:

  • docker run --name reconcile_ukmps -p 8008:8000 -d psychemedia/docker-reconciler:ukmps_pastpresent runs a reconciliation service agains UK past and present MPs on port 8008;
  • docker run --name reconcile_westminster -p 8009:8000 -d psychemedia/docker-reconciler:westminster_constituency runs a reconciliation service against Westminster constituencies on port 8009.

In practice the current reconciliation service only seems to work well on small datasets, up to a few thousand lines, but nonetheless it can still be useful to be able to reconcile against such datasets. For larger files – such as the UK Companies House register, where we might use registered name for the search column and company number for the ID – it seems to take a while…! (For this latter example, a reconciliation service already exists at OpenCorporates.)

One problem with the approach I have taken is that the data file is mounted within the reconciliation server container. It would probably make more to sense have the RefineCSV container mount a data volume containing the CSV file, so that we can then upgrade the reconciliation server container once and then just link it to data containers. As it is, with the current model, we would have to rebuild each tagged image separately to update the reconciliation server they use.

Unfortunately, I don’t know of an easy way to package up data volume containers (an issue I’ve also come up against with database data stores). What I’d like to be able to do is have a simple “docker datahub” that I could push data volumes to, and then be able to say something like docker run -d --volumes-from psychemedia/reconciliation-data:westminster_constituency --name recon_constituencies psychemedia/reconciliation. Here, --volumes-from would look up data volume containers on something like registry.datahub.docker.com and psychemedia/reconciliation from registry.hub.docker.com.

So where’s all this going, and what next? First up, it would be useful to have an official Dockerfile that builds Michael’s Reconcile CSV server. (It would also be nice to see an example of a Python based reconciliation server – because I think I might be able to hack around with that! [UPDATE – there is one here that I forked here and dockerised here]) Secondly, I really need to find a way through the portable data volume container mess. Am I missing something obvious? Thirdly, the reconciliation server needs a bit of optimisation so it can work with larger files, a fast fuzzy match of some sort. (I also wonder whether a lite reconciliation wrapper for PostgreSQL would be useful that can leverage the PostgreSQL backend and fuzzy search plugin to publish a reconciliation service?)

And what’s the payoff? The ability to quickly fire up multiple reconciliation services against reference CSV documents.

Lazy Regular Expressions – Splitting Out Collapsed Columns

Via a tweet, and then an email, to myself and fellow OpenRefine evengelist, Owen Stephens (if you haven’t already done so, check out Owen’s wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was having to contend with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report. So something rather like this:

TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project

The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?

The fields that appeared in this combined column were variable, (not all of them appeared in each row) but always in the same order. So for example, a total collapsed record might look like:

Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:

The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project

The pattern Appeared to be Column Name: value exept for the Title where there was no colon.

On occasion, a row would contain an exceptional item that did not conform to the pattern:

ROGUE CODE

One way to split out the columns is to use a regular expression. We can parse a column using the “Add column based on this column” action:

regex1

If all the columns always appeared in the same order, we could write something like the following GREL regular expression to match each column and it’s associated value:

value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)

regex2

To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.

value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)

regex2a

However, as the above example shows, using the greedy .* operator means we match everything in the first group. So instead, we need to use a lazy evaluation to match items within a group: .+?

value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

regex3

So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we need to look for not the expected first pattern in our list:

value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

regex4

Having matched groups, how do we split the relevant items into news columns. One way is to introduce a column separator character sequence (such as ::) that we can split on:

forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')

regex5a

This generates rows of the form:

regex6

We can now split these cells into several columns:

regex7

We use the :: sequence as the separator:

regex8

Once split, the columns should be regularly arranged. For “rogue” items, they should appear in the first new column – any values appearing in the column might be used to help us identify any further tweaks required to our regular expression.

regex9

We now need to do a little more cleaning. For example, tidying up column names:

regex10

And then cleaning down each new column to remove the column heading.

regex11

As a general pattern, use the column name and an optional colon (NOTE: expression should be :? rather than :+):

regex12

To reuse this pattern of operations on future datasets, we can export a description of the transformations applied. Future datasets can then be loaded in to OpenRefine, the operation history pasted in, and the same steps applied. (The following screenshot does not show the operation defined for renaming the new columns or cleaning down them.)

regex13

As ever, writing up this post took as long as working out the recipe…

PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp foo is up to that just at the moment. Any offers?!;-)

IBM DataScientistWorkBench = OpenRefine + RStudio + Jupyter Notebooks in the Cloud, Via Your Browser

One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, Jupyter notebook server, Shiny server, OpenRefine, PostgreSQL and MongDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.

(Actually, I think I had a script that was more or less there for chunks of that when I was looking at a docker solution for the databases courses, but that fell by the way side and I suspect the the Jupyter container (IPython notebook server, as was), probably needs a fair bit of updating by now. And I’ve no time or mental energy to look at it right now…:-(

Anyway, the IBM Data Scientist Workbench now sits alongside things like KMis longstanding KMi Crunch Learning Analytics Environment (RStudio + MySQL), and the Australian ResBaz Cloud – Containerised Research Apps Service in my list of why the heck can’t we get our act together to offer this sort of SaaS thing to learners? And yes I know there are cost applications…. but, erm, sponsorship, cough… get-started tokens then PAYG, cough…

It currently offers access to personal persistent storage and the ability to launch OpenRefine, RStudio and Jupyter notebooks:

Data_Scientist_Workbench

The toolbar also suggest that the ability to “discover” pre-identified data sources and run pre-configured modeling tools is also on the cards.

The applications themselves run off a subdomain tied to your account – and of course, they’re all available through the browser…

RStudio_and_OpenRefine_and_Data_Scientist_Workbench_and_Edit_Post_‹_OUseful_Info__the_blog____—_WordPress

So what’s next? I’d quite like to see ‘data import packs’ that would allow me to easily pull in data from particular sources, such as the CDRC, and quickly get started working with the data. (And again: yes, I know, I could start doing that anyway… maybe when I get round to actually doing something with isleofdata.com ?!;-)

See also these recipes for running app containers on Digital Ocean via Tutum: RStudio, Shiny server, OpenRefine and OpenRefine reconciliation services, and these Seven Ways of Running IPython / Jupyter Notebooks.

Quick Notes – OpenRefine – Working With Databases

It seems as if the latest version of OpenRefine – v. 3.0 – is out in beta ([repo]) and it looks at if some integration with databases is provided.

On the way in, you can connect to a database and import data from it into OpenRefine directly:

Looks interesting, though part of me wonders “why?”: if the data is in the database then presumably you’ve started to clean it anyway. One of the things I’ve realised working on some revisions for our TM351 data management and analysis course is that you can use table constraints as part of a data cleaning workflow to flag up errors in your data. It could be interesting to imagine how constraint violation errors might be used to support data cleaning activities in an application such as OpenRefine. (In a sense, you might think of applying facets to a column as the application of rules or constraints that can identify rows containing data that violates them.)

On the way out is the option to export a project as a SQL file. This includes defining column types:

and also configuring other aspects of the SQL export file, such as exporting table definitions, either with or without the data, and optionally including DROP and IF EXISTS statements:

Here’s an example of the sort of thing it can generate:

DROP TABLE IF EXISTS clipboard;
CREATE TABLE clipboard (
a INT NULL,
b VARCHAR(12) NULL,
c INT NULL
);
INSERT INTO clipboard (a,b,c) VALUES
( 1,'e',1 ),
( 2,'w',3 ),
( 2,'tg',56 )

Going back to the notion of constraints, if an exportable SQL table is defined with particular column types, it would be handy to be able to use the type definitions to check the data being exported. For example, having set the types, apply those types as exclusive facet rules on the relevant columns and then raise an alert and/or preview the violating rows if any rows are returned that break the type rules.

If such checking was possible, you could start to imagine round tripping on the data, loading in a table definition and using it to check some aspects of a dataset loaded alongside it (such as data types, uniqueness etc.) Which reminds me, the SQL exporter doesn’t allow seem to support UNIQUE or PRIMARY KEY definitions?

Running OpenRefine On Digital Ocean Using Simple Auth

As a cloud host, Digital Ocean provides a really easy way in to getting services up and running on the web.

Here’s a quick recipe for getting Open Refine up and running behind a simple authentication scheme.

Creating a Server With Digital Ocean

First up, create a Digital Ocean account if you don’t already have one (this link will get you started with $100 free credit).

Creating and launching a server is easy… Select Create then Droplet, choose the server type you want — let’s use a simple Ubuntu box — and choose the server size. For lots of quick tests, I use the smallest box, but from experience I think OpenRefine prefers MORE THAN 2GB or more

Really… If you pick a 2GB server, you may find that OpenRefine hangs on start and ruins your day when you end up trying to debug what you think other other problems…. Be warned, kids… stay unfrustrated out there…

The servers are charged at a metered rate, and you can stop them any time, so for a quick test, it’ll cost you pennies… (A $100 credit can go a long way…!)

Next, choose a data center region; I generally pick a local one…

You also have the option of adding an ssh key. This makes life much easier when trying to log in to the server from your own machine using ssh (you can just run ssh root@IP_ADDRESS and it’ll log you straight in; there’s a recipe for setting up an SSH key here).

If you don’t want to set up an SSH, a root password will be emailed to you. You can use this password to log in to your server via a web terminal, which means you can do everything via a web UI if you need to…

Create your server by clicking the big green button…

It should only take a few seconds to start up… And when it has, you’ll be presented with it’s public IP address.

If you need a web terminal, click through the on the server name, and you should see a link to launch a web console.

Installing  OpenRefine

From the console, we can install all we need to run OpenRefine. This is a minimum viable example — we should probably find a better place to install OpenRefine, and may want to run it as a particular user with limited permissions. Working as root with everything wide open makes life easier, though not necessarily safer…!

OpenRefine requires a Java environment, so we need to install that:

apt-get update && apt-get install -y openjdk-8-jre

We can download the OpenRefine application via the command-line using a command of the form wget -q -O DOWNLOADED_FILE_NAME URL; the download link for each release can be found on the OpenRefine releases page:

wget -q -O openrefine-2.8.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/2.8/openrefine-linux-2.8.tar.gz

The downloaded file is provided as a tar archive file, which we need to unpack:

tar xzf openrefine-2.8.tar.gz

This will unbundle the files into the directory ./openrefine-2.8.

Let’s create an alias for a working directory in which to place the OpenRefine project files:

OPENREFINE_DIR="$HOME/openrefine"

And create that directory:

mkdir -p $OPENREFINE_DIR

You should now be able to run OpenRefine in the background using the command:

nohup openrefine-2.8/refine -p 3333 -d OPENREFINE_DIR > /dev/null 2>&1 &

This will run OpenRefine on port 3333. If you copy the IP address of your Digital Ocean server, and go to http://IP.ADDRESS:3333, you should see OpenRefine running there. (Note that if you’re in Chrome, Google may well tell you that the address is dangerous…)

Adding Simple Authentication

The OpenRefine server is running as a public service on the public web. If you want to add a simple layer of authentication, you can add a web server proxy to the server that will prompt for a password when a new visit is made to the server.

One of the easiest proxies to get up and running is nginx. Let’s install it, along with a simple Apache toolkit that will help us create a simple password:

apt-get install -y nginx apache2-utils

Now create a simple user/password combination. Ever secure, I’ll go with user test and password letmein:

htpasswd -b -c /etc/nginx/.htpasswd test letmein

Now we need to define the proxy. A Digital Ocean tutorial (How To Install Nginx on Ubuntu 18.04) describes how to set up a firewall – I’m selecting the 'Nginx Full' option because I’m working via SSH, but if you’re working in the web terminal, the more restrictive 'Nginx HTTP' may be more appropriate:

sudo ufw allow 'Nginx Full'

If you try loading the OpenRefine server on port 3333, you should now find that it’s blocked: the firewall is only letting web traffic through on port 80.

We now need to open access back up to the OpenRefine server, albeit via a password challenge. The following will create a default nginx configuration file that will expose the OpenRefine service running on port 3333 via default http port 80, mediated by a simple authorisation challenge:

config='''
server {
  listen 80;
  auth_basic Protected...;
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://127.0.0.1:3333;
  }
}
'''

echo "$config" > /etc/nginx/sites-available/default

Restart the nginx proxy to put the new settings into effect:

nginx -s reload

If you now go to http://IPADDRESS you should be presented with a challenge. Enter the credentials you defined, and you should see your OpenRefine server:

Finishing Up

When you’ve finished your session, you can destroy the droplet. This will tear the server down and you won’t be billed for it anymore.

Alternatively, you can switch the droplet off, but keep it in a shutdown state that you can restart in the future.

However, as the above prompt suggests, you will continue to be billed, even if the service is not running, because it is still consuming Digital Ocean resources.

Authenticated OpenRefine Server on Digital Ocean, Redux

Following on from yesterday’s recipe showing how to Running OpenRefine On Digital Ocean Under Simple Auth, here’s an even easier way that doesn’t require ssh and doesn’t require console access: just copy and paste some set-up info into a form on the Digital Ocean droplet creation page.

Every little everything helps… this link will set new users up with $100 of free credit on Digital Ocean; somewhere down the line I may get a small amount of affiliate link powered Digital Ocean credit to help keep my own server costs covered…

The Digital Ocean droplet creator has an option to add a User data start-up script when the droplet is created (docs).

This means we can provide a script that will download and install everything we need, including a file to define a service that will run OpenRefine.

Copy and paste the script in the gist that can be found here into the user data area before you create the droplet. The code will be executed once the droplet is up and running and should install the nginx proxy and the OpenRefine application. A default user (test) and password (letmein) are defined in script.

If you are trusting of the gist, you can install it more succinctly from the gist repository. You can also define your own user and password. To take this installation route, use the code that follows below in the userdata area:

Here’s the code…:

#!/bin/bash

#We can over-ride the default credentials
USER_NAME=testuser
USER_PWD=testpwd

#Get the latest version of installer script and run it

source <( curl -s https://gist.githubusercontent.com/psychemedia/f256960c112347dd410c2beec8ce05e3/raw/ )

(There should be no spaces between the less than and open bracket characters in the source command.)

For an explicit link to a particlar version of the script, go to the gist, and copy the link for the raw version of the latest file or the version you want.

Note that this route requires you to place considerable trust in the publisher of the remote installation script. Do you really trust a file with such a dodgy filename?

It will probably take two or three minutes to download and install everything, so don’t be too worried if you don’t see anything at the provided IP address immediately. Just keep refreshing the page…

The first sign of life you should see is the nginx default page…

Wait a few moments and reload the page… Next up is a holding page as the OpenRefine application is installed and the service started…

This page should automatically refresh itself every 5 seconds or so. When the service is up and running, you should get a page like this:

Click the link and you should be prompted with the the user/password authenticator challenge. Provide the username/password and you should be taken to your OpenRefine server.