Lazy Regular Expressions – Splitting Out Collapsed Columns

Via a tweet, and then an email, to myself and fellow OpenRefine evengelist, Owen Stephens (if you haven’t already done so, check out Owen’s wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was having to contend with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report. So something rather like this:

TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project

The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?

The fields that appeared in this combined column were variable, (not all of them appeared in each row) but always in the same order. So for example, a total collapsed record might look like:

Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:

The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project

The pattern Appeared to be Column Name: value exept for the Title where there was no colon.

On occasion, a row would contain an exceptional item that did not conform to the pattern:

ROGUE CODE

One way to split out the columns is to use a regular expression. We can parse a column using the “Add column based on this column” action:

regex1

If all the columns always appeared in the same order, we could write something like the following GREL regular expression to match each column and it’s associated value:

value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)

regex2

To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.

value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)

regex2a

However, as the above example shows, using the greedy .* operator means we match everything in the first group. So instead, we need to use a lazy evaluation to match items within a group: .+?

value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

regex3

So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we need to look for not the expected first pattern in our list:

value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

regex4

Having matched groups, how do we split the relevant items into news columns. One way is to introduce a column separator character sequence (such as ::) that we can split on:

forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')

regex5a

This generates rows of the form:

regex6

We can now split these cells into several columns:

regex7

We use the :: sequence as the separator:

regex8

Once split, the columns should be regularly arranged. For “rogue” items, they should appear in the first new column – any values appearing in the column might be used to help us identify any further tweaks required to our regular expression.

regex9

We now need to do a little more cleaning. For example, tidying up column names:

regex10

And then cleaning down each new column to remove the column heading.

regex11

As a general pattern, use the column name and an optional colon (NOTE: expression should be :? rather than :+):

regex12

To reuse this pattern of operations on future datasets, we can export a description of the transformations applied. Future datasets can then be loaded in to OpenRefine, the operation history pasted in, and the same steps applied. (The following screenshot does not show the operation defined for renaming the new columns or cleaning down them.)

regex13

As ever, writing up this post took as long as working out the recipe…

PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp foo is up to that just at the moment. Any offers?!;-)

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

%d bloggers like this: