Lazy Regular Expressions – Splitting Out Collapsed Columns

Via a tweet, and then an email, to myself and fellow OpenRefine evangelist Owen Stephens (if you haven’t already done so, check out Owen’s wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was contending with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report. So something rather like this:

TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project

The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?

The fields that appeared in this combined column were variable (not all of them appeared in each row), but they always appeared in the same order. So, for example, a complete collapsed record might look like:

Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:

The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project

The pattern appeared to be Column Name: value, except for the Title, where there was no colon.

On occasion, a row would contain an exceptional item that did not conform to the pattern:

ROGUE CODE

One way to split out the columns is to use a regular expression, parsing the column using the “Add column based on this column” action from the column’s Edit column menu.

If all the columns always appeared in the same order, we could write something like the following GREL expression, using a regular expression to match each column and its associated value:

value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)

To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.

value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)

However, using the greedy .* operator means that the first group matches everything, and the remaining optional groups are skipped. So instead, we need to use a lazy (non-greedy) quantifier to match the items within each group: .+?

value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

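As a quick illustration, here’s a rough Python sketch of the behaviour. GREL’s value.match() requires the pattern to consume the whole cell value, which Python’s re.fullmatch() mimics; the pattern below is a slightly simplified variant of the ones above:

import re

# The sample collapsed record from the start of the post
record = ('TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07'
          'Number of participants: 1Awarded date: 22 Aug 2003'
          'Budget Account Ref: AB1234Funding organisation: BBSRC'
          'Total award: £123,456Principal Investigator: Goode, Johnny B.'
          'Project: Funded Project › Research project')

greedy = (r'(Title.+)?(Period.+)?(Number of participants:.+)?'
          r'(Awarded date.+)?(Budget Account Ref:.+)?(Funding organisation.+)?'
          r'(Total award.+)?(Principal Investigator:.+)?(Project:.+)?')
# Swap each greedy quantifier for its lazy equivalent
lazy = greedy.replace('.+', '.+?')

# With greedy quantifiers, the first group swallows the whole record
print(re.fullmatch(greedy, record).groups())
# With lazy ones, each group stops as soon as the next label can match
print(re.fullmatch(lazy, record).groups())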

So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we need to add a leading group that matches anything that is not the expected first pattern in our list, using a negative lookahead:

value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)

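A rough Python sketch again, abridged to the first few groups, shows that the new leading group only fires when the cell does not start with Title:

import re

# Abridged pattern with the catch-all group at the front
pattern = r'((?!Title).*?)?(Title.+?)?(Period.+?)?'

# A rogue cell is caught by the leading group...
print(re.fullmatch(pattern, 'ROGUE CODE').groups())
# -> ('ROGUE CODE', None, None)

# ...while a well formed cell skips past it
print(re.fullmatch(pattern, 'TitleXPeriod: 31/01/04').groups())
# -> (None, 'TitleX', 'Period: 31/01/04')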

Having matched the groups, how do we split the relevant items out into new columns? One way is to introduce a column separator character sequence (such as ::) that we can then split on:

forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')

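A standalone Python rendering of the join step, using a cut-down record and pattern, shows the form of the result:

import re

# Join the matched groups with '::', standing in a space for missing groups
record = 'Funding organisation: BBSRCTotal award: £123,456'
pattern = r'(Period.+?)?(Funding organisation.+?)?(Total award.+?)?'
groups = re.fullmatch(pattern, record).groups()
print('::'.join(g if g is not None else ' ' for g in groups))
# -> ' ::Funding organisation: BBSRC::Total award: £123,456'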

This generates cells in which the matched groups are joined by the :: separator, with a space holding the place of any groups that did not match.

We can now split these cells into several columns using the Edit column > Split into several columns… option, with the :: sequence as the separator.

Once split, the columns should be regularly arranged. Any “rogue” items should appear in the first new column; values appearing there might be used to help identify any further tweaks required to our regular expression.

We now need to do a little more cleaning, for example tidying up the column names.

And then cleaning down each new column to remove the column heading from the cell values.

As a general pattern, we can transform each column by matching the column name followed by an optional colon (:?) and replacing it with an empty string.
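
As a rough Python rendering of that clean-down step, using a hypothetical cell from the Total award column:

import re

# Strip the column label and an optional colon (and any following whitespace)
# from the start of a cell; 'Total award' stands in for the column being cleaned
cell = 'Total award: £123,456'
print(re.sub(r'^Total award:?\s*', '', cell))
# prints: £123,456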

To reuse this pattern of operations on future datasets, we can export a description of the transformations applied (via the Undo/Redo tab’s Extract… option). Future datasets can then be loaded into OpenRefine, the operation history pasted in (Undo/Redo > Apply…), and the same steps applied.

As ever, writing up this post took as long as working out the recipe…

PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp foo is up to that just at the moment. Any offers?!;-)

Things I Take for Granted #287 – Grabbing Stuff from Web Form Drop Down Lists

Over the years, I’ve collected lots of little hacks for tinkering with various data sets. Here’s an example…

A form on a web page, in this case the ROARMAP advanced search form, with country names that map to code values.

If we want to generate a two column lookup table mapping the names on the list to the values that encode them, we can look at the HTML source, grab the list of option elements, then use a regular expression to extract the names and values and rewrite them as a two column, tab separated text file, with one item per line.

The original recipe used a regular expression search and replace in a text editor. Note that the last character of the replace pattern is \n (a newline character), which puts each rewritten item on its own line.
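
We can sketch the same extraction in Python; the option markup and output filename below are illustrative assumptions rather than the actual ROARMAP source:

import re

# Hypothetical fragment of the sort of option list found in the form's HTML source
html = '''<option value="AU">Australia</option>
<option value="BE">Belgium</option>
<option value="GB">United Kingdom</option>'''

# Capture the code from the value attribute, and the displayed country name
pairs = re.findall(r'<option value="([^"]*)">([^<]*)</option>', html)

# Write a two column, tab separated file, one item per line
with open('countries.tsv', 'w') as f:
    for value, name in pairs:
        f.write('{}\t{}\n'.format(name, value))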

Using Regular Expressions to Filter Tweets Based on the Number of Times a Pattern Appears Within Them

Every so often, it’s good to get a little practice in using regular expressions… Via the wonderful F1 Metrics blog, I noticed that the @f1debrief twitter account had been tweeting laptimes from the F1 testing sessions. The data wasn’t republished on the F1 Metrics blog (though I guess I could have tried to scrape it from the charts), but it is still viewable on the @f1debrief timeline, so I grabbed the relevant tweets using the Twitter API statuses/user_timeline call:

# make_twitter_request() and the authenticated twitter_api object are helpers
# set up elsewhere (for example, along the lines of the Mining the Social Web
# recipes); grab two pages of tweets from the @f1debrief timeline
response1 = make_twitter_request(twitter_api.statuses.user_timeline,
                                 screen_name='f1debrief', count=200,
                                 exclude_replies='true', trim_user='true',
                                 include_rts='false', max_id='705792403572178944')
# page further back through the timeline, starting from the last tweet returned
response2 = make_twitter_request(twitter_api.statuses.user_timeline,
                                 screen_name='f1debrief', count=200,
                                 exclude_replies='true', trim_user='true',
                                 include_rts='false', max_id=response1[-1]['id'])
tweets = response1 + response2

The tweets I was interested in (and variants thereof) contained a short header line followed by a list of laptimes, one per line.

The first thing I wanted to do was to limit the list of tweets I’d grabbed to just the ones that contained a list of laptimes. The way I opted to do this was to create a regular expression that spotted patterns of the form N.N (a digit, a decimal point, a digit), and then select tweets that contained three or more instances of this pattern. The re module’s findall() method finds all instances of the specified pattern in a string and returns them in a list.

import re

regexp = re.compile(r'\d\.\d')

#reverse the order of the tweets so they are in ascending time order
for i in tweets[::-1]:
    if len(re.findall(regexp, i['text']))>=3:
        #...do something with the tweets containing 3 or more laptimes
        pass

Inspecting several of the timing related tweets, they generally conform to a pattern of:

  • first line: information about the driver and the tyres (in brackets);
  • a list of laptimes, each time on a separate line;
  • an optional final line, typically starting with a hashtag.

We can use a regular expression match to try to pull out the name of the driver and tyre compound based on a common text pattern:

# The driver name typically appears immediately after the word του
# (the @f1debrief tweets are in Greek); the tyre compound appears in brackets
regexp3 = re.compile(r'^.* του (.*).*\s?\((.*)\).*')
# I could have tried to extract drivers more explicitly from a list of driver names I knew to be participating

# i is one of the laptime tweets from the filtering loop above;
# split the tweet text on the end of line character
lines = i['text'].split('\n')

# Try to pull out the driver name and tyre compound from the first line
m = re.match(regexp3, lines[0])
if m:
    print('', m.group(1).split(' ')[0], '..', m.group(2))
    # There is occasionally some text between the driver name and the bracketed
    # tyre compound, so split on a space and select the first item
    dr = m.group(1).split(' ')[0]
    ty = m.group(2)
else:
    dr = lines[0]
    ty = ''

For the timings, we need to do a little bit of tidying. Generally times were of the form N:NN.NN, but some were of the form NN.NN. In addition, there were occasional rogue spaces in the timings. In this case, we can use regular expressions to substitute on a particular pattern:

# j iterates over the laptime lines of the tweet
for j in lines[1:]:
    # prefix times of the form NN.NN (no minutes part) with '1:'
    j = re.sub(r'^(\d+\.)', r'1:\1', j)
    # remove any rogue spaces around the colon in times of the form N:NN.NN
    j = re.sub(r'^(\d)\s?:\s?(\d)', r'\1:\2', j)
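
For example, applied to some hypothetical raw timing lines:

import re

# Hypothetical raw timing strings, before and after the two substitutions
for j in ['58.123', '1 :58.9', '1: 59.0', '1:57.8']:
    j = re.sub(r'^(\d+\.)', r'1:\1', j)
    j = re.sub(r'^(\d)\s?:\s?(\d)', r'\1:\2', j)
    print(j)
# -> 1:58.123, 1:58.9, 1:59.0, 1:57.8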

The final code can be found in this gist.

There are a few messed up lines, but these are easily handled by hand. (There is often a trade-off between fully automating and partially automating a scrape; sometimes it can be quicker just to do a final bit of tidying up in a text editor.) In the output, I also put in an easily identified marker line (starting with ==) that shows the original first line of a tweet; this also has the benefit of making it easy to find the last line of the previous tweet, just in case that needs tidying too. These marker lines can easily be removed from the file using a regular expression pattern as the basis of a search and replace (replacing with nothing to delete the line).
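
As a sketch of that final clean-up, assuming hypothetical marker lines beginning ==:

import re

# Hypothetical scraped output with == marker lines
text = '''== Driver (tyre)
1:58.123
1:58.9
== Another driver (tyre)
1:59.0'''

# Delete any line starting with == (multiline mode), along with its newline
print(re.sub(r'(?m)^==.*\n?', '', text))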

So that’s three ways of using regular expressions – to count the occurrences of a pattern and use that as the basis of a filter; to extract elements based on pattern matching in a string; and as the basis for a direct pattern based string replace/substitution.