Merging Datasets with Common Columns in Google Refine

It’s an often encountered situation, but one that can be a pain to address – merging data from two sources around a common column. Here’s a way of doing it in Google Refine…

Here are a couple of example datasets to import into separate Google Refine projects if you want to play along, both courtesy of the Guardian data blog (pulled through the Google Spreadsheets to Yahoo pipes proxy mentioned here):

– University fees data (CSV via pipes proxy)

– University HESA stats, 2010 (CSV via pipes proxy)

Load one data file into a Google Refine/OpenRefine project called Merge Test A, the other into a separate project called Merge Test B.

We can now merge data from the two projects by creating a new column from values an existing column within one project that are used to index into a similar column in the other project. Looking at the two datasets, both HESA Code and institution/University look like candidates for merging the data. Which should we go with? I’d go with the unique identifier (i.e. HESA code in the case) every time…

First, create a new column:

Now do the merge, using the cell.cross GREL (Google Refine Expression Language) command. Trivially, and pinching wholesale from the documentation example, we might use the following command to bring in Average Teaching Score data from the second project into the first:

cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0]

Note that there is a null entry and an error entry. It’s possible to add a bit of logic to tidy things up a little:

if (value!='null',cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0],'')

Here’s the result:

Coping with not quite matching key columns

Another situation that often arises is that you have two columns that almost but don’t quite match. For example, this dataset has a different name representation that the above datasets (Merge Test C):

There are several text processing tools that we can use to try to help us match columns that differ in well-structured ways:

In the above case, where am I creating a new column based on the contents of the Institution column in Merge Test C, I’m using a couple of string processing tricks… The GREL expression may look complicated, but if you build it up in a stepwise fashion it makes more sense.

For example, the command replace(value,"this", "that") will replace occurrences of “this” in the string defined by value with “that”. If we replace “this” with an empty string (” (two single quotes next to each other) or “” (two double quotes next to each other)), we delete it from value: replace(value,"this", "")

The result of this operation can be embedded in another replace statement: replace(replace(value,"this", "that"),"that","the other"). In this case, the first replace will replace occurrences of “this” with “that”; the result of this operation is passed to the second (outer) replace function, which replaces “that” with “the other”). Try building up the expression in realtime, and see what happens. First use:
toLowercase(value)
(what happens?); then:
replace(toLowercase(value),'the','')
and then:
replace(replace(toLowercase(value),'the',''),'of','')

The fingerprint() function then separates out the individual words that are left, orders them, and returns the result (more detail). Can you see how this might be used to transform a column that originally contains “The University of Aberdeen” to “aberdeen university”, which might be a key in another project dataset?

When trying to reconcile data across two different datasets, you may find you need to try to minimise the distance between almost common key columns by creating new columns in each dataset using the above sorts of technique.

Be careful not to create false positive matches though; and also be mindful that not everything will necessarily match up (you may get empty cells when using cell.cross; (to mitigate this, filter rows using a crossed column to find ones where there was no match and see if you can correct them by hand). Even if you don’t completely successful cross data from one project to another, you might manage to automate the crossing of most of the rows, minimising the amount of hand crafted copying you might have to do to tidy up the real odds and ends…

So for example, here’s what I ended up using to create a “Pure key” column in Merge Test C:
fingerprint(replace(replace(replace(toLowercase(value),'the',''),'of',''),'university',''))

And in Merge Test A I create a “Complementary Key” column from the University column using fingerprint(value)

From the Complementary Key column in Merge Test A we call out to Merge Test C: cell.cross("Merge Test C", "Pure key").cells["UCAS ID"].value[0]

Obviously, this approach is far from ideal (and there may be more “correct” and/or efficient ways of doing this!) and the process described above is admittedly rather clunky, but it does start to reveal some of what’s involved in trying to bring data across to one Google Refine project from another using columns that don’t quite match in the original dataset, although they do (nominally) refer to the same thing, and does provide a useful introductory exercise to some of the really quite powerful text processing commands in Google Refine …

For other ways of combining data from two different data sets, see:
– Merging Two Different Datasets Containing a Common Column With R and R-Studio
– A Further Look at the Orange Data Playground – Filters and File Merging
– Merging CSV data files with Google Fusion Tables

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering... View all posts by Tony Hirst

20 thoughts on “Merging Datasets with Common Columns in Google Refine”

Pingback: links for 2011-05-08 « Sarah Hartley

Thanks very much. That hint helped me a lot. That makes Google refine much better. Need to dive into that GREL. Waving from OKFN Germany.

Hi Tony,

Great post! I especially like the use of fingerprint()

A couple of extra tricks for what it’s worth…

Rather than going to the trouble of coding in exclusions, a more general (and I find quicker) way is to use facets. In your first example I would apply a Text Facet to HESA code, scroll down in the list of facet choices, click ‘null’ and then click the (oft-overlooked) invert link in the facet title. You should see a black strikethrough on ‘null’ and three fewer rows.

Although in this case I wouldn’t bother at all as it’s been instructed to “On error: set to blank”

For the fingerprint example, another way of doing it using regular expressions:

value.replace(/the|of|and|&|[^\w\s]+|university/i, ”).toLowercase()

| is alternation, “match the OR of OR …”
[…] is a character class \w is any alphanumeric, \s is whitespace, and the ^ at the beginning inverts the class
+ is one or more instances (in this case, single characters)
/i is case-insensitive match

So it’s removing the, of, and, &, and any non-alphanum that’s not a space(!)

Paul

Tony Hirst says:

June 15, 2011 at 9:12 pm

Hi Paul – thanks for those tips… I feel as if I’ve only just started getting an inkling of the tricks Refine might support… I’ve just started looking at R too, which I suspect can also be appropriated in all sorts of ways… I’m not sure what the best workflow there would be though…?

Pingback: Merging Two Different Datasets Containing a Common Column With R and R-Studio « OUseful.Info, the blog…

HI,

I’ve been searching for a way to add data from one column into another existing column in the same project. Has anyone done this? any tips would be helpful..I’m a newbie.

Thanks!
Adrienne

Martin Magdinier (@MagdMartin) says:

August 8, 2011 at 2:37 pm

Adrienne, have a look at this: http://googlerefine.blogspot.com/2011/06/how-to-merge-several-into-one.html
Once you merge your 2 columns, remove those you don’t need.

Great tutorial here. I didn’t know about this functionality in google refine :)

Thanks Martin! I’ll give this a try. Some of the data in one column is correct but some of it is supposed to be in another column so not positive this will work. Maybe facet or filter then doing this will solve the issue.

best!

Pingback: Hands-outs and plans for data track at Global Investigative Journalism Conference in Kiev

Pingback: Video's van de workshopdag datajournalistiek/RegioHack - RegiohackRegiohack

Pingback: RethinkingMedia » Blog Archive » Video-presentaties van RegioHack / datajournalistiek

Pingback: Video-presentaties van RegioHack / datajournalistiek — Rethinkingmedia

Don’t forget SQLite as a possibility to link two or more CSV files, particularly if you know how to wrangle and do joins with SQL already! It worked well for me recently. Firefox has an excellent SQLite “client” add-on that quickly imports from CSV. Not quite as elegant as Google Refine but a worthy complement.

I downloaded the application yesterday, and tried to load in a couple of data mining files. I couldn’t get data into the application. Didn’t matter whether using a .xlxs, .csv or .txt (tab delimited) — nothing happened, the Refine software status bar said “Done”, but when I clicked on the “Next” button, the status bar changed to “Error on the page”. Couldn’t find any other message or clue as to what could be done to make the upload work.

Tony Hirst says:

February 6, 2012 at 9:32 am

@paul were you trying to import files from a URL or from the desktop? Are you sure your browser is up to it? (I suspect Google Refine has a minimum browser spec?) When things go bad for me, I tend to check the forums, and issue/bug list to see if anyone else has: a) encountered similar problems; b) been advised of a fix.

Pingback: Exploring GP Practice Level Prescribing Data « OUseful.Info, the blog…

There are known issues with the cross function at present. I’ve found that it can match successfully against some columns but fails for some others.

Restarting GR seems to work around the issue for now.

http://code.google.com/p/google-refine/issues/detail?id=432

This bug has caused me some pain.

Tom Morris (@tfmorris) says:

September 5, 2012 at 4:12 am

As I described in my response to the bug report, it should be fixed now if you’re running from the current source. Otherwise, it’ll be fixed at the next release — and, as Craig mentions, restarting will fix any intermittent caching problems, even with the current release.