OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

Posts Tagged ‘openrefine

Merging Datasets with Common Columns in Google Refine

It’s an often encountered situation, but one that can be a pain to address – merging data from two sources around a common column. Here’s a way of doing it in Google Refine…

Here are a couple of example datasets to import into separate Google Refine projects if you want to play along, both courtesy of the Guardian data blog (pulled through the Google Spreadsheets to Yahoo pipes proxy mentioned here):

- University fees data (CSV via pipes proxy)

- University HESA stats, 2010 (CSV via pipes proxy)

We can now merge data from the two projects by creating a new column from values an existing column within one project that are used to index into a similar column in the other project. Looking at the two datasets, both HESA Code and institution/University look like candidates for merging the data. Which should we go with? I’d go with the unique identifier (i.e. HESA code in the case) every time…

First, create a new column:

Now do the merge, using the cell.cross GREL (Google Refine Expression Language) command. Trivially, and pinching wholesale from the documentation example, we might use the following command to bring in Average Teaching Score data from the second project into the first:

cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0]

Note that there is a null entry and an error entry. It’s possible to add a bit of logic to tidy things up a little:

if (value!='null',cell.cross("Merge Test B", "HESA code").cells["Average Teaching Score"].value[0],'')

Here’s the result:

Coping with not quite matching key columns

Another situation that often arises is that you have two columns that almost but don’t quite match. For example, this dataset has a different name representation that the above datasets (Merge Test C):

There are several text processing tools that we can use to try to help us match columns that differ in well-structured ways:

In the above case, where am I creating a new column based on the contents of the Institution column in Merge Test C, I’m using a couple of string processing tricks… The GREL expression may look complicated, but if you build it up in a stepwise fashion it makes more sense.

For example, the command replace(value,"this", "that") will replace occurrences of “this” in the string defined by value with “that”. If we replace “this” with an empty string (” (two single quotes next to each other) or “” (two double quotes next to each other)), we delete it from value: replace(value,"this", "")

The result of this operation can be embedded in another replace statement: replace(replace(value,"this", "that"),"that","the other"). In this case, the first replace will replace occurrences of “this” with “that”; the result of this operation is passed to the second (outer) replace function, which replaces “that” with “the other”). Try building up the expression in realtime, and see what happens. First use:
toLowercase(value)
(what happens?); then:
replace(toLowercase(value),'the','')
and then:
replace(replace(toLowercase(value),'the',''),'of','')

The fingerprint() function then separates out the individual words that are left, orders them, and returns the result (more detail). Can you see how this might be used to transform a column that originally contains “The University of Aberdeen” to “aberdeen university”, which might be a key in another project dataset?

When trying to reconcile data across two different datasets, you may find you need to try to minimise the distance between almost common key columns by creating new columns in each dataset using the above sorts of technique.

Be careful not to create false positive matches though; and also be mindful that not everything will necessarily match up (you may get empty cells when using cell.cross; (to mitigate this, filter rows using a crossed column to find ones where there was no match and see if you can correct them by hand). Even if you don’t completely successful cross data from one project to another, you might manage to automate the crossing of most of the rows, minimising the amount of hand crafted copying you might have to do to tidy up the real odds and ends…

So for example, here’s what I ended up using to create a “Pure key” column in Merge Test C:
fingerprint(replace(replace(replace(toLowercase(value),'the',''),'of',''),'university',''))

And in Merge Test A I create a “Complementary Key” column from the University column using fingerprint(value)

From the Complementary Key column in Merge Test A we call out to Merge Test C: cell.cross("Merge Test C", "Pure key").cells["UCAS ID"].value[0]

Obviously, this approach is far from ideal (and there may be more “correct” and/or efficient ways of doing this!) and the process described above is admittedly rather clunky, but it does start to reveal some of what’s involved in trying to bring data across to one Google Refine project from another using columns that don’t quite match in the original dataset, although they do (nominally) refer to the same thing, and does provide a useful introductory exercise to some of the really quite powerful text processing commands in Google Refine …

For other ways of combining data from two different data sets, see:
- Merging Two Different Datasets Containing a Common Column With R and R-Studio
- A Further Look at the Orange Data Playground – Filters and File Merging
- Merging CSV data files with Google Fusion Tables

Written by Tony Hirst

May 6, 2011 at 12:44 pm

Comparing Columns in Google Refine

A reader (Cosmin Cabulea) writes: “I have two columns (A and B) and want to identify identical cells.”

I think I misapprehended the point of the question, but it prompted me to create this simple example.

In something like Google Spreadsheets, we could use an if statement to set the value of cells in a new column based on a comparison of the values of two other columns in the same row. In column C, cell C1, for example, we might use a formula of the form:

if(A1=B1,'similar','different')

In Google Refine, we can use a GREL expression to achieve a similar effect. Create a new column and then use an expression of the form:

if(cells["A"].value == cells["B"].value, "similar", "different")

where A and B are the appropriate column headings.

Google refine - compare two columns

If you’re generating the new comparison column from one of the two columns you’re comparing (column with header B, say), you can reference the values of the original column directly:

if(cells["A"].value == value, "similar", "different")

Google refine compare two columns

It strikes me that the pattern scales to comparisons across multiple columns and of arbitrary complexity. For example, using a nested if control flow statement:

if( value == cells["Host"].value, if( cells["amount"].value > 75, 2, 1 ), 0 )

Or using a Boolean operator:

if( and( value=="May",cells['amount'].value > 0 ), 2, 0 )

Ref: Google Refine Basic blog: Compare values from two columns

Aggregating Values for Recurring Column Values

So this, it turns out (I think?!), was more in line with what Cosmin was after. Given something like:

A B
1 2
1 3
2 1
2 3
2 4

generate:

A BVALS
1 2,3
2 2,3,4

Here’s a way of doing that using R (I use the R-Studio environment).

Using some (guess what) F1 data, loaded into the dataframe hun_2011proximity, let’s pull out a sample of laptime data (say the first 10 laps of a race), featuring just the car numbers, and the laptimes (ref: R: subsetting data). First we grab just those rows where the lap column value is less than 11, then we create a frame containing only a couple of the columns (car and laptime) from the dataset (the original hun_2011proximity data frame contained 20 or so columns, including two with headers car and laptime, and 70 laps worth of data):

samp1=subset(hun_2011proximity,lap<11)
sampcols=c("car","laptime")
samp2=samp1[sampcols]

(Thinks: would it be more efficient to do this the other way round, and reduce the data set to 2 cols first before extracting just the first 10 laps worth of data?)

samp2 now contains 240 rows describing 10 laps of data, each row containing data for one car from one lap; each row contains car and laptime data (2 cols).

Now we can run down one column, looking for recurring elements, and generate a new column that contains the aggregate values from another column for each unique element in the first column:

samp3=aggregate(samp2$laptime, samp2['car'],paste,collapse=',')

Here’s what we get as a result:

Combining common column value data elements

Ref: [R] aggregate text column by a few rows

A Couple of Alternative Approaches
Chatting to Cosmin, it turns out the actual requirement was to identify common followers of a set of Twitter accounts. So for example, with columns TwitterID FollowedBy, extract the unique FollowedBy Twitter IDs and then aggregate the TwitterID values (something like aggdata=aggregate(twData$TwitterID, twData['FollowedBy'],paste,collapse=’,’)).

One approach to this would be to look at the data in Gephi, plotting edges as a directed graph from FollowedBy to TwitterID, sizing the nodes according to out degree (so we could see how many of the target accounts each person in the union follower set was following). We could then use filters to reduce the set to just people following lots of the accounts.

Following this line of thought, we could also use a network flavoured representation (e.g. using something like networkx) to construct a graph and run stats on it. (So we could e.g. pull out reports describing the distribution of how many people were following how many of the target accounts, etc.)

Of course, on those occasions where the Google Social API returns Twitter follower names rather than redirect IDs, my Common Friends or Followers on Twitter hack will show common followers of two twitter accounts.

Yet another approach, if we have all the data in a single file, is to do a simple bit of counting using a Unix command line tool. For example, if we have comma separated file containing TwitterID (column 1) and FollowedBy (column 2) columns, we can sort the names in the FollowedBy column and count the number of times they reoccur:

cut -d "," -f 2 twitterdata.csv | sort | uniq -c

Ref: Playing With Large (ish) CSV Files, and Using Them as a Database from the Command Line

Written by Tony Hirst

August 6, 2011 at 11:13 am

Posted in Infoskills, Tinkering

Tagged with ,

Social Interest Positioning – Visualising Facebook Friends’ Likes With Data Grabbed Using Google Refine

What do my Facebook friends have in common in terms of the things they have Liked, or in terms of their music or movie preferences? (And does this say anything about me?!) Here’s a recipe for visualising that data…

After discovering via Martin Hawksey that the recent (December, 2011) 2.5 release of Google Refine allows you to import JSON and XML feeds to bootstrap a new project, I wondered whether it would be able to pull in data from the Facebook API if I was logged in to Facebook (Google Refine does run in the browser after all…)

Looking through the Facebook API documentation whilst logged in to Facebook, it’s easy enough to find exemplar links to things like your friends list (https://graph.facebook.com/me/friends?access_token=A_LONG_JUMBLE_OF_LETTERS) or the list of likes someone has made (https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS); replacing me with the Facebook ID of one of your friends should pull down a list of their friends, or likes, etc.

(Note that validity of the access token is time limited, so you can’t grab a copy of the access token and hope to use the same one day after day.)

Grabbing the link to your friends on Facebook is simply a case of opening a new project, choosing to get the data from a Web Address, and then pasting in the friends list URL:

Google Refine - import Facebook friends list

Click on next, and Google Refine will download the data, which you can then parse as a JSON file, and from which you can identify individual record types:

Google Refine - import Facebook friends

If you click the highlighted selection, you should see the data that will be used to create your project:

Google Refine - click to view the data

You can now click on Create Project to start working on the data – the first thing I do is tidy up the column names:

Google Refine - rename columns

We can now work some magic – such as pulling in the Likes our friends have made. To do this, we need to create the URL for each friend’s Likes using their Facebook ID, and then pull the data down. We can use Google Refine to harvest this data for us by creating a new column containing the data pulled in from a URL built around the value of each cell in another column:

Google Refine - new column from URL

The Likes URL has the form https://graph.facebook.com/me/likes?access_token=A_LONG_JUMBLE_OF_LETTERS which we’ll tinker with as follows:

Google Refine - crafting URLs for new column creation

The throttle control tells Refine how often to make each call. I set this to 500ms (that is, half a second), so it takes a few minutes to pull in my couple of hundred or so friends (I don’t use Facebook a lot;-). I’m not sure what limit the Facebook API is happy with (if you hit it too fast (i.e. set the throttle time too low), you may find the Facebook API stops returning data to you for a cooling down period…)?

Having imported the data, you should find a new column:

Google Refine - new data imported

At this point, it is possible to generate a new column from each of the records/Likes in the imported data… in theory (or maybe not..). I found this caused Refine to hang though, so instead I exprted the data using the default Templating… export format, which produces some sort of JSON output…

I then used this Python script to generate a two column data file where each row contained a (new) unique identifier for each friend and the name of one of their likes:

import simplejson,csv

writer=csv.writer(open('fbliketest.csv','wb+'),quoting=csv.QUOTE_ALL)

fn='my-fb-friends-likes.txt'

data = simplejson.load(open(fn,'r'))
id=0
for d in data['rows']:
	id=id+1
	#'interests' is the column name containing the Likes data
	interests=simplejson.loads(d['interests'])
	for i in interests['data']:
		print str(id),i['name'],i['category']
		writer.writerow([str(id),i['name'].encode('ascii','ignore')])

[I think this R script, in answer to a related @mhawksey Stack Overflow question, also does the trick: R: Building a list from matching values in a data.frame]

I could then import this data into Gephi and use it to generate a network diagram of what they commonly liked:

Sketching common likes amongst my facebook friends

Rather than returning Likes, I could equally have pulled back lists of the movies, music or books they like, their own friends lists (permissions settings allowing), etc etc, and then generated friends’ interest maps on that basis.

[See also: Getting Started With The Gephi Network Visualisation App – My Facebook Network, Part I and how to visualise Google+ networks]

PS dropping out of Google Refine and into a Python script is a bit clunky, I have to admit. What would be nice would be to be able to do something like a “create new rows with new column from column” pattern that would let you set up an iterator through the contents of each of the cells in the column you want to generate the new column from, and for each pass of the iterator: 1) duplicate the original data row to create a new row; 2) add a new column; 3) populate the cell with the contents of the current iteration state. Or something like that…

PPS Related to the PS request, there is a sort of related feature in the 2.5 release of Google Refine that lets you merge data from across rows with a common key into a newly shaped data set: Key/value Columnize. Seeing this, it got me wondering what a fusion of Google Refine and RStudio might be like (or even just R support within Google Refine?)

PPPS this could be interesting – looks like you can test to see if a friendship exists given two Facebook user IDs.

PPPPS This paper in PNAS – Private traits and attributes are predictable from digital records of human behavior – by Kosinski et. al suggests it’s possible to profile people based on their Likes. It would be interesting to compare how robust that profiling is, compared to profiles based on the common Likes of a person’s followers, or the common likes of folk in the same Facebook groups as an individual?

Written by Tony Hirst

January 4, 2012 at 11:06 am

Looking up Images Trademarked By Companies Using OpenCorporates and Google Refine

Listening to Chris Taggart talking about OpenCorporates at netzwerk recherche conf – data, research, stories, I figured I really should start to have a play…

Looking through the example data available from an opencorporates company ID via the API, I spotted that registered trademark data was available. So here’s a quick roundabout way of previewing trademarked images using OpenCorporates and Google Refine.

First step is to grab the data – the opencorporates API reference docs give an example URL for grabbing a company’s (i.e. a legal entity’s) data: http://api.opencorporates.com/companies/gb/00102498/data

Google Refine supports the import of JSON from a URL:

(Hmm, it seems as if we could load in data from several URLs in one go… maybe data from different BP companies?)

Having grabbed the JSON, we can say which blocks we want to import as row items:

We can preview the rows to check we’re bringing in what we expect…

We’ll take this data by clicking on Create Project, and then start to work on it. Because the plan is to grab trademark images, we need to grab data back from OpenCorporates relating to each trademark. We can generate the API call URLs from the datum – id column:

The OpenCorporates data item API calls are of the form http://api.opencorporates.com/data/2601371, which we can generate as follows:

Here’s what we get back:

If we look through the data, there are several fields that may be interesting: the “representative_name_lines (the person/group that registered the trademark), the representative_address_lines, the mark_image_type and most importantly of all, the international_registration_number. Note that some of the trademarks are not images – we’ll end up ignoring those (for the purposes of this post, at least!)

We can pull out these data items into separate columns by creating columns directly from the trademark data column:

The elements are pulled in using expressions of the following form:

Here are the expressions I used (each expression is used to create a new column from the trademark data column that was imported from automatically constructed URLs):

  • value.parseJson().datum.attributes.mark_image_type – the first part of the expression parses the data as JSON, then we navigate using dot notation to the part of the Javascript object we want…
  • value.parseJson().datum.attributes.mark_text
  • value.parseJson().datum.attributes.representative_address_lines
  • value.parseJson().datum.attributes.representative_name_lines
  • value.parseJson().datum.attributes.international_registration_number

Finding how to get images from international registration numbers was a bit of a faff. In the end, I looked up several records on the WIPO website that displayed trademarked images, then looked at the pattern of their URLs. The ones I checked seemed to have the form:
http://www.wipo.int/romarin/images/XX/YY/XXYYNN.typ
where typ is gif or jpg and XXYYNN is the international registration number. (This may or may not be a robust convention, but it worked for the examples I tried…)

The following GREL expression generates the appropriate URL from the trademark column:

if( or(value.parseJson().datum.attributes.mark_image_type==’JPG’, value.parseJson().datum.attributes.mark_image_type==’GIF’), ‘http://www.wipo.int/romarin/images/&#8217; + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2)[0] + ‘/’ + splitByLengths(value.parseJson().datum.attributes.international_registration_number, 2, 2)[1] + ‘/’ + value.parseJson().datum.attributes.international_registration_number + ‘.’ + toLowercase (value.parseJson().datum.attributes.mark_image_type), ”)

The first part checks that we have a GIF or JPG image type identified, and if it does, then we construct the URL path, and finally cast the filetype to lower case, else we return an empty string.

Now we can filter the data to only show rows that contain a trademark image URL:

Finally, we can create a template to export a simple HTML file that will let us preview the image:

Here’s a crude template I tried:

The file is exported as a .txt file, but it’s easy enough to change the suffix to .html so that we can view the fie in a browser, or I can cut and paste the html into this page…

null null
null null
“[\"MURGITROYD & COMPANY\"]“ “[\"17 Lansdowne Road\",\"Croydon, Surrey CRO 2BX\"]“
“[\"A.C. CHILLINGWORTH\",\"GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON EC2M 7BA\"]“
“[\"A.C. CHILLINGWORTH\",\"GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON EC2M 7BA\"]“
“[\"A.C. CHILLINGWORTH\",\"GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON EC2M 7BA\"]“
“[\"A.C. CHILLINGWORTH\",\"GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON EC2M 7BA\"]“
“[\"BP GROUP TRADE MARKS\"]“ “[\"20 Canada Square,\",\"Canary Wharf\",\"London E14 5NJ\"]“
“[\"Murgitroyd & Company\"]“ “[\"Scotland House,\",\"165-169 Scotland Street\",\"Glasgow G5 8PL\"]“
“[\"BP GROUP TRADE MARKS\"]“ “[\"20 Canada Square,\",\"Canary Wharf\",\"London E14 5NJ\"]“
“[\"BP Group Trade Marks\"]“ “[\"20 Canada Square, Canary Wharf\",\"London E14 5NJ\"]“
“[\"ROBERT WILLIAM BOAD\",\"BP p.l.c. - GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON, EC2M 7BA\"]“
“[\"ROBERT WILLIAM BOAD\",\"BP p.l.c. - GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON, EC2M 7BA\"]“
“[\"ROBERT WILLIAM BOAD\",\"BP p.l.c. - GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON, EC2M 7BA\"]“
“[\"ROBERT WILLIAM BOAD\",\"BP p.l.c. - GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON, EC2M 7BA\"]“
“[\"MURGITROYD & COMPANY\"]“ “[\"17 Lansdowne Road\",\"Croydon, Surrey CRO 2BX\"]“
“[\"MURGITROYD & COMPANY\"]“ “[\"17 Lansdowne Road\",\"Croydon, Surrey CRO 2BX\"]“
“[\"MURGITROYD & COMPANY\"]“ “[\"17 Lansdowne Road\",\"Croydon, Surrey CRO 2BX\"]“
“[\"MURGITROYD & COMPANY\"]“ “[\"17 Lansdowne Road\",\"Croydon, Surrey CRO 2BX\"]“
“[\"A.C. CHILLINGWORTH\",\"GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON EC2M 7BA\"]“
“[\"BP Group Trade Marks\"]“ “[\"20 Canada Square, Canary Wharf\",\"London E14 5NJ\"]“
“[\"ROBERT WILLIAM BOAD\",\"GROUP TRADE MARKS\"]“ “[\"Britannic House,\",\"1 Finsbury Circus\",\"LONDON, EC2M 7BA\"]“
“[\"BP GROUP TRADE MARKS\"]“ “[\"20 Canada Square,\",\"Canary Wharf\",\"London E14 5NJ\"]“

Okay – so maybe I need to tidy up the registration related columns, but as a recipe, it sort of works. (Note that it took way longer to create this blog post than it did to come up with the recipe…)

A couple of things that came to mind: having used Google Refine to sketch out this hack, we could now move code it up, maybe in something like Scraperwiki. For example, I only found trademarks registered to one legal entity associated with BP, rather than checking for trademarks held by the myriad number of legal entities associated with BP. I also wonder whether it would be possible to “compile” what Google Refine is doing (import from URL, select row items, run operations against columns, export templated data) as code so that it could be run elsewhere (so for example, could all through steps be exported as a single Javascript or Python script, maybe calling on a GREL/Google Refine library that provides some sort of abstraction layer of virtual machine for the script to make use of?)

PS What’s next…? The trademark data also identifies one or more areas in which the trademark applies; I need to find some way of pulling out each of the “en” attribute values from the items listed in the value.parseJson().datum.attributes.goods_and_services_classifications.

Written by Tony Hirst

March 25, 2012 at 10:40 pm

Data Shaping in Google Refine – Generating New Rows from Multiple Values in a Single Column

One of the things I’ve kept stumbling over in Google Refine is how to use it to reshape a data set, so I had a little play last week and worked out a couple of new (to me) recipes.

The first relates to reshaping data by creating new rows based on columns. For example, suppose we have a data set that has rows relating to Olympics events, and columns relating to Medals, with cell entries detailing the country that won each medal type:

However, suppose that you need to get the data into a different shape – maybe one line per country with an additional column specifying the medal type. Something like this, for example:

How can we generate that sort of view from the original data set? Here’s one way, that works when the columns you want to split into row values are contiguous (that is, next to each other). From the first column in the set of columns you want to be transformed, select Transpose > Transpose cells across columns into rows:

We now set the original selected column headers to be the cell value within a new column – MedalType – and the original cell values the value within a Country column:

(Note that we could also just transform the data into a single column. For example, suppose we had columns relating to courses currently taken by a particular student (Course 1, Course 2, Course 3), with a course code as cell value and one, two or three columns populated per student. If we wanted one row per student per course, we could just map the three columns onto a single column – CourseCode – and assign multiple rows to each student, then filtering out rows with a blank value in the CourseCOde column as required.)

Ticking the Fill down in other columns checkbox ensures that the appropriate Sport and Event values are copied in to the newly created rows:

Having worked out how to do that oft-required bit of data reshaping, I thought I could probably have another go at something that has been troubling me for ages – how to generate multiple rows from a single row where one of the columns contains JSON data (maybe pulled from a web service/API) that contains multiple items. This is a “mate in three” sort of problem, so here’s how I started to try to work it back. Given that I now know how to map columns onto rows, can I work out how to map different results in a JSON response onto different columns?

For example, here’s a result from the Facebook API for a search on a particular OU course code and the word open in a Facebook group name:

{“data”:[{"version":1,"name":"U101 (Open University) start date February 2012","id":"325165900838311"},{"version":1,"name":"Open university, u101- design thinking, October 2011","id":"250227311674865"},{"version":1,"name":"Feb 2011 Starters U101 Design Thinking - Open University","id":"121552081246861"},{"version":1,"name":"Open University - U101 Design Thinking, Feburary 2011","id":"167769429928476"}],”paging”:{“next”:…etc…}}

It returns a couple of results in the data element, in particular group name and group ID. Here’s one way I found of creating one row per group… Start off by creating a new column based on the JSON data column that parses the results in the data element into a list:

We can then iterate over the list items in this new column using the forEach grel command. The join command then joins the elements within each list item, specifically the group ID and name values in each result:

forEach(value.parseJson(),v,[v.id,v.name].join('||'))

You’ll notice that for multiple results, this produces a list of joined items, which we can also join together by extending the GREL expression:

forEach(value.parseJson(),v,[v.id,v.name].join('||')).join('::')

We now have a column that contains ‘||’ and ‘::’ separated items – :: separates individual group results from each other, || separates the id and name for each particular group.

Given we know how to create rows from multiple columns, we could try to split this column into separate columns using Edit column > Split into separate columns. This would create one column per result, which we could then transform into rows, as we did above. Whilst I don’t recommend this route in this particular case, here’s how we could go about doing it…

A far better approach is to use the Edit cells > split multi-valued cells option to automatically create new rows based on splitting the elements in a single column:

Note, however that this creates blanks in the other columns, so we need to Edit cells > Fill down to fill in the blanks in any other columns we want to refer to. After doing that, we end up with something like this:

We could now split the groupPairs column using the || separator to create two columns – Group ID and group name – giving us one row per group, and separate columns identifying the course, group name and group ID.

If the above route seems a little complicated, fear not…Once you apply it, it starts to make sense!

Written by Tony Hirst

July 30, 2012 at 11:50 am

Grabbing Twitter Search Results into Google Refine And Exporting Conversations into Gephi

How can we get a quick snapshot of who’s talking to whom on Twitter in the context of a particular hashtag?

Here’s a quick recipe that shows how…

First we need to grab some search data. The Twitter API documentation provides us with some clues about how to construct a web address/URL that will grab results back from a particular search on Twitter in a machine readable way (that is, as data):

  • http://search.twitter.com/search.format is the base URL, and the format we require is json, which gives us http://search.twitter.com/search.json
  • the query we want is presented using the q= parameter: http://search.twitter.com/search.json?q=searchterm
  • if we want multiple search terms (for example, library skills), they need encoding in a particular way. The easiest was is just to construct your URL, enter it into the location/URL bar of your browser and hit enter, or use a service such as this string encoder. The browser should encode the URL for you. (If the only punctuation in your search phrase are spaces, you can encode them yourself: just change each space to %20, to give something like library%20skills. If you want to encode the # in a hashtag, use %23
  • We want to get back as many results as are allowed at any one time (which happens to be 100), so set rpp=100, that is: http://search.twitter.com/search.json?q=library%20skills&rpp=100
  • results are paged (in the sense of different pages of Google search results, for example), which means we can ask for the first 100 results, the second 100 results and so on as far back as the most recent 1500 tweets (page 15 for rpp=100, or page 30 if we were using rpp=50 (since 15*100 = 30*50 = 1500): http://search.twitter.com/search.json?q=library%20skills&rpp=100&page=1

Clicking on Next provides us with a dialogue that will allow us to load the data from the URLs into Google Refine:

Clicking “Configure Parsing Options” loads the data and provides us with a preview of it:

If you inspect the data that is returned, you should see it has a repeating pattern. Hovering over the various elements allows you to identify what repeating part of the result we want to import. For example, we could just import each tweet:

Or we could import all the data fields – let’s grab them all:

If you click the highlighted text, or click “Update Preview View”, you can get a preview of how the data will appear. To return to the selection view, click “Pick Record Nodes”:

“Create Project” actually generates the project and pulls all the data in… The column names are a little messy, but we can tidy those:

Look for the from_user and to_user columns and rename them source and target respectively… (hovering over a column name pops up tooltip that shows the full column name):

For the example I’m going to describe, we don’t actually need to rename the columns, but it’s handy to know how to do it;-)

We can now filter out all the rows with a “null” value in the target column. It seems a bit fiddly at first, but you soon get used to the procedure… Select the text facet to pop up a window that show the unique elements in the target column and how often they occur. Sort the list by count, and click on the “null” element – it should be highlighted and its setting should appear as “exclude”. The column will now be showing elements in the column that have the null value:

Click on the “Invert” option and the column will now filter out all the “null” elements and only show the elements that have a non-null value – that is, tweets that have a “to_user” value (which is to say, those tweets were sent to a particular user). Here’s what we get:

Let’s now export the source and target data so we can get it into Gephi:

Deselect all the columns, and then select source and target columns; also deselect the ‘output column headers’ – we don’t need headers where this file is going…

Export the custom layout as CSV data:

We can now import this data into another application – Gephi. Gephi is a cross platform package for visualising networks. In the simplest case, it can import two column data files where each row represents two things that are connected to each other. In our case, we have connections between “source” and “target” Twitter names – that is, connections that show when one Twitter user in our search sample has sent a message to another.

Launch Gephi and from the file menu, open the file you exported from Google Refine:

We’ve now got our data into Gephi, where we can start to visualise it…

…but that is a post for another day… (or if you’re impatient, you can find some examples of how to drive Gephi here).

Written by Tony Hirst

October 2, 2012 at 4:45 pm

Posted in Tinkering

Tagged with , , ,

Chit Chat with New Datasets – Facets in OpenRefine (Was /Google Refine/)

One of the many ways of using Google OpenRefine is as a toolkit for getting a feel for the range of variation contained within a dataset using the various faceting options. In the sense of analysis being a conversation with data, this is a bit like an idle chit-chat/getting to know you phase, as a precursor to a full blown conversation.

Faceted search or faceted browsing/navigation typically provides a set of limiting search filters to a set of search results that limits or restricts the displayed results to ones that fulfil certain conditions. In a library catalogue, the facets might refer to metadata fields such as publication date, thus allowing a user to search within a given date range, or publisher:

Where the facet relates to a categorical variable – that is, where there is a set of unique values that the facet can take (such as the names of different publishers) – a view of the facet values will show the names of the different publishers extracted from the original search results. Selecting a particular publisher, for example, will then limit the displayed results to just those results associated with that publisher. For numerical facets, where the quantities associated with the facet related to a number or date (that is, a set of things that have a numerical range), the facet view will show the full range of values contained within that particular facet. The user can then select a subset of results that fall within a specified part of that range.

In the case of Open Refine, facets can be defined on a per column basis. For categorical facets, Refine will identify the set of unique values associated with a particular faceted view that are contained within a column, along with a count of how many times each facet value occurs throughout the column. The user can then choose to view only those rows with a particular (facet selected) value in the faceted column. For columns that contain numbers, Refine will generate a numerical facet that spans the range of values contained within the column, along with a histogram that provides a count of occurrences of numbers within small ranges across the full range.

So what faceting options does Google Refine provide?

Here’s how they work (data used for the examples comes from Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi and JSON import from the Twitter search API…):

- exploring the set of categories described within a column using the text facet:

Faceted views also allow you to view the facet values by occurrence count, so it’s easy to see which the most popular facet values are:

You can also get a tab separated list of facet values:

Sometimes it can be useful to view rows associated with particular facet values that occur a particular number of times, particulalry at the limits (for example, very popular facet values, or uniquely occurring facet values):

- looking at the range of numerical values contained in a column using the numeric facet:

- looking at the distribution over time of column contents using the timeline facet:

Faceting by time requires time-related strings to be parsed as such; sometimes, Refine needs a little bit of help in interpreting an imported string as a time string. So for example, given a “time” string such as Mon, 29 Oct 2012 10:56:52 +0000 from the Twitter search API, we can use the GREL function toDate(value,"EEE, dd MMM y H:m:s") to create a new column with time-cast elements.

(See GRELDateFunctions and the Java SimpleDateFormat class documentation for more details.)

- getting a feel for the correlation of values across numerical columns, and exploring those correlations further, using the scatterplot facet.

This generates a view that generates a set of scatterplots relating to pairwise combinations of all the numerical columns in the dataset:

Clicking on one of these panels allows you to filter points within a particular area of the corresponding scatter chart (click and drag a rectangular area over the points you want to view), effectively allowing you to filter the data across related ranges of two numerical columns at the same time:

A range of customisable faceting options are also provided that allow you to define your own faceting functions:

  • the Custom text… facet;
  • the Custom Numeric… facet

More conveniently, a range of predefined Customized facets are provided that provide shortcuts to “bespoke” faceting functions:

So for example:

  • the word facet splits strings contained in cells into single words, counts their occurrences throughout the column, and then lists unique words and their occurrence count in the facet panel. This faceting option thus provides a way of selecting rows where the contents of a particular column contain one or more specified words. (The user defined GREL custom text facet ngram(value,1) provides a similar (though not identical) result – duplicated words in a cell are identified as unique by the single word ngram function; see also split(value," "), which does seem to replicate the behaviour of the word facet function.)

  • the duplicates facet returns boolean values of true and false; filtering on true values returns all the rows that have duplicated values within a particular column; filtering on false displays all unique rows.
  • the text length facet produces a facet based on the character count(?) of strings in cells within the faceted column; the custom numeric facet length(value) achieves something similar; the related measure, word count, can be achieved using the custom numeric facet length(split(value," "))

Note that facet views can be combined. Selecting multiple rows within a particular facet panel provides a Boolean OR over the selected values (that is, if any of the selected values appear in the column, the corresponding rows will be displayed). To AND conditions, even within the same facet, create a separate facet panel for each ANDed condition.

PS On the OpenRefine (was Google Refine) name change, see From Freebase Gridworks to Google Refine and now OpenRefine. The code repository is now on github: OpenRefine Repository. I also notice that openrefine.org/ has been minted and is running a placeholder instance of WordPress. I wonder if it would be worth setting up an aggregator for community posts, a bit like R-Blogger (for example, I have an RStats category feed from this blog that I syndicate to the RBloggers aggregator, and have just created an OpenRefine category that could feed a OpenRefinery aggregator channel).

PPS for an example of using OpenRefine to find differences between two recordsets, see Owen Stephens’ Using Open Refine for e-journal data.

Written by Tony Hirst

November 6, 2012 at 10:39 am

Finding (Nearly) Duplicate Items in a Data Column

[WARNING - THIS IS A *BAD ADVICE* POST - it describes a trick that sort of works, but the example is contrived and has a better solution - text facet and then cluster on facet (h/t to @mhawksey's Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv] for making me squirm so much through that oversight I felt the need to post this warning…]

Suppose you have a dataset containing a list of Twitter updates, and you are looking for tweets that are retweets or modified retweets of the same original tweet. The OpenRefine Duplicate custom facet will identify different row items in that column that are exact duplicates of each other, but what about when they just don’t quite match: for example, an original tweet and it’s appearance in an RT (where the retweet string contains RT and the name of the original sender), or an MT, where the tweet may have been shortened, or an RT of an RT, or an RT with an additional hashtag. Here’s one strategy for finding similar-ish tweets, such as popular retweets, in the data set using custom text facets in OpenRefine.

The Ngram GREL function generates a list of word ngrams of a specified length from a string. If you imagine a sliding window N words long, the first ngram will be the first N words in the string, the second ngram the second word to the second+N’th word, and so on:

If we can reasonably expect word sequences of length N to appear in out “duplicate-ish” strings, we can generate a facet on ngrams of that length.

It may also be worth experimenting with combining the ngram function with the fingerprint GREL function. The fingerprint function identifies unique words in a string, reduces them to lower case, and then sorts them in alphabetical order:

If we generate the fingerprint of a string, and then run the ngram function, we generate ngrams around the alphabetically ordered fingerprint terms:

For a sizeable dataset, and/or long strings, it’s likely that we’ll get a lot of ngram facet terms:

We could list them all by setting an appropriate choice count limit, or we can limit the facet items to be displayed by displaying the choice counts, setting the range slider to show those facet values that appear in a large number of columns for example, and then ordering the facet items by count:

Even if tweets aren’t identical, if they contain common ngrams we can pull them out.

Note that we might then order the tweets as displayed in the table using time/date order (a note on string to time format conversions can be found in this postFacets in OpenRefine):

Or alternatively, we might choose to view them using a time facet:

Note that when you set a range in the time facet, you can then click on it and slide it as a range, essentially providing a sliding time window control for viewing records that appear over a given time range/duration.

Written by Tony Hirst

November 14, 2012 at 3:09 pm

Geocoding Using the Google Maps Geocoder via OpenRefine

On my to do list in the data education game is an open-ended activity based on collecting examples of how to do common tasks using a variety of different tools and techniques. One such task is geo-coding, getting latitude and longitude co-ordinates for a location based on it’s address or postcode. A couple of recent posts on the School of Data blog provide an Introduction to Geocoding and a howto on geocoding in Google spreadsheets. Here’s another approach, geocoding using OpenRefine [download and install OpenRefine] (see also Paul Bradshaw’s related post from a couple of years ago: Getting Full Addresses for Data from an FOI Response (Using APIs)).

First, we need to grab some data. I’m going to use a data file from the Food Standards Agency that contains food standards ratings for food outlets within a particular local council area. This data file is a relatively simple XML format that OpenRefine should have no trouble handling… Let’s get the data directly from the data file’s web address:

GetSomeDataIntoRefine

OpenRefine will now download the file and give us a preview. What we need to do now is highlight an example of a record element – OpenRefine will create one row for each such record in the original file.

Highlight the record element

(For the purposes of this example, it might have been easier to just select the postcode element within each establishment record.)

Click “Create Project” in the top right corner, and the data will be imported into a tabular view:

wehavedataIncluding postcodes

If we look along the columns, we notice a postcode element- we can use that as the basis for geocoding. (Yes, I know, I know, this dataset already has latitude and longitude data, so we don’t really need to get it for ourselves;-) But I thought this datafile was a handy source of postcodes;-) What we’re going to do now is ask OpenRefine to create a new column that will be populated with data from the Google maps geocoding API that gives the latitude and longitude for each postcode:

weshalladdacolumn

We can pass a postcode to the Google maps geocoder using a URL (web address) constructed as follows:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE"

(We could force the locale – that the postcode is in the UK – by using "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE,UK".)

*IF YOU WANT OPEN GEO-CODED DATA, GOOGLE MAPS WON’T CUT IT… *

Note also that we need to “escape” the postcode so that it works in the URL (this essentially means handling spaces in the postcode). To do this, we will construct a URL using the follow GREL programming language statement:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")

or alternatively:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")+",UK"

What this does is build up the URL from the value of the cells in the column from which our new column is derived. The escape(value,"url") command makes sure that the postcode value is correctly encoded for use in a url.

generate the geocoding URL

We can now let OpenRefine construct a URL containing the postcode for each row in the data file and load the requested data in from the URL (the throttle delay sets the rate at which OpenRefine will make repeated calls to the API – many APIs have a limit on both the rate at which you can call them (once a second, for example), as well as a limit on the total number of calls per hour, or day). The returned data will be added to the appropriate row in the new column:

ecample geo

The looks a bit confusing, doesn’t it? However, it does have some structure to it. Here’s how the data unpacks if we lay it out a bit better:

{ "results" : [
   { "address_components" : [
     { "long_name" : "NG4 4DL",
       "short_name" : "NG4 4DL",
       "types" : [ "postal_code" ] 
     },
     { "long_name" : "Gedling",
       "short_name" : "Gedling",
       "types" : [ "locality", "political" ]
     },
     { "long_name" : "Nottinghamshire",
       "short_name" : "Notts",
       "types" : [ "administrative_area_level_2", "political" ]
     },
     { "long_name" : "United Kingdom",
       "short_name" : "GB", "types" : [ "country", "political" ] 
     },
     { "long_name" : "Nottingham",
       "short_name" : "Nottingham",
       "types" : [ "postal_town" ] 
     }
     ],
     "formatted_address" : "Gedling, Nottinghamshire NG4 4DL, UK",
     "geometry" : {
         "bounds" : {
             "northeast" : { "lat" : 52.97819320, "lng" : -1.09096230 },
             "southwest" : { "lat" : 52.9770690, "lng" : -1.0924360 } 
         },
         "location" : {
           "lat" : 52.97764640,
           "lng" : -1.09182150 
         },
         "location_type" : "APPROXIMATE",
         "viewport" : {
           "northeast" : { "lat" : 52.97898008029150, "lng" : -1.090350169708498 },
           "southwest" : { "lat" : 52.97628211970850, "lng" : -1.093048130291502 }
         }
       },
       "types" : [ "postal_code" ] 
    }
 ], "status" : "OK" }

(For a little more insight into the structure of data like this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure.)

So OpenRefine has pulled back some data – we can now generate another column or two to extract the latitude and longitude the Google Maps geocoding API sent back:

colfromcol

We need to find a way of extracting the data element we want from the data we got back from the geocoding API – we can use another GREL programming language statement to do this:

openrefine jsonparse

Here’s how to construct that statement (try to read the following with reference to the laid out data structure shown above…):

  • value.parseJson() parses the data text (the value of the cell from the original column) into a data element we can work with programmatically
  • value.parseJson().results[0] gets the the data in the first “result” (computer programmes generally start a count with a 0 rather than a 1…)
  • value.parseJson().results[0]["geometry"] says get the “geometry” element from the first result;
  • value.parseJson().results[0]["geometry"]["location"] says get the location element from within that geometry element;;
  • value.parseJson().results[0]["geometry"]["location"]["lat"] says get the latitude element from within the corresponding location element.

Here’s what we get – a new column containing the latitude, as provided by the geocoder API from the postcode:

newcolgeo

Hopefully you can see how we would be able to create a new column containing the longitude: go back to the Googlegeo column, select Edit column – Add column based on this column for a second time, and construct the appropriate parsing statement (value.parseJson().results[0]["geometry"]["location"]["lng"]).

If OpenRefine is new to you, it’s worth reviewing what steps are involved in this recipe:

- loading in an XML data file from a web address/URL;
- generating a set of URLs that call a remote API/data source based on the contents of a particular column, and loading data from those URLs into a new column;
- parsing the data returned from the API to extract a particular data element and adding it to a new column.

Many of these steps are reusable in other contexts, of course…;-) (Can you think of any?!)

PS Tom Morris pointed out to me (on the School of Data mailing list) that the Google Maps geocoding API terms of use only allow the use of the API for geocoding points in order to locate them on a Google Map… Which is probably not what you’d want this data for…;-) However, the recipe is a general one – if you can find another geocoding service that accepts an escaped address term as part of a URL and returns geo-data in a JSON format, you should be able to work out how to use that API instead using this self-same recipe… (Just build a different URL and parse the returned JSON appropriately…)

I’ll try to keep a list of example API URLs for more open geocoders here:

http://open.mapquestapi.com/nominatim/v1/?format=json&q=London (no postcodes?)
http://mapit.mysociety.org/postcode/MK7%206AA (UK postcode geocoder)

Written by Tony Hirst

February 20, 2013 at 10:40 am

First Dabblings with the Gateway to Research API Using OpenRefine

Quick notes from a hackday exploring the Research Councils UK Gateway to Research, a search tool and API over project data relating to UK Research Council project funding awards made since 2006. As an opener, here’s a way of grabbing data related to a particular institution (the OU, for example) using Open Refine, having a quick peek at it, and then pulling down and processing data related to each of the itemised grant awards.

So let’s do a search for the Open University, and limit by organisation:

Gtr Open university search

Here’s the OU’s organisation “homepage” on GtR:

gtr ou homepage

A unique ID associated with the OU, combined with the organisation path element defines the URL for the organisation page:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24

Just add a .json or .xml suffix to get a link to a machine readable version of the data:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json

(Why don’t the HTML pages also include a link to the corresponding JSON or XML versions of the page? As Chris Gutteridge would probably argue for, why don’t the HTML pages also support the autodiscovery of machine friendly versions of the page?)

Noting that there are of the order 400 results for projects associated with the OU, and further that the API returns paged results containing at most 100 results per page, we can copy from the browser address/location bar a URL to pull back 100 results, and tweak it to grab the JSON):

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json?page=1&selectedSortableField=startDate&selectedSortOrder=DESC&fetchSize=100

(You need to “de-end-slash” the URL – i.e. remove the / immediately before the “?”.)

To grab additional pages, simply increase the page count.

In OpenRefine, we can load the JSON in from four URLs:

OpenRefine import data

Yep – looks like the data has been grabbed okay…

open refine - the data has been grabbed

We can now select which blocks of data we want to import as unique rows…

open refine  - identify the import records

OpenRefine nows shows a preview – looks okay to me…

Open Refine - check the preview

Having got the data, we can take the opportunity to explore it. If we cast the start and end date columns to date types:

Open refine Cast to date

We can then use a time facet to explore the distribution of bids with a particular start date, for example:

Open Refine - timeline facet

Hmmm – no successful bids through much of 2006, and none of 2007? Maybe the data isn’t all there yet? (Or maybe there is another identifier minted to the OU and the projects from the missing period are associated with that?)

What else…? How about looking at the distribution of funding amounts – we can use a numeric facet for that:

open refine funding facet

One of the nice features of Open Refine is that we can use the numeric facet sliders to limit the displayed rows to show only projects with start date in a particular range, for example, and/or project funding values within a particular range.

How about the funders – and maybe the number of projects funder? A text facet can help us there:

open refine text facet

Again, we can use the automatically generated facet control to then filter rows to show only projects funded by the EPSRC, for example. Alternatively, in conjunction with other facet controls, we might limit the rows displayed to only those projects funded for less than £20,000 by the ESRC during 2010-11, and so on.

So that’s one way of using Open Refine – to grab and then familiarise ourselves with a dataset. We might also edit the column names to something more meaningful, and then export the data (as a CSV or Excel file, for example) so that we don’t have to grab the data again.

If we return to the Gateway to Research, we notice that each project has it’s own page too (we might also have noticed this from the URL column in the data we downloaded):

gtr - project data

The data set isn’t a huge one, so we should be okay using OpenRefine as an automated downloader, or scraper, pulling down the project data for each OU project.

Open Refine - add col by URL

We need to remember to tweak the contents of the URL column, which currently points to the project HTML page, rather than the JSON data corresponding to that page:

open refine scraper downloader

Data down:-) Now we need to parse it and create columns based on the project data download:

open refine - data down - need to parse it

One thing it might be useful to have is the project’s Grant Reference number. This requires a bit of incantation… Here’s what the structure of the JSON looks like (in part):

gtr project json

Here’s how we can pull out the Grant Reference:

open refine parse json

Let’s briefly unpick that:

value.parseJson()['projectComposition']['project']['grantReference']

The first part – value.parseJson() – gets the text string represented as a JSON object tree. Then we can start to climb down the tree from its root (the root is at the top of the tree…;-)

Here are a few more incantations we can cast to grab particular data elements out:

  • the abstract text: value.parseJson()['projectComposition']['project']['abstractText']
  • the name of the lead research organisation: value.parseJson()['projectComposition']['leadResearchOrganisation']['name']

open refine new cols

We can also do slightly more elaborate forms of parsing… For example, we can pull out the names of people identified as PRINCIPAL INVESTIGATOR:

forEach(filter(value.parseJson()['projectComposition']['projectPerson'],v,v['projectRole'][0]=='PRINCIPAL_INVESTIGATOR'),x,x['name']).join('|')

We can also pull out a list of the Organisations associated with the project:

forEach(value.parseJson()['projectComposition']['organisation'],v,v['name']).join('|')

open refine - more data extracts

Noting that some projects have more than one organisation associated with it, we might want to get a form of the data in which each row only lists a single organisation, the other column values being duplicated for the same project.

The Open Refine “Split Multi-Valued Cells” function will generate new rows from rows where the organisations column contains more than one organisation.

open refine split multi valued cells

We need to split on the | delimiter I specified in the organisations related expression above.

open refine split separated

The result of the split are new rows – but lots of empty columns…

open refien the result of split

We can fill in the values by using “Fill Down” on columns we want to be populated:

open refine fill dwon

Magic:

open refine filled

I can use the same Fill Down trick to populate blank rows in the Grant Reference, PI and abstract columns, for example.

The final step is to export out data, using a custom export format:

open refine cusotm export

This exporter allows us to select – and order – the columns we want to export:

open refine exporter col select

We can also select the output format:

open refine export format select

Once downloaded, we can import the data into other tools. Here’s a peak at loading it into RStudio:

Importing into R

and then viewing it:

data in R

NOTE: I haven’t checked any of this, just blog walked through it..!;-)

Written by Tony Hirst

March 14, 2013 at 5:43 pm

Follow

Get every new post delivered to your Inbox.

Join 725 other followers