First Dabblings with the Gateway to Research API Using OpenRefine

Quick notes from a hackday exploring the Research Councils UK Gateway to Research, a search tool and API over data relating to UK Research Council project funding awards made since 2006. As an opener, here’s a way of grabbing data related to a particular institution (the OU, for example) using OpenRefine, having a quick peek at it, and then pulling down and processing data related to each of the itemised grant awards.

So let’s do a search for the Open University, and limit by organisation:

Gtr Open university search

Here’s the OU’s organisation “homepage” on GtR:

gtr ou homepage

A unique ID associated with the OU, combined with the organisation path element, defines the URL for the organisation page:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24

Just add a .json or .xml suffix to get a link to a machine readable version of the data:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json

(Why don’t the HTML pages also include a link to the corresponding JSON or XML versions of the page? And, as Chris Gutteridge would probably argue, why don’t they also support the autodiscovery of machine-friendly versions of the page?)

Noting that there are of the order of 400 results for projects associated with the OU, and further that the API returns paged results containing at most 100 results per page, we can copy a URL from the browser address/location bar to pull back 100 results, and tweak it to grab the JSON:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json?page=1&selectedSortableField=startDate&selectedSortOrder=DESC&fetchSize=100

(You need to “de-end-slash” the URL – i.e. remove the / immediately before the “?”.)

To grab additional pages, simply increase the page count.
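As a quick aside, here’s roughly how we might grab those pages programmatically outside OpenRefine. This is a hedged sketch rather than part of the original recipe, and it assumes the API still accepts the page, fetchSize and sort parameters shown in the URL above:

import json
import requests

BASE = ("http://gtr.rcuk.ac.uk/organisation/"
        "89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json")

pages = []
for page in range(1, 5):  # roughly 400 results at 100 per page = 4 pages
    resp = requests.get(BASE, params={
        "page": page,
        "selectedSortableField": "startDate",
        "selectedSortOrder": "DESC",
        "fetchSize": 100,
    })
    resp.raise_for_status()
    pages.append(resp.json())

with open("ou_gtr_pages.json", "w") as f:
    json.dump(pages, f, indent=2)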

In OpenRefine, we can load the JSON in from four URLs:

OpenRefine import data

Yep – looks like the data has been grabbed okay…

open refine - the data has been grabbed

We can now select which blocks of data we want to import as unique rows…

open refine  - identify the import records

OpenRefine now shows a preview – looks okay to me…

Open Refine - check the preview

Having got the data, we can take the opportunity to explore it. If we cast the start and end date columns to date types:

Open refine Cast to date

We can then use a time facet to explore the distribution of bids with a particular start date, for example:

Open Refine - timeline facet

Hmmm – no successful bids through much of 2006, and none in 2007? Maybe the data isn’t all there yet? (Or maybe there is another identifier minted for the OU, and the projects from the missing period are associated with that?)

What else…? How about looking at the distribution of funding amounts – we can use a numeric facet for that:

open refine funding facet

One of the nice features of OpenRefine is that we can use the facet range sliders to limit the displayed rows to only those projects with a start date in a particular range, and/or a project funding value within a particular range, for example.

How about the funders – and maybe the number of projects per funder? A text facet can help us there:

open refine text facet

Again, we can use the automatically generated facet control to then filter rows to show only projects funded by the EPSRC, for example. Alternatively, in conjunction with other facet controls, we might limit the rows displayed to only those projects funded for less than £20,000 by the ESRC during 2010-11, and so on.

So that’s one way of using Open Refine – to grab and then familiarise ourselves with a dataset. We might also edit the column names to something more meaningful, and then export the data (as a CSV or Excel file, for example) so that we don’t have to grab the data again.

If we return to the Gateway to Research, we notice that each project has its own page too (we might also have noticed this from the URL column in the data we downloaded):

gtr - project data

The data set isn’t a huge one, so we should be okay using OpenRefine as an automated downloader, or scraper, pulling down the project data for each OU project.

Open Refine - add col by URL

We need to remember to tweak the contents of the URL column, which currently points to the project HTML page, rather than the JSON data corresponding to that page – for example, by appending a .json suffix (a cell transform along the lines of value + ".json"), as we did for the organisation page:

open refine scraper downloader

Data down:-) Now we need to parse it and create columns based on the project data download:

open refine - data down - need to parse it

One thing it might be useful to have is the project’s Grant Reference number. This requires a bit of incantation… Here’s what the structure of the JSON looks like (in part):

gtr project json

Here’s how we can pull out the Grant Reference:

open refine parse json

Let’s briefly unpick that:

value.parseJson()['projectComposition']['project']['grantReference']

The first part – value.parseJson() – parses the text string into a JSON object tree. Then we can start to climb down the tree from its root (the root is at the top of the tree…;-)

Here are a few more incantations we can cast to grab particular data elements out:

  • the abstract text: value.parseJson()['projectComposition']['project']['abstractText']
  • the name of the lead research organisation: value.parseJson()['projectComposition']['leadResearchOrganisation']['name']

open refine new cols

We can also do slightly more elaborate forms of parsing… For example, we can pull out the names of people identified as PRINCIPAL INVESTIGATOR:

forEach(filter(value.parseJson()['projectComposition']['projectPerson'],v,v['projectRole'][0]=='PRINCIPAL_INVESTIGATOR'),x,x['name']).join('|')

We can also pull out a list of the Organisations associated with the project:

forEach(value.parseJson()['projectComposition']['organisation'],v,v['name']).join('|')

open refine - more data extracts
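For comparison, here’s a rough standalone Python equivalent of those extraction expressions – a sketch, not part of the original recipe, run over a single project’s downloaded JSON. The key paths simply mirror the GREL paths used above, and the filename is made up:

import json

with open("project.json") as f:  # hypothetical local copy of one project's JSON
    comp = json.load(f)["projectComposition"]

grant_ref = comp["project"]["grantReference"]
abstract = comp["project"]["abstractText"]
lead_org = comp["leadResearchOrganisation"]["name"]

# People flagged as principal investigator, joined with | as in the GREL above
pis = "|".join(p["name"] for p in comp["projectPerson"]
               if p["projectRole"][0] == "PRINCIPAL_INVESTIGATOR")

# All organisations associated with the project
orgs = "|".join(o["name"] for o in comp["organisation"])

print(grant_ref, lead_org, pis, orgs, abstract[:100], sep="\n")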

Noting that some projects have more than one organisation associated with them, we might want to get a form of the data in which each row lists only a single organisation, with the other column values duplicated for the same project.

The Open Refine “Split Multi-Valued Cells” function will generate new rows from rows where the organisations column contains more than one organisation.

open refine split multi valued cells

We need to split on the | delimiter we specified in the organisations-related expression above.

open refine split separated

The result of the split is a set of new rows – but with lots of empty cells…

open refine the result of split

We can fill in the values by using “Fill Down” on columns we want to be populated:

open refine fill down

Magic:

open refine filled

We can use the same Fill Down trick to populate the blank cells in the Grant Reference, PI and abstract columns, for example.
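If you’d rather do this reshaping outside OpenRefine, here’s a rough pandas sketch of the same split-and-fill-down step – not part of the original recipe, and the file and column names are made up:

import pandas as pd

df = pd.read_csv("ou_projects.csv")
# Split the |-delimited organisations column into one row per organisation
df["organisation"] = df["organisation"].str.split("|")
df = df.explode("organisation")
# explode copies the other column values onto each new row,
# so no separate fill-down step is needed
df.to_csv("ou_projects_by_organisation.csv", index=False)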

The final step is to export our data, using a custom export format:

open refine custom export

This exporter allows us to select – and order – the columns we want to export:

open refine exporter col select

We can also select the output format:

open refine export format select

Once downloaded, we can import the data into other tools. Here’s a peek at loading it into RStudio:

Importing into R

and then viewing it:

data in R

NOTE: I haven’t checked any of this properly – I just blogged the walkthrough as I went…! ;-)

Geocoding Using the Google Maps Geocoder via OpenRefine

On my to do list in the data education game is an open-ended activity based on collecting examples of how to do common tasks using a variety of different tools and techniques. One such task is geo-coding, getting latitude and longitude co-ordinates for a location based on its address or postcode. A couple of recent posts on the School of Data blog provide an Introduction to Geocoding and a howto on geocoding in Google spreadsheets. Here’s another approach, geocoding using OpenRefine [download and install OpenRefine] (see also Paul Bradshaw’s related post from a couple of years ago: Getting Full Addresses for Data from an FOI Response (Using APIs)).

First, we need to grab some data. I’m going to use a data file from the Food Standards Agency that contains food standards ratings for food outlets within a particular local council area. This data file is a relatively simple XML format that OpenRefine should have no trouble handling… Let’s get the data directly from the data file’s web address:

GetSomeDataIntoRefine

OpenRefine will now download the file and give us a preview. What we need to do now is highlight an example of a record element – OpenRefine will create one row for each such record in the original file.

Highlight the record element

(For the purposes of this example, it might have been easier to just select the postcode element within each establishment record.)

Click “Create Project” in the top right corner, and the data will be imported into a tabular view:

wehavedataIncluding postcodes

If we look along the columns, we notice a postcode element – we can use that as the basis for geocoding. (Yes, I know, I know, this dataset already has latitude and longitude data, so we don’t really need to get it for ourselves;-) But I thought this datafile was a handy source of postcodes;-) What we’re going to do now is ask OpenRefine to create a new column that will be populated with data from the Google Maps geocoding API, giving the latitude and longitude for each postcode:

weshalladdacolumn

We can pass a postcode to the Google Maps geocoder using a URL (web address) constructed as follows:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE"

(We could force the locale – that the postcode is in the UK – by using "http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address=POSTCODE,UK".)

IF YOU WANT OPEN GEO-CODED DATA, GOOGLE MAPS WON’T CUT IT…

Note also that we need to “escape” the postcode so that it works in the URL (this essentially means handling spaces in the postcode). To do this, we will construct a URL using the following GREL programming language statement:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")

or alternatively:

"http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="+escape(value,"url")+",UK"

What this does is build up the URL from the value of the cells in the column from which our new column is derived. The escape(value,"url") command makes sure that the postcode value is correctly encoded for use in a URL.
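For reference, the same URL construction and throttled fetching can be done outside OpenRefine. Here’s a hedged Python sketch, not part of the original recipe – and note that current versions of the Google geocoding API require an API key and HTTPS, so treat the URL as illustrative:

import json
import time
import urllib.parse
import requests

postcodes = ["NG4 4DL", "MK7 6AA"]  # example values
results = {}
for pc in postcodes:
    url = ("http://maps.googleapis.com/maps/api/geocode/json?sensor=false&address="
           + urllib.parse.quote(pc) + ",UK")
    results[pc] = requests.get(url).json()
    time.sleep(1)  # crude throttle: one call per second

print(json.dumps(results["NG4 4DL"], indent=2))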

generate the geocoding URL

We can now let OpenRefine construct a URL containing the postcode for each row in the data file and load the requested data in from that URL (the throttle delay sets the rate at which OpenRefine will make repeated calls to the API – many APIs limit both the rate at which you can call them (once a second, for example) and the total number of calls per hour or day). The returned data will be added to the appropriate row in the new column:

example geo

That looks a bit confusing, doesn’t it? However, it does have some structure to it. Here’s how the data unpacks if we lay it out a bit better:

{ "results" : [
   { "address_components" : [
     { "long_name" : "NG4 4DL",
       "short_name" : "NG4 4DL",
       "types" : [ "postal_code" ] 
     },
     { "long_name" : "Gedling",
       "short_name" : "Gedling",
       "types" : [ "locality", "political" ]
     },
     { "long_name" : "Nottinghamshire",
       "short_name" : "Notts",
       "types" : [ "administrative_area_level_2", "political" ]
     },
     { "long_name" : "United Kingdom",
       "short_name" : "GB", "types" : [ "country", "political" ] 
     },
     { "long_name" : "Nottingham",
       "short_name" : "Nottingham",
       "types" : [ "postal_town" ] 
     }
     ],
     "formatted_address" : "Gedling, Nottinghamshire NG4 4DL, UK",
     "geometry" : {
         "bounds" : {
             "northeast" : { "lat" : 52.97819320, "lng" : -1.09096230 },
             "southwest" : { "lat" : 52.9770690, "lng" : -1.0924360 } 
         },
         "location" : {
           "lat" : 52.97764640,
           "lng" : -1.09182150 
         },
         "location_type" : "APPROXIMATE",
         "viewport" : {
           "northeast" : { "lat" : 52.97898008029150, "lng" : -1.090350169708498 },
           "southwest" : { "lat" : 52.97628211970850, "lng" : -1.093048130291502 }
         }
       },
       "types" : [ "postal_code" ] 
    }
 ], "status" : "OK" }

(For a little more insight into the structure of data like this, see Tech Tips: Making Sense of JSON Strings – Follow the Structure.)

So OpenRefine has pulled back some data – we can now generate another column or two to extract the latitude and longitude the Google Maps geocoding API sent back:

colfromcol

We need to find a way of extracting the data element we want from the data we got back from the geocoding API – we can use another GREL programming language statement to do this:

openrefine jsonparse

Here’s how to construct that statement (try to read the following with reference to the laid out data structure shown above…):

  • value.parseJson() parses the data text (the value of the cell from the original column) into a data element we can work with programmatically;
  • value.parseJson().results[0] gets the data in the first “result” (computer programs generally start a count with a 0 rather than a 1…);
  • value.parseJson().results[0]["geometry"] says get the “geometry” element from the first result;
  • value.parseJson().results[0]["geometry"]["location"] says get the location element from within that geometry element;
  • value.parseJson().results[0]["geometry"]["location"]["lat"] says get the latitude element from within the corresponding location element.

Here’s what we get – a new column containing the latitude, as provided by the geocoder API from the postcode:

newcolgeo

Hopefully you can see how we would be able to create a new column containing the longitude: go back to the Googlegeo column, select Edit column – Add column based on this column for a second time, and construct the appropriate parsing statement (value.parseJson().results[0]["geometry"]["location"]["lng"]).
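The same parsing step looks like this in plain Python, applied to a response shaped like the one laid out above (a sketch, not part of the original recipe):

import json

raw_text = ('{"results": [{"geometry": {"location": '
            '{"lat": 52.9776464, "lng": -1.0918215}}}], "status": "OK"}')
response = json.loads(raw_text)
location = response["results"][0]["geometry"]["location"]
lat, lng = location["lat"], location["lng"]
print(lat, lng)  # 52.9776464 -1.0918215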

If OpenRefine is new to you, it’s worth reviewing what steps are involved in this recipe:

– loading in an XML data file from a web address/URL;
– generating a set of URLs that call a remote API/data source based on the contents of a particular column, and loading data from those URLs into a new column;
– parsing the data returned from the API to extract a particular data element and adding it to a new column.

Many of these steps are reusable in other contexts, of course…;-) (Can you think of any?!)

PS Tom Morris pointed out to me (on the School of Data mailing list) that the Google Maps geocoding API terms of use only allow the use of the API for geocoding points in order to locate them on a Google Map… Which is probably not what you’d want this data for…;-) However, the recipe is a general one – if you can find another geocoding service that accepts an escaped address term as part of a URL and returns geo-data in a JSON format, you should be able to work out how to use that API instead using this self-same recipe… (Just build a different URL and parse the returned JSON appropriately…)

I’ll try to keep a list of example API URLs for more open geocoders here:

http://open.mapquestapi.com/nominatim/v1/?format=json&q=London (no postcodes?)
http://mapit.mysociety.org/postcode/MK7%206AA (UK postcode geocoder)

Finding (Nearly) Duplicate Items in a Data Column

[WARNING – THIS IS A *BAD ADVICE* POST – it describes a trick that sort of works, but the example is contrived and has a better solution – text facet and then cluster on facet (h/t to @mhawksey’s Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv] for making me squirm so much through that oversight I felt the need to post this warning…]

Suppose you have a dataset containing a list of Twitter updates, and you are looking for tweets that are retweets or modified retweets of the same original tweet. The OpenRefine Duplicate custom facet will identify different row items in that column that are exact duplicates of each other, but what about when they just don’t quite match: for example, an original tweet and its appearance in an RT (where the retweet string contains RT and the name of the original sender), or an MT, where the tweet may have been shortened, or an RT of an RT, or an RT with an additional hashtag. Here’s one strategy for finding similar-ish tweets, such as popular retweets, in the data set using custom text facets in OpenRefine.

The ngram GREL function generates a list of word ngrams of a specified length from a string. If you imagine a sliding window N words long, the first ngram will be the first N words in the string, the second ngram will run from the second word to the (N+1)’th word, and so on:

If we can reasonably expect word sequences of length N to appear in our “duplicate-ish” strings, we can generate a facet on ngrams of that length.

It may also be worth experimenting with combining the ngram function with the fingerprint GREL function. The fingerprint function identifies unique words in a string, reduces them to lower case, and then sorts them in alphabetical order:

If we generate the fingerprint of a string, and then run the ngram function, we generate ngrams around the alphabetically ordered fingerprint terms:
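To make that concrete, here’s a rough Python illustration of what the two steps produce – a simplified sketch, not OpenRefine code (the real GREL fingerprint function also strips punctuation and accents):

def ngrams(s, n):
    # sliding window of n consecutive words
    words = s.split()
    return [" ".join(words[i:i + n]) for i in range(len(words) - n + 1)]

def fingerprint(s):
    # unique words, lower-cased, sorted alphabetically (simplified)
    return " ".join(sorted(set(s.lower().split())))

tweet = "RT @example: OpenRefine is handy for cleaning data"
print(ngrams(tweet, 3))
print(ngrams(fingerprint(tweet), 3))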

For a sizeable dataset, and/or long strings, it’s likely that we’ll get a lot of ngram facet terms:

We could list them all by setting an appropriate choice count limit, or we can limit the facet items displayed by showing the choice counts, setting the range slider to show those facet values that appear in a large number of rows, for example, and then ordering the facet items by count:

Even if tweets aren’t identical, if they contain common ngrams we can pull them out.

Note that we might then order the tweets as displayed in the table using time/date order (a note on string to time format conversions can be found in this post: Facets in OpenRefine):

Or alternatively, we might choose to view them using a time facet:

Note that when you set a range in the time facet, you can then click on it and slide it as a range, essentially providing a sliding time window control for viewing records that appear over a given time range/duration.

Chit Chat with New Datasets – Facets in OpenRefine (Was Google Refine)

One of the many ways of using OpenRefine is as a toolkit for getting a feel for the range of variation contained within a dataset, using the various faceting options. In the sense of analysis being a conversation with data, this is a bit like an idle chit-chat/getting to know you phase, as a precursor to a full blown conversation.

Faceted search, or faceted browsing/navigation, typically provides a set of filters over a set of search results, restricting the displayed results to those that fulfil certain conditions. In a library catalogue, the facets might refer to metadata fields such as publication date or publisher, allowing a user to search within a given date range, or to limit results to a particular publisher, for example:

Where the facet relates to a categorical variable – that is, where there is a set of unique values that the facet can take (such as the names of different publishers) – a view of the facet values will show the names of the different publishers extracted from the original search results. Selecting a particular publisher, for example, will then limit the displayed results to just those results associated with that publisher. For numerical facets, where the quantities associated with the facet relate to a number or date (that is, a set of things that have a numerical range), the facet view will show the full range of values contained within that particular facet. The user can then select a subset of results that fall within a specified part of that range.

In the case of Open Refine, facets can be defined on a per column basis. For categorical facets, Refine will identify the set of unique values associated with a particular faceted view that are contained within a column, along with a count of how many times each facet value occurs throughout the column. The user can then choose to view only those rows with a particular (facet selected) value in the faceted column. For columns that contain numbers, Refine will generate a numerical facet that spans the range of values contained within the column, along with a histogram that provides a count of occurrences of numbers within small ranges across the full range.

So what faceting options does Google Refine provide?

Here’s how they work (data used for the examples comes from Even Wholesale Drug Dealers Can Use a Little Retargeting: Graphing, Clustering & Community Detection in Excel and Gephi and JSON import from the Twitter search API…):

– exploring the set of categories described within a column using the text facet:

Faceted views also allow you to order the facet values by occurrence count, so it’s easy to see which facet values are the most popular:

You can also get a tab separated list of facet values:

Sometimes it can be useful to view rows associated with particular facet values that occur a particular number of times, particularly at the limits (for example, very popular facet values, or uniquely occurring facet values):

– looking at the range of numerical values contained in a column using the numeric facet:

– looking at the distribution over time of column contents using the timeline facet:

Faceting by time requires time-related strings to be parsed as such; sometimes, Refine needs a little bit of help in interpreting an imported string as a time string. So for example, given a “time” string such as Mon, 29 Oct 2012 10:56:52 +0000 from the Twitter search API, we can use the GREL function toDate(value,"EEE, dd MMM y H:m:s") to create a new column with time-cast elements.

(See GRELDateFunctions and the Java SimpleDateFormat class documentation for more details.)
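Outside OpenRefine, the equivalent conversion in Python might look like this (a quick sketch, not part of the original post):

from datetime import datetime

s = "Mon, 29 Oct 2012 10:56:52 +0000"
dt = datetime.strptime(s, "%a, %d %b %Y %H:%M:%S %z")
print(dt.isoformat())  # 2012-10-29T10:56:52+00:00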

– getting a feel for the correlation of values across numerical columns, and exploring those correlations further, using the scatterplot facet.

This produces a view containing a set of scatterplots relating to pairwise combinations of all the numerical columns in the dataset:

Clicking on one of these panels allows you to filter points within a particular area of the corresponding scatter chart (click and drag a rectangular area over the points you want to view), effectively allowing you to filter the data across related ranges of two numerical columns at the same time:

A range of customisable faceting options are also provided that allow you to define your own faceting functions:

  • the Custom text… facet;
  • the Custom Numeric… facet

More conveniently, a range of predefined Customized facets provide shortcuts to “bespoke” faceting functions:

So for example:

  • the word facet splits strings contained in cells into single words, counts their occurrences throughout the column, and then lists unique words and their occurrence count in the facet panel. This faceting option thus provides a way of selecting rows where the contents of a particular column contain one or more specified words. (The user defined GREL custom text facet ngram(value,1) provides a similar (though not identical) result – duplicated words in a cell are identified as unique by the single word ngram function; see also split(value," "), which does seem to replicate the behaviour of the word facet function.)
  • the duplicates facet returns boolean values of true and false; filtering on true values returns all the rows that have duplicated values within a particular column; filtering on false displays all unique rows.
  • the text length facet produces a facet based on the character count(?) of strings in cells within the faceted column; the custom numeric facet length(value) achieves something similar; the related measure, word count, can be achieved using the custom numeric facet length(split(value," "))

Note that facet views can be combined. Selecting multiple rows within a particular facet panel provides a Boolean OR over the selected values (that is, if any of the selected values appear in the column, the corresponding rows will be displayed). To AND conditions, even within the same facet, create a separate facet panel for each ANDed condition.
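As a rough analogy for how those selections combine – not part of the original post, and with made-up column names – the same OR-within-a-facet and AND-across-facets logic in pandas might look like this:

import pandas as pd

df = pd.DataFrame({
    "funder": ["EPSRC", "ESRC", "AHRC", "ESRC"],
    "value": [150000, 18000, 42000, 95000],
})

# Selecting several values inside one facet behaves like OR
or_rows = df[df["funder"].isin(["EPSRC", "ESRC"])]
# Combining separate facets behaves like AND
and_rows = df[(df["funder"] == "ESRC") & (df["value"] < 20000)]
print(or_rows, and_rows, sep="\n\n")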

PS On the OpenRefine (was Google Refine) name change, see From Freebase Gridworks to Google Refine and now OpenRefine. The code repository is now on GitHub: OpenRefine Repository. I also notice that openrefine.org/ has been minted and is running a placeholder instance of WordPress. I wonder if it would be worth setting up an aggregator for community posts, a bit like R-Bloggers (for example, I have an RStats category feed from this blog that I syndicate to the R-Bloggers aggregator, and have just created an OpenRefine category that could feed an OpenRefinery aggregator channel).

PPS for an example of using OpenRefine to find differences between two recordsets, see Owen Stephens’ Using Open Refine for e-journal data.