OUseful.Info, the blog…

Trying to find useful things to do with emerging technologies in open education

First Dabblings with the Gateway to Research API Using OpenRefine

Quick notes from a hackday exploring the Research Councils UK Gateway to Research, a search tool and API over project data relating to UK Research Council project funding awards made since 2006. As an opener, here’s a way of grabbing data related to a particular institution (the OU, for example) using Open Refine, having a quick peek at it, and then pulling down and processing data related to each of the itemised grant awards.

So let’s do a search for the Open University, and limit by organisation:

Gtr Open university search

Here’s the OU’s organisation “homepage” on GtR:

gtr ou homepage

A unique ID associated with the OU, combined with the organisation path element defines the URL for the organisation page:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24

Just add a .json or .xml suffix to get a link to a machine readable version of the data:

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json

(Why don’t the HTML pages also include a link to the corresponding JSON or XML versions of the page? As Chris Gutteridge would probably argue for, why don’t the HTML pages also support the autodiscovery of machine friendly versions of the page?)

Noting that there are of the order 400 results for projects associated with the OU, and further that the API returns paged results containing at most 100 results per page, we can copy from the browser address/location bar a URL to pull back 100 results, and tweak it to grab the JSON):

http://gtr.rcuk.ac.uk/organisation/89E6D9CB-DAF8-40A2-A9EF-B330A5A7FC24.json?page=1&selectedSortableField=startDate&selectedSortOrder=DESC&fetchSize=100

(You need to “de-end-slash” the URL – i.e. remove the / immediately before the “?”.)

To grab additional pages, simply increase the page count.

In OpenRefine, we can load the JSON in from four URLs:

OpenRefine import data

Yep – looks like the data has been grabbed okay…

open refine - the data has been grabbed

We can now select which blocks of data we want to import as unique rows…

open refine  - identify the import records

OpenRefine nows shows a preview – looks okay to me…

Open Refine - check the preview

Having got the data, we can take the opportunity to explore it. If we cast the start and end date columns to date types:

Open refine Cast to date

We can then use a time facet to explore the distribution of bids with a particular start date, for example:

Open Refine - timeline facet

Hmmm – no successful bids through much of 2006, and none of 2007? Maybe the data isn’t all there yet? (Or maybe there is another identifier minted to the OU and the projects from the missing period are associated with that?)

What else…? How about looking at the distribution of funding amounts – we can use a numeric facet for that:

open refine funding facet

One of the nice features of Open Refine is that we can use the numeric facet sliders to limit the displayed rows to show only projects with start date in a particular range, for example, and/or project funding values within a particular range.

How about the funders – and maybe the number of projects funder? A text facet can help us there:

open refine text facet

Again, we can use the automatically generated facet control to then filter rows to show only projects funded by the EPSRC, for example. Alternatively, in conjunction with other facet controls, we might limit the rows displayed to only those projects funded for less than £20,000 by the ESRC during 2010-11, and so on.

So that’s one way of using Open Refine – to grab and then familiarise ourselves with a dataset. We might also edit the column names to something more meaningful, and then export the data (as a CSV or Excel file, for example) so that we don’t have to grab the data again.

If we return to the Gateway to Research, we notice that each project has it’s own page too (we might also have noticed this from the URL column in the data we downloaded):

gtr - project data

The data set isn’t a huge one, so we should be okay using OpenRefine as an automated downloader, or scraper, pulling down the project data for each OU project.

Open Refine - add col by URL

We need to remember to tweak the contents of the URL column, which currently points to the project HTML page, rather than the JSON data corresponding to that page:

open refine scraper downloader

Data down:-) Now we need to parse it and create columns based on the project data download:

open refine - data down - need to parse it

One thing it might be useful to have is the project’s Grant Reference number. This requires a bit of incantation… Here’s what the structure of the JSON looks like (in part):

gtr project json

Here’s how we can pull out the Grant Reference:

open refine parse json

Let’s briefly unpick that:

value.parseJson()['projectComposition']['project']['grantReference']

The first part – value.parseJson() – gets the text string represented as a JSON object tree. Then we can start to climb down the tree from its root (the root is at the top of the tree…;-)

Here are a few more incantations we can cast to grab particular data elements out:

  • the abstract text: value.parseJson()['projectComposition']['project']['abstractText']
  • the name of the lead research organisation: value.parseJson()['projectComposition']['leadResearchOrganisation']['name']

open refine new cols

We can also do slightly more elaborate forms of parsing… For example, we can pull out the names of people identified as PRINCIPAL INVESTIGATOR:

forEach(filter(value.parseJson()['projectComposition']['projectPerson'],v,v['projectRole'][0]=='PRINCIPAL_INVESTIGATOR'),x,x['name']).join('|')

We can also pull out a list of the Organisations associated with the project:

forEach(value.parseJson()['projectComposition']['organisation'],v,v['name']).join('|')

open refine - more data extracts

Noting that some projects have more than one organisation associated with it, we might want to get a form of the data in which each row only lists a single organisation, the other column values being duplicated for the same project.

The Open Refine “Split Multi-Valued Cells” function will generate new rows from rows where the organisations column contains more than one organisation.

open refine split multi valued cells

We need to split on the | delimiter I specified in the organisations related expression above.

open refine split separated

The result of the split are new rows – but lots of empty columns…

open refien the result of split

We can fill in the values by using “Fill Down” on columns we want to be populated:

open refine fill dwon

Magic:

open refine filled

I can use the same Fill Down trick to populate blank rows in the Grant Reference, PI and abstract columns, for example.

The final step is to export out data, using a custom export format:

open refine cusotm export

This exporter allows us to select – and order – the columns we want to export:

open refine exporter col select

We can also select the output format:

open refine export format select

Once downloaded, we can import the data into other tools. Here’s a peak at loading it into RStudio:

Importing into R

and then viewing it:

data in R

NOTE: I haven’t checked any of this, just blog walked through it..!;-)

Written by Tony Hirst

March 14, 2013 at 5:43 pm

One Response

Subscribe to comments with RSS.

  1. [...] done a little digging before around university funding (e.g. and First Dabblings with the Gateway to Research API Using OpenRefine), though as mentioned in Public Sector Transparency – Do We Need Open Receipts Data as Well as [...]


Comments are closed.

Follow

Get every new post delivered to your Inbox.

Join 820 other followers

%d bloggers like this: