First Dabblings with the Gateway to Research API Using OpenRefine
Quick notes from a hackday exploring the Research Councils UK Gateway to Research, a search tool and API over project data relating to UK Research Council project funding awards made since 2006. As an opener, here’s a way of grabbing data related to a particular institution (the OU, for example) using Open Refine, having a quick peek at it, and then pulling down and processing data related to each of the itemised grant awards.
So let’s do a search for the Open University, and limit by organisation:
Here’s the OU’s organisation “homepage” on GtR:
A unique ID associated with the OU, combined with the organisation path element defines the URL for the organisation page:
Just add a .json or .xml suffix to get a link to a machine readable version of the data:
(Why don’t the HTML pages also include a link to the corresponding JSON or XML versions of the page? As Chris Gutteridge would probably argue for, why don’t the HTML pages also support the autodiscovery of machine friendly versions of the page?)
Noting that there are of the order 400 results for projects associated with the OU, and further that the API returns paged results containing at most 100 results per page, we can copy from the browser address/location bar a URL to pull back 100 results, and tweak it to grab the JSON):
(You need to “de-end-slash” the URL – i.e. remove the / immediately before the “?”.)
To grab additional pages, simply increase the page count.
In OpenRefine, we can load the JSON in from four URLs:
Yep – looks like the data has been grabbed okay…
We can now select which blocks of data we want to import as unique rows…
OpenRefine nows shows a preview – looks okay to me…
Having got the data, we can take the opportunity to explore it. If we cast the start and end date columns to date types:
We can then use a time facet to explore the distribution of bids with a particular start date, for example:
Hmmm – no successful bids through much of 2006, and none of 2007? Maybe the data isn’t all there yet? (Or maybe there is another identifier minted to the OU and the projects from the missing period are associated with that?)
What else…? How about looking at the distribution of funding amounts – we can use a numeric facet for that:
One of the nice features of Open Refine is that we can use the numeric facet sliders to limit the displayed rows to show only projects with start date in a particular range, for example, and/or project funding values within a particular range.
How about the funders – and maybe the number of projects funder? A text facet can help us there:
Again, we can use the automatically generated facet control to then filter rows to show only projects funded by the EPSRC, for example. Alternatively, in conjunction with other facet controls, we might limit the rows displayed to only those projects funded for less than £20,000 by the ESRC during 2010-11, and so on.
So that’s one way of using Open Refine – to grab and then familiarise ourselves with a dataset. We might also edit the column names to something more meaningful, and then export the data (as a CSV or Excel file, for example) so that we don’t have to grab the data again.
If we return to the Gateway to Research, we notice that each project has it’s own page too (we might also have noticed this from the URL column in the data we downloaded):
The data set isn’t a huge one, so we should be okay using OpenRefine as an automated downloader, or scraper, pulling down the project data for each OU project.
We need to remember to tweak the contents of the URL column, which currently points to the project HTML page, rather than the JSON data corresponding to that page:
Data down:-) Now we need to parse it and create columns based on the project data download:
One thing it might be useful to have is the project’s Grant Reference number. This requires a bit of incantation… Here’s what the structure of the JSON looks like (in part):
Here’s how we can pull out the Grant Reference:
Let’s briefly unpick that:
The first part – value.parseJson() – gets the text string represented as a JSON object tree. Then we can start to climb down the tree from its root (the root is at the top of the tree…;-)
Here are a few more incantations we can cast to grab particular data elements out:
- the abstract text: value.parseJson()['projectComposition']['project']['abstractText']
- the name of the lead research organisation: value.parseJson()['projectComposition']['leadResearchOrganisation']['name']
We can also do slightly more elaborate forms of parsing… For example, we can pull out the names of people identified as PRINCIPAL INVESTIGATOR:
We can also pull out a list of the Organisations associated with the project:
Noting that some projects have more than one organisation associated with it, we might want to get a form of the data in which each row only lists a single organisation, the other column values being duplicated for the same project.
The Open Refine “Split Multi-Valued Cells” function will generate new rows from rows where the organisations column contains more than one organisation.
We need to split on the | delimiter I specified in the organisations related expression above.
The result of the split are new rows – but lots of empty columns…
We can fill in the values by using “Fill Down” on columns we want to be populated:
I can use the same Fill Down trick to populate blank rows in the Grant Reference, PI and abstract columns, for example.
The final step is to export out data, using a custom export format:
This exporter allows us to select – and order – the columns we want to export:
We can also select the output format:
Once downloaded, we can import the data into other tools. Here’s a peak at loading it into RStudio:
and then viewing it:
NOTE: I haven’t checked any of this, just blog walked through it..!;-)