Opening Up Access to Data: Why APIs May Not Be Enough…

Last week, a post on the ONS (Office for National Statistics) Digital Publishing blog caught my eye: Introducing the New Improved ONS API which apparently “mak[es] things much easier to work with”.

Ooh… exciting…. maybe I can use this to start hacking together some notebooks?:-)

It was followed a few days later by this one – ONS-API, Just the Numbers which described “a simple bit of code for requesting some data and then turning that into ‘just the raw numbers’” – a blog post that describes how to get a simple statistic, as a number, from the API. The API that “mak[es] things much easier to work with”.

After a few hours spent hacking away over the weekend, looking round various bits of the API, I still wasn’t really in a position to discover where to find the numbers, let alone get numbers out of the API in a reliable way. (You can see my fumblings here.) Note that I’m happy to be told I’m going about this completely the wrong way and didn’t find the baby steps guide I need to help me use it properly.

So FWIW, here are some reflections, from a personal standpoint, about the whole API thing from the perspective of someone who couldn’t get it together enough to get the thing working …


Most data users aren’t programmers. And I’m not sure how many programmers are data junkies, let alone statisticians and data analysts.

For data users who do dabble with programming – in R, for example, or python (for example, using the pandas library) – the offer of an API is often seen as providing a way of interrogating a data source and getting the bits of data you want. The alternative to this is often having to download a huge great dataset yourself and then querying it or partitioning it yourself to get just the data elements you want to make use of (for example, Working With Large Text Files – Finding UK Companies by Postcode or Business Area).

That’s fine, insofar as it goes, but it starts to give the person who wants to do some data analysis a data management problem too. And for data users who aren’t happy working with gigabyte data files, it can sometimes be a blocker. (Big file downloads also take time, and incur bandwidth costs.)

For me, a stereotypical data user might be someone who typically wants to be able to quickly and easily get just the data they want from the API into a data representation that is native to the environment they are working in, and that they are familiar with working with.

This might be a spreadsheet user or it might be a code (R, pandas etc) user.

In the same way that spreadsheet users want files in XLS or CSV format that they can easily open (formats that can also be directly opened into appropriate data structures in R or pandas), I increasingly look not for APIs, but for API wrappers, that bring API calls and the results from them directly into the environment I’m working in, in a form appropriate to that environment.

So for example, in R, I make use of the FAOstat package, which also offers an interface to the World Bank Indicators datasets. In pandas, a remote data access handler for the World Bank Indicators portal allows me to make simple requests for that data.

At a level up (or should that be “down”?) from the API wrapper are libraries that parse typical response formats. For example, Statistics Norway seem to publish data using the json-stat format, the format used in the new ONS API update. This IPython notebook shows how to use the pyjstat python package to parse the json-stat data directly into a pandas dataframe (I couldn’t get it to work with the ONS data feed – not sure if the problem was me, the package, or the data feed; which is another problem – working out where the problem is…). For parsing data returned from SPARQL Linked Data endpoints, packages such as SPARQLwrapper get the data into Python dicts, if not pandas dataframes directly. (A SPARQL i/o wrapper for pandas could be quite handy?)
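As a rough sketch of what such a response parser has to do, the following flattens a json-stat dataset into one record per cell using only the standard library. It assumes the json-stat 2.0 dataset layout (id, dimension, value), which may not be exactly what the ONS feed returned at the time; the sample dataset is invented for illustration, and jsonstat_rows and category_ids are hypothetical helper names, not part of pyjstat.

```python
from itertools import product


def category_ids(dimension):
    """Return a dimension's category ids in positional order."""
    category = dimension["category"]
    index = category.get("index")
    if index is None:
        # json-stat lets a single-category dimension omit "index"
        return list(category["label"])
    if isinstance(index, dict):
        # {"id": position} form: order the ids by position
        return sorted(index, key=index.get)
    return list(index)  # plain list form


def jsonstat_rows(dataset):
    """Flatten a json-stat 2.0 dataset into one dict per cell."""
    dim_ids = dataset["id"]
    categories = [category_ids(dataset["dimension"][d]) for d in dim_ids]
    values = dataset["value"]
    rows = []
    # Values are stored in row-major order (last dimension varies fastest),
    # which is the order in which itertools.product yields combinations.
    for position, combo in enumerate(product(*categories)):
        row = dict(zip(dim_ids, combo))
        if isinstance(values, dict):
            # sparse form: values keyed by position as a string
            row["value"] = values.get(str(position))
        else:
            row["value"] = values[position]
        rows.append(row)
    return rows


# Invented sample data, purely for illustration.
sample = {
    "version": "2.0",
    "class": "dataset",
    "id": ["area", "year"],
    "size": [2, 2],
    "dimension": {
        "area": {"category": {"index": ["E", "W"]}},
        "year": {"category": {"index": {"2014": 0, "2015": 1}}},
    },
    "value": [10, 11, 20, 21],
}

rows = jsonstat_rows(sample)
for row in rows:
    print(row)
```

A list of records like this can be handed straight to pandas.DataFrame(rows) to get a dataframe.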

At the user level, IPython Notebooks (my current ‘can be used to solve all known problems’ piece of magic tech!;-) provide a great way of demonstrating not just how to get started with an API, but also of encouraging the development within the notebook of reusable components, as well as demonstrations of how to use the data. The latter demonstrations have the benefit of requiring that the API demo does actually get the data into a form that is usable within the environment. It also helps folk see what it means to be able to get data into the environment (it means you can do things like the things done in the demo…; and if you can do that, then you can probably also do other related things…).

So am I happy when I see APIs announced? Yes and no… I’m more interested in having API wrappers available within my data wrangling environment. If that’s a fully blown wrapper, great. If that sort of wrapper isn’t available, but I can use a standard data feed parsing library to parse results pulled from easily generated RESTful URLs, I can just about work out how to create the URLs, so that’s not too bad either.

When publishing APIs, it’s worth considering who can address them and use them. Just because you publish a data API doesn’t mean a data analyst can necessarily use the data, because they may not be (are likely not to be) a programmer. And if ten, or a hundred, or a thousand potential data users all have to implement the same sort of glue code to get the data from the API into the same sort of analysis environment, that’s not necessarily efficient either. (Data users may feel they can hack some code to get the data from the API into the environment for their particular use case, but may not be willing to release it as a general, tested and robust API wrapper, certainly not a stable production level one.)

This isn’t meant to be a slight against the ONS API, more a reflection on some of the things I was thinking as I hacked my weekend away…

PS I don’t know how easy it is to run Python code in R, but the R magic in IPython notebooks supports the running of R code within a notebook running a Python kernel, with the handing over of data from R dataframes to python dataframes. Which is to say, if there’s an R package available, for someone who can run R via an IPython context, it’s available via python too.

PPS I notice that from some of the ONS API calls we can get links to URLs of downloadable datasets (though when I tried some of them, I got errors trying to unzip the results). This provides an intermediate way of providing API access to a dataset – search based API calls that allow discovery of a dataset, then the download and automatic unpacking of that dataset into a native data representation, such as one or more data frames.
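The "download and automatically unpack" step might look something like the sketch below. It assumes the downloaded archive is a zip containing one or more CSV files, which is a guess rather than something checked against the ONS downloads (the ones tried above would not even unzip); tables_from_zip is a hypothetical helper, and the archive here is built in memory so the example runs without a network connection.

```python
import csv
import io
import zipfile


def tables_from_zip(zip_bytes):
    """Return {member name: list of row dicts} for each CSV file in a zip archive."""
    tables = {}
    with zipfile.ZipFile(io.BytesIO(zip_bytes)) as archive:
        for name in archive.namelist():
            if not name.lower().endswith(".csv"):
                continue  # skip readme files and other non-CSV members
            with archive.open(name) as member:
                # utf-8-sig quietly drops a byte order mark if one is present
                text = io.TextIOWrapper(member, encoding="utf-8-sig", newline="")
                tables[name] = list(csv.DictReader(text))
    return tables


# Build a tiny archive in memory to stand in for a downloaded dataset.
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr("example.csv", "period,value\n2015 Q3,5.3\n2015 Q4,5.1\n")
    archive.writestr("readme.txt", "not a table")

tables = tables_from_zip(buffer.getvalue())
print(tables["example.csv"])
```

Each table comes back as a list of dicts, one per row, which is a shape that pandas.DataFrame accepts directly.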

JSON Data Goodness on the new ONS (Office for National Statistics) Website

Via the @ONSDigital blog (http://blog.ons.digital/2016/02/25/new-ons-website-launched/), it seems that the new Office for National Statistics website, which publishes the UK’s official government statistics, has just been released in all its glory.

One thing I noticed was that it’s now trivial to get hold of data, in JSON format, from a statistics page by adding /data to the end of a stats page URL.

For example, the data behind the Consumer Price Index page:

https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/d7g7

can be found at:

https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/d7g7/data

Data can also be downloaded in CSV and XLS formats using another derivable URL. For example, to download the data as CSV:

http://www.ons.gov.uk/generator?format=csv&uri=/economy/inflationandpriceindices/timeseries/d7g7

Use the value xls rather than csv to get the spreadsheet file.

Another nice feature is that the alphabetic listing of statistics:

http://www.ons.gov.uk/atoz?az=a

can also be accessed in a JSON data format by adding the /data path element:

http://www.ons.gov.uk/atoz/data?az=a
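The URL patterns above are regular enough to generate from a page URL. A minimal sketch, assuming the patterns hold for other pages as they do for the examples shown (data_url and download_url are made-up helper names):

```python
from urllib.parse import urlencode, urlsplit, urlunsplit


def data_url(page_url):
    """Insert /data at the end of the path, keeping any query string."""
    parts = urlsplit(page_url)
    path = parts.path.rstrip("/") + "/data"
    return urlunsplit((parts.scheme, parts.netloc, path, parts.query, ""))


def download_url(page_url, fmt="csv"):
    """Build the generator URL for a csv or xls download of a stats page."""
    if fmt not in ("csv", "xls"):
        raise ValueError("fmt should be 'csv' or 'xls'")
    parts = urlsplit(page_url)
    # safe="/" leaves the slashes in the uri value unescaped, as in the example URL
    query = urlencode({"format": fmt, "uri": parts.path.rstrip("/")}, safe="/")
    return urlunsplit((parts.scheme, parts.netloc, "/generator", query, ""))


page = "https://www.ons.gov.uk/economy/inflationandpriceindices/timeseries/d7g7"
print(data_url(page))
print(download_url(page, "xls"))
print(data_url("http://www.ons.gov.uk/atoz?az=a"))
```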

(Paged) search results can also be returned in JSON format:

https://www.ons.gov.uk/search/data?q=unemployment&page=12

Search results can also be narrowed down (following the tabs on the search results HTML page) from All to Data or Publications:

http://www.ons.gov.uk/searchdata/data?q=unemployment

http://www.ons.gov.uk/searchpublication/data?q=unemployment
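The three search variants can be wrapped in a single helper. This is only a sketch built from the URLs above: search_url and the kind names are invented, and whether the page parameter is honoured on the narrowed-down searches is an assumption that has not been checked.

```python
from urllib.parse import urlencode

# Paths taken from the example URLs; the dictionary keys are made-up labels.
SEARCH_PATHS = {
    "all": "/search/data",
    "data": "/searchdata/data",
    "publications": "/searchpublication/data",
}


def search_url(query, kind="all", page=None, base="https://www.ons.gov.uk"):
    """Build a JSON search URL for the given query and result type."""
    params = {"q": query}
    if page is not None:
        params["page"] = page
    return base + SEARCH_PATHS[kind] + "?" + urlencode(params)


print(search_url("unemployment", page=12))
print(search_url("unemployment", kind="publications"))
```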

It’s also possible to get time series charts as image URLs of the form:

https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/timeseries/lf24/linechartimage?series=&fromMonth=01&fromYear=1971&toMonth=12&toYear=2015&frequency=months

(The date range filters don’t seem to work if applied to the /data URLs?)
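A chart image URL of that form can be generated from a time series page URL and a date range. A small sketch, with linechart_image_url as a hypothetical helper; the parameter names come from the example URL, and "months" is the only frequency value seen there, so any other value is a guess.

```python
from urllib.parse import urlencode


def linechart_image_url(page_url, from_month, from_year, to_month, to_year,
                        frequency="months"):
    """Build a line chart image URL for a time series page and date range."""
    params = [
        ("series", ""),  # left empty, as in the example URL
        ("fromMonth", f"{from_month:02d}"),
        ("fromYear", str(from_year)),
        ("toMonth", f"{to_month:02d}"),
        ("toYear", str(to_year)),
        ("frequency", frequency),
    ]
    return page_url.rstrip("/") + "/linechartimage?" + urlencode(params)


series_page = ("https://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/"
               "employmentandemployeetypes/timeseries/lf24")
print(linechart_image_url(series_page, 1, 1971, 12, 2015))
```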

Now I just need some time to play!

PS I’ve just popped a gist containing some python code that represents a first start at grabbing the data from the JSON feeds here: https://gist.github.com/psychemedia/ca7b981f2bbd45377b44

More Observations on the ONS JSON Feeds – Returning Bulletin Text as Data

Whilst starting to sketch out some python functions for grabbing the JSON data feeds from the new ONS website, I also started wondering how I might be able to make use of them in a simple slackbot that could provide a crude conversational interface to some of the ONS stats.

(To this end, it would also be handy to see some ONS search logs to see what sort of things folk search – and how they phrase their searches…)

One of the ways of using the data is as the basis for some simple data2text scripts that can report the outcomes of some simple canned analyses of the data (comparing the latest figures with those from the previous month, or a year ago, for example). But the ONS also produce commentary on various statistics via their statistical bulletins – and it seems that these, too, are available in JSON form simply by adding /data to the end of the URL path as before:

[Screenshot: JSON feed for the UK Labour Market statistical bulletin]

One thing to note is that whilst the HTML view of bulletins can include a name element to focus the page on a particular element:

http://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/bulletins/uklabourmarket/february2016/#comparison-between-unemployment-and-the-claimant-count

the name attribute switch doesn’t work to filter the JSON output to that element (though it would be easy enough to script a JSON handler to return that focus) so there’s no point adding it to the JSON feed URL:

http://www.ons.gov.uk/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/bulletins/uklabourmarket/february2016/data
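A JSON handler that returns just the focused section could be sketched along these lines. It assumes the fragment is a slugified version of the section title (lower-cased, words joined with hyphens), which fits the example URL but is not confirmed; the second section title in the sample feed is inferred from that fragment, and slugify and section_for_fragment are made-up names.

```python
import re


def slugify(title):
    """Lower-case a title and join its alphanumeric runs with hyphens."""
    return "-".join(re.findall(r"[a-z0-9]+", title.lower()))


def section_for_fragment(feed, fragment):
    """Return the section whose slugified title matches the URL fragment, if any."""
    wanted = fragment.lstrip("#")
    for section in feed.get("sections", []):
        if slugify(section.get("title", "")) == wanted:
            return section
    return None


# Cut-down stand-in for a bulletin feed; the markdown values are placeholders.
feed = {
    "sections": [
        {"title": "Summary of latest labour market statistics", "markdown": "..."},
        {"title": "Comparison between unemployment and the claimant count",
         "markdown": "..."},
    ]
}

section = section_for_fragment(
    feed, "#comparison-between-unemployment-and-the-claimant-count")
print(section["title"])
```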

One other thing to note about the JSON feed is that it contains cross-linked elements for items such as charts and tables. If you look closely at the above screenshot, you’ll see it contains a reference to an ons-table.

...
"sections": [
    ...
    {
        "title": "Summary of latest labour market statistics",
        "markdown": "Table A shows the latest estimates, for October to December 2015, for employment, unemployment and economic inactivity. It shows how these estimates compare with the previous quarter (July to September 2015) and the previous year (October to December 2014). Comparing October to December 2015 with July to September 2015 provides the most robust short-term comparison. Making comparisons with earlier data at Section (ii) has more information. <ons-table path=\"cea716cc\" /> Figure A shows a more detailed breakdown of the labour market for October to December 2015. <ons-image path=\"718d6bbc\" />"
    },
    ...
]
...

This resource is then described in detail elsewhere in the data feed linked by the same ID value:

[Screenshot: the tables element of the bulletin’s JSON feed]

...
"tables": [
    {
        "title": "Table A: Summary of UK labour market statistics for October to December 2015, seasonally adjusted",
        "filename": "cea716cc",
        "uri": "/employmentandlabourmarket/peopleinwork/employmentandemployeetypes/bulletins/uklabourmarket/february2016/cea716cc"
    }
],
...

Images are identified via the ons-image tag, charts via the ons-chart tag, and so on.
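Following the cross-links from a section's text to the resource descriptions might be sketched as below. The tag pattern and the tables/filename lookup follow the excerpts above; the idea that an ons-image tag resolves against an "images" list (tag kind plus "s") is an assumption, and linked_resources is a hypothetical helper. The sample feed is a cut-down stand-in with shortened text.

```python
import re

# Matches tags such as <ons-table path="cea716cc" />
ONS_TAG = re.compile(r'<ons-(\w+)\s+path="([^"]+)"\s*/>')


def linked_resources(section, feed):
    """Pair each ons-* tag in a section with its description elsewhere in the feed."""
    found = []
    for kind, path in ONS_TAG.findall(section["markdown"]):
        # Assumption: <ons-table> items are listed under "tables",
        # <ons-image> items under "images", and so on.
        candidates = feed.get(kind + "s", [])
        match = next(
            (item for item in candidates if item.get("filename") == path), None)
        found.append((kind, path, match))
    return found


feed = {
    "sections": [
        {
            "title": "Summary of latest labour market statistics",
            "markdown": 'Table A shows the latest estimates. '
                        '<ons-table path="cea716cc" /> Figure A shows a breakdown. '
                        '<ons-image path="718d6bbc" />',
        }
    ],
    "tables": [
        {
            "title": "Table A: Summary of UK labour market statistics",
            "filename": "cea716cc",
            "uri": "/employmentandlabourmarket/peopleinwork/"
                   "employmentandemployeetypes/bulletins/uklabourmarket/"
                   "february2016/cea716cc",
        }
    ],
}

links = linked_resources(feed["sections"][0], feed)
for kind, path, resource in links:
    print(kind, path, resource["uri"] if resource else "not found in this sample")
```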

So now I’m thinking – maybe this is the place to start thinking about a simple conversational UI? Something that can handle simple references into different parts of a bulletin, and return the ONS text as the response?

Chatting With ONS Data Via a Simple Slack Bot

A recent post on the ONS Digital blog – Dueling with datasets – describes some of the design decisions taken when putting together the new Office for National Statistics website (such as having a single page for a particular measure that would provide the current figures at the top as well as historical figures further down the page) and some of the challenges still facing the team (such as the language and titling used to describe the statistics).

The emphasis is still very much on publishing the data via a website, however, which promotes two particular sorts of interaction style: browse and search. Via Laura Dewis (Deputy Director, Digital Publishing at Office for National Statistics, and ex- of the OpenLearn parish), I got a peek at some of the popular search terms used on the pre-updated website, which suggest (to me) a mix of vernacular keyword search terms as well as official terms (for example, rpi, baby names, cpi, gdp, retail price index, population, Labour Market Statistics unemployment, inflation, labour force survey).

Over the last couple of years, regular readers will have noticed that I’ve been dabbling with some simple data2text conversions, as well as dipping my toes into some simple custom slackbots (that is, custom slack robots…) capable of responding to simple parameterised queries with texts automatically generated from online data sources (for example, querying the Nomis JSA figures as part of a Slackbot Data Wire, Initial Sketch or my First Steps in a Conversational Slackbot interface to CQC Inspection Data).

I’m still fumbling around how best to try to put these bots together. On the one hand, there’s trying to work out what sorts of thing we might want to ask of the data, as well as how we might actually ask for it in natural language terms. On the other, there’s generating queries over the data, and figuring out how to provide the response (creating a canned text around the results from a data query).
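The "canned text around the results from a data query" half of the problem can be very small. A sketch, assuming the query result has already been boiled down to a list of (period, value) pairs; the figures are invented and describe_change is a made-up name.

```python
def describe_change(name, series, periods_back=1, unit=""):
    """Describe the latest value of a series relative to an earlier one.

    series is a list of (period label, value) pairs, oldest first.
    """
    latest_period, latest = series[-1]
    earlier_period, earlier = series[-1 - periods_back]
    change = latest - earlier
    if change == 0:
        return (f"{name} was {latest}{unit} in {latest_period}, "
                f"unchanged from {earlier_period}.")
    direction = "up" if change > 0 else "down"
    return (f"{name} was {latest}{unit} in {latest_period}, {direction} "
            f"{abs(change):.1f} from {earlier}{unit} in {earlier_period}.")


# Invented figures, purely for illustration.
example = [("2014 Q4", 5.7), ("2015 Q3", 5.3), ("2015 Q4", 5.1)]
print(describe_change("The example rate", example, unit="%"))
print(describe_change("The example rate", example, periods_back=2, unit="%"))
```

Changing periods_back switches between a "previous period" and a "year ago" style comparison, depending on how the series is spaced.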

But what if there was already a ready source of text interpreting particular datasets that could be used as the response part of a conversational data agent? Then all we’d have to focus on would be parsing queries and matching them to the texts?

A couple of weeks ago, when the new ONS website came out of beta, the human facing web pages were complemented with a data view in the form of JSON feeds that mirrored the HTML text (I don’t know if the HTML is actually generated from the JSON feeds?), as described in More Observations on the ONS JSON Feeds – Returning Bulletin Text as Data. So here we have a ready source of data-interpreting text that we may be able to use to provide a backend to a conversational UI to the ONS content. (Whether or not the text is human generated or machine generated is irrelevant – though it does also provide a useful model for developing and testing my own data to text routines!)

So let’s see… it being too wet to go and dig the vegetable patch yesterday, I thought I’d have a quick play trying to put together some simple response rules, in part building on some of the ONS JSON parsing code I started putting together following the ONS website refresh.

Here’s a snapshot of where I’m at…

Firstly, asking for a summary of some popular recent figures:

[Screenshot: Slack bot responding with a summary of recent figures]

The latest figures are assumed for some common keyword driven queries. We can also ask for a chart:

[Screenshot: Slack bot responding with a chart]

The ONS publish different sorts of product that can be filtered against:

[Screenshot: ONS search results page for “rate”, showing the product type filters]

So for example, we can run a search to find what bulletins are available on a particular topic:

[Screenshot: Slack bot listing the bulletins found for a topic]

(For some reason, the markdown isn’t being interpreted as such?)

We can then go on to ask about a particular bulletin, and get the highlights from it:

[Screenshot: Slack bot returning the highlights from a bulletin]

(I did wonder about numbering the items in the list, retaining the state of the previous response in the bot, and then allowing an interaction along the lines of “tell me more about item 3”?)
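Retaining the previous response so that "item 3" can be resolved needs very little state. A sketch of the idea, independent of any Slack library; ListMemory is a made-up class, the titles and summaries are invented, and a real bot would presumably want one of these per channel or per user.

```python
import re


class ListMemory:
    """Remember the last numbered list sent so that 'item N' can be resolved."""

    def __init__(self):
        self.last_items = []

    def present(self, items):
        """Store the items and return them as a numbered list for display."""
        self.last_items = list(items)
        return "\n".join(
            f"{n}. {item['title']}" for n, item in enumerate(self.last_items, 1))

    def follow_up(self, message):
        """Answer 'tell me more about item N' from the stored list, if it applies."""
        match = re.search(r"\bitem\s+(\d+)\b", message, re.IGNORECASE)
        if not match:
            return None  # not a follow-up; let other rules handle it
        number = int(match.group(1))
        if not 1 <= number <= len(self.last_items):
            return f"I only listed {len(self.last_items)} items."
        return self.last_items[number - 1]["summary"]


memory = ListMemory()
print(memory.present([
    {"title": "First example bulletin", "summary": "Summary of the first."},
    {"title": "Second example bulletin", "summary": "Summary of the second."},
    {"title": "Third example bulletin", "summary": "Summary of the third."},
]))
print(memory.follow_up("tell me more about item 3"))
```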

We can also ask about other publication types, but I haven’t checked the JSON yet to see whether it makes sense to handle the response from those slightly differently:

[Screenshot: Slack bot responding to a query about another publication type]

At the moment, it’s all a bit Wizard of Oz, but it’s amazing how fluid you can be in writing queries that are matched by some very simple regular expressions:

[Screenshot: differently phrased queries matched by the bot’s regular expressions]
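To give a flavour of how far a handful of simple regular expressions can go, here is a sketch of a rule table that maps loosely phrased messages onto an intent and a topic. These are not the patterns used in the bot above; the intent names are invented, and the topic keywords are borrowed from the popular search terms mentioned earlier.

```python
import re

# (pattern, intent) pairs, tried in order; the first match wins.
RULES = [
    (re.compile(r"\b(?:chart|graph|plot)\b.*\b(?P<topic>cpi|rpi|gdp)\b", re.I),
     "chart"),
    (re.compile(r"\bbulletins?\b.*\b(?:on|about)\s+(?P<topic>.+?)\??$", re.I),
     "bulletin_search"),
    (re.compile(r"\b(?:what(?:'s| is)|tell me|show me)\b.*\b(?P<topic>cpi|rpi|gdp)\b",
                re.I),
     "summary"),
]


def route(message):
    """Return (intent, topic) for the first rule that matches, else (None, None)."""
    text = message.strip()
    for pattern, intent in RULES:
        match = pattern.search(text)
        if match:
            return intent, match.group("topic").lower()
    return None, None


for message in [
    "Can I see a chart of CPI?",
    "What bulletins are there about unemployment?",
    "What is the latest GDP figure",
    "hello",
]:
    print(message, "->", route(message))
```

Rule order matters here: the chart rule is tried before the more general summary rule, so a request for a chart of CPI is not answered with a plain summary.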

So not bad for an hour or two’s play… Next steps would require getting a better idea about what sorts of conversation folk might want to have with the data, and what they actually expect to see in return. For example, it would be possible to mix in links to datafiles, or perhaps even upload datafiles to the slack channel?

PS Hmm, thinks.. what would a slack interface to a Jupyter server be like…?