Opening Up Access to Data: Why APIs May Not Be Enough…
Last week, a post on the ONS (Office of National Statistics) Digital Publishing blog caught my eye: Introducing the New Improved ONS API which apparently “mak[es] things much easier to work with”.
Ooh… exciting…. maybe I can use this to start hacking together some notebooks?:-)
It was followed a few days later by this one – ONS-API, Just the Numbers which described “a simple bit of code for requesting some data and then turning that into ‘just the raw numbers’” – a blog post that describes how to get a simple statistic, as a number, from the API. The API that “mak[es] things much easier to work with”.
After a few hours spent hacking away over the weekend, looking round various bits of the API, I still wasn’t really in a position to discover where to find the numbers, let alone get numbers out of the API in a reliable way. (You can see my fumblings here.) Note that I’m happy to be told I’m going about this completely the wrong way and didn’t find the baby steps guide I need to help me use it properly.
So FWIW, here are some reflections, from a personal standpoint, about the whole API thing from the perspective of someone who couldn’t get it together enough to get the thing working …
Most data users aren’t programmers. And I’m not sure how many programmers are data junkies, let alone statisticians and data analysts.
For data users who do dabble with programming – in R, for example, or python (for example, using the pandas library) – the offer of an API is often seen as providing a way of interrogating a data source and getting the bits of data you want. The alternative to this is often having to download a huge great dataset yourself and then querying it or partitioning it yourself to get just the data elements you want to make use of (for example, Working With Large Text Files – Finding UK Companies by Postcode or Business Area).
That’s fine, insofar as it goes, but it starts to give the person who wants to do some data analysis a data management problem too. And for data users who aren’t happy working with gigabyte data files, it can sometimes be a blocker. (Big file downloads also take time, and incur bandwidth costs.)
For me, a stereotypical data user might be someone who typically wants to be able to quickly and easily get just the data they want from the API into a data representation that is native to the environment they are working in, and that they are familiar with working with.
This might be a spreadsheet user or it might be a code (R, pandas etc) user.
In the same way that spreadsheet users want files in XLS or CSV format that they can easily open, (formats that can be also be directly opened into appropriate data structures in R or pandas), I increasingly look not for APIs, but for API wrappers, that bring API calls and the results from them directly into the environment I’m working in in a form appropriate to that environment.
So for example, in R, I make use of the FAOstat package, which also offers an interface to the World Bank Indicators datasets. In pandas, a remote data access handler for the World Bank Indicators portal allows me to make simple requests for that data.
At a level up (or should that be “down”?) from the API wrapper are libraries that parse typical response formats. For example, Statistics Norway seem to publish data using the json-stat format, the format used in the new ONS API update. This IPython notebook shows how to use the pyjstat python package to parse the json-stat data directly into a pandas dataframe (I couldn’t get it to work with the ONS data feed – not sure if the problem was me, the package, or the data feed; which is another problem – working out where the problem is…). For parsing data returned from SPARQL Linked Data endpoints, packages such as SPARQLwrapper get the data into Python dicts, if not pandas dataframes directly. (A SPARQL i/o wrapper for pandas could be quite handy?)
At the user level, IPython Notebooks (my current ‘can be used to solve all known problems’ piece of magic tech!;-) provide a great way of demonstrating not just how to get started with an API, but also encourage the development within the notebook or reusable components, as well as demonstrations of how to use the data. The latter demonstrations have the benefit of requiring that the API demo does actually get the data into a form that is useable within the environment. It also helps folk see what it means to be able to get data into the environment (it means you can do things like the things done in the demo…; and if you can do that, then you can probably also do other related things…)
So am I happy when I see APIs announced? Yes and no… I’m more interested in having API wrappers available within my data wrangling environment. If that’s a fully blown wrapper, great. If that sort of wrapper isn’t available, but I can use a standard data feed parsing library to parse results pulled from easily generated RESTful URLs, I can just about work out how to create the URLs, so that’s not too bad either.
When publishing APIs, it’s worth considering who can address them and use them. Just because you publish a data API doesn’t mean a data analyst can necessarily use the data, because they may not be (are likely not to be) a programmer. And if ten, or a hundred, or a thousand potential data users all have to implement the same sort of glue code to get the data from the API into the same sort of analysis environment, that’s not necessarily efficient either. (Data users may feel they can hack some code to get the data from the API into the environment for their particular use case, but may not be willing to release it as a general, tested and robust API wrapper, certainly not a stable production level one.)
This isn’t meant to be a slight against the ONS API, more a reflection on some of the things I was thinking as I hacked my weekend away…
PS I don’t know how easy it is to run Python code in R, but the R magic in IPython notebooks supports the running of R code within a notebook running a Python kernel, with the handing over of data from R dataframes to python dataframes. Which is to say, if there’s an R package available, for someone who can run R via an IPython context, it’s available via python too.
PPS I notice that from some of the ONS API calls we can get links to URLs of downloadable datasets (though when I tried some of them, I got errors trying to unzip the results). This provides an intermediate way of providing API access to a dataset – search based API calls that allow discovery of a dataset, then the download and automatic unpacking of that dataset into a native data representation, such as one or more data frames.