Archive for the ‘Thinkses’ Category
Trying to clear my head of code on a dog walk after a couple of days tinkering with the nomis API, I started to ponder what an API is good for.
Chris Gutteridge and Alex Dutton’s open data excuses bingo card and Owen Boswarva’s Open Data Publishing Decision Tree both suggest that not having an API can be used as an excuse for not publishing a dataset as open data.
So what is an API good for?
I think one naive view is that this is what an API gets you…
It doesn’t of course, because folk actually want this…
Which is not necessarily that easy even with an API:
For a variety of reasons…
Even when the discovery part is done and you think you have a reasonable idea of how to call the API to get the data you want out of it, you’re still faced with the very real practical problem of how to actually get the data in to the analysis environment in a form that is workable in that environment. Just because you publish standards-based SDMX flavoured XML doesn’t mean anything to anybody if they haven’t got an “import from SDMX flavoured XML directly into some format I know how to work with” option.
And even then, once the data is in, the problems aren’t over…
(I’m assuming the data is relatively clean and doesn’t need any significant amount of cleaning, normalising, standardising, type-casting, date parsing etc etc. Which is of course likely to be a nonsense assumption;-)
So what is an API good for, and where does it actually exist?
I’m starting to think that for many users, if there isn’t some equivalent of an “import from X” option in the tool they are using or environment they’re working in, then the API-for-X is not really doing much of a useful job for them.
Also, if there isn’t a discovery tool they can use from the tool they’re using or environment they’re working in, then finding data from service X turns into another chore that takes them out of their analysis context and essentially means that the API-for-X is not really doing much of a useful job for them.
What I tried to do in doodling the Python / pandas Remote Data Access Wrapper for the Nomis API for myself was to create some tools that would help me discover various datasets on the nomis platform from my working environment – an IPython notebook – and then fetch any data I wanted from the platform into that environment in a form in which I could immediately start working with it – which is to say, typically as a pandas dataframe.
I haven’t started trying to use it properly yet – and won’t get a chance to for a week or so at least now – but that was the idea. That is, the wrapper should support the discovery and access parts of the conversation I want to have with the nomis data from within my chosen environment. That’s what I want from an API. Maybe?!;-)
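As a rough illustration of the "access" half of that conversation, here is a minimal sketch rather than the actual wrapper. The base URL and the `.data.csv` path layout are assumptions about how the nomis API is laid out and should be checked against the nomis documentation; `EXAMPLE_ID`, `SOME_CODE`, the function names and the sample CSV text are all made up for illustration.

```python
import csv
import io
from urllib.parse import urlencode

# ASSUMPTION: illustrative base URL and path layout, not checked against
# the nomis documentation.
NOMIS_BASE = "https://www.nomisweb.co.uk/api/v01/dataset"


def build_data_url(dataset_id, **params):
    """Build a CSV data URL for one dataset, using keyword args as filters."""
    query = urlencode(sorted(params.items()))
    url = f"{NOMIS_BASE}/{dataset_id}.data.csv"
    return f"{url}?{query}" if query else url


def rows_from_csv(text):
    """Turn CSV text into a list of dicts, one per row, keyed by column name."""
    return list(csv.DictReader(io.StringIO(text)))


# Hypothetical dataset id and filter values, purely for illustration.
url = build_data_url("EXAMPLE_ID", geography="SOME_CODE", time="latest")

# Made-up CSV text standing in for a response body.
sample = "GEOGRAPHY_NAME,OBS_VALUE\nPlace A,10\nPlace B,20\n"
rows = rows_from_csv(sample)
print(url)
print(rows)
```

In a notebook with pandas to hand, a URL like this could be passed straight to `pandas.read_csv`, which accepts URLs as well as file paths, so the data would land as a dataframe in one step.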
And note further – this does not mean things like a pandas Remote Data Access plugin or a CRAN package for R (such as the World Bank Development Indicators package or any of the other data/API packages referenced from the rOpenSci packages list) should be seen as extensions of the API. At worst, they should be seen as projections of the API into user environments. At best, it is those packages that should be seen as the actual API.
APIs for users – not programmers. That’s what I want from an API.
PS See also this response from @apievangelist: The API Journey.
A couple of days ago, I came across a dataset on figshare (a data sharing site) detailing the article processing charges (APCs) paid by the University of Portsmouth to publishers in 2014. After I casually (lazily…;-) remarked on the existence of this dataset via Twitter, Owen Stephens/@ostephens referred me to a JISC project that is looking at APCs in more detail, with prototype data explorer here: All APC demonstrator [Github repository].
The project looks as if it is part of Jisc Collections’ look at the Total Cost of Ownership in the context of academic publishing, summing things like journal subscription fees alongside “article processing charges” (which I’d hope include page charges?).
If you aren’t in academia, you may not realise that what used to be referred to as ‘vanity publishing’ (paying to get your first novel or poetry collection published) is part of the everyday practice of academic publishing. But it isn’t called that, obviously, because your work also has to be peer reviewed by other academics… So it’s different. It’s “quality publishing”.
Peer review is, in part, where academics take on the ownership of the quality aspects of academic publishing, so if the Total Cost of Ownership project is trying to be relevant to institutions and not just to JISC, I wonder if there should also be columns in the costing spreadsheet relating to the work time academics spend reviewing other people’s articles, editing journals, and so on. This is different to the presentational costs, obviously, because you can’t just write a paper and submit it, you have to submit it in an appropriately formatted document and “camera ready” layout, which can also add a significant amount of time to preparing a paper for publication. So you do the copy editing and layout too. And so any total costing to an academic institution of the research publishing racket should probably include this time too. But that’s by the by.
The data that underpins the demonstrator application was sourced from a variety of universities and submitted in spreadsheet form. A useful description (again via @ostephens) of the data model can be found here: APC Aggregation: Data Model and Analytical Usage. Looking at it, it just seems to cover APCs.
APC data relating to the project can be found on figshare. I haven’t poked around in the demonstrator code or watched its http traffic to see if there are API calls on to the aggregated data that provide another way in to it.
As well as page charges, there are charges associated with subscription fees to publishers. Publishers don’t like this information getting out on grounds of commercial sensitivity, and universities don’t like publishing it presumably on grounds of bringing themselves into disrepute (you spend how much?!), but there is some information out there. Data from a set of FOI requests about journal subscriptions (summarised here), for example. If you want to wade through some of the raw FOI responses yourself, have a look on WhatDoTheyKnow: FOI requests: “journal costs”.
Tim Gowers also wrote compellingly about his FOI escapades trying to track down journal subscription costs data: Elsevier journals – some facts.
This is all very well, but is it in any way useful? I have no idea. One thing I imagined that might be quite amusing to explore was the extent to which journal subscriptions paid their way (or were “cost effective”). For example, looking at institutional logs, how often are (articles from) particular journals being accessed or downloaded either for teaching or research purposes? (Crudely: teaching – access comes from a student account; research – access from a research account.) On the other hand, for the research outputs of the institution, how many things are being published into a particular journal, and how many citations appear in those outputs to other publications?
Suppose we take the line that use demonstrates value, and that use is captured as downloads, publications into, or references into. (That’s very crude, but then I’m approaching this as a possible recreational data exercise, not a piece of formal research. And yes – I know, journals are often bundled up in subscription packages together, and just like Sky blends dross with desirable channels in its subscription deals, I suspect academic publishers do too… But then, we could start to check these based on whether particular journals in a bundle are ever accessed, ever referenced, ever published into within a particular organisation, etc. Citation analysis can also help here – for example, if 5 journals all heavily cite each other, and one publisher publishes 3 of those, it could make sense for them to bundle the journals two into one package and the third into another, so if you’re researching topics that are reported by heavily linked articles across those journals, you can essentially force people researching that topic into subscribing to both packages. Without having a look at citation network analyses and subscription bundles, I can’t check that outlandish claim of course;-)
Erm… that’s it…
PS see also Evaluating big deal journal bundles (via @kpfssport)
PPS for a view from the publishers’ side on the very real costs associated with publishing, as well as a view on how academia and business treat employment costs and “real” costs in rather contrasting ways, see Time is Money: Why Scholarly Communication Can’t Be Free.
Chasing the thought of Frictionless Data Analysis – Trying to Clarify My Thoughts, I wonder: how about if, in addition to the datapackage.json specification, there was a data analysis package or data analysis toolkit package specification? Perhaps the latter might be something that unpacks rather like the fig.yml file described in Using Docker to Build Linked Container Course VMs, and the former a combination of a datapackage and a data analysis toolkit package, that downloads a datapackage and opens it into a toolkit configuration specified by the data analysis toolkit package. We’d perhaps also want to be able to define a set of data analysis scripts (data analysis script package???) relevant to working with a particular datapackage in the specified tools (for example, some baseline IPython notebooks or R/Rmd scripts?)
Prompted by a conversation with Rufus Pollock over lunch today, in part about data containerisation and the notion of “frictionless” data that can be easily discovered and is packaged along with metadata that helps you to import it into other tools or applications (such as a database), I’ve been confusing myself about what it might be like to have a frictionless data analysis working environment, where I could do something like write fda --datapackage http://example.com/DATAPACKAGE --db postgres --client rstudio ipynb and that would then:
- generate a fig script (eg as per something like Using Docker to Build Linked Container Course VMs);
- download the data package from the specified URL, unbundle it, generate an SQL init file appropriate to the database specified, fire up the database and use the generated SQL file to configure the database by creating any necessary tables and loading the data in;
- fire up any specified client applications (IPython notebook and RStudio server in this example) and ideally seed them with SQL magic or database connection statements, for example, that automatically define an appropriate data connection to the database that’s just been configured;
- launch browser tabs that contain the clients;
- it might also be handy to be able to mount local directories against directory paths in the client applications, so I could have my R scripts in one directory of my own desktop, IPython notebooks in another, and then have visibility of those analysis scripts from the actual client applications.
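The "create an SQL file from the data package" step in the list above might be sketched along the following lines. This is a minimal sketch under stated assumptions: the embedded datapackage.json fragment is a made-up example with a single resource, the type mapping to PostgreSQL column types is a rough guess, and the function names are hypothetical. A server-side `COPY` would also need a path the database server can actually read.

```python
import json

# ASSUMPTION: a made-up, minimal datapackage.json with one tabular resource.
DATAPACKAGE = json.loads("""
{
  "name": "example-package",
  "resources": [
    {
      "name": "observations",
      "path": "data/observations.csv",
      "schema": {
        "fields": [
          {"name": "area", "type": "string"},
          {"name": "year", "type": "integer"},
          {"name": "value", "type": "number"}
        ]
      }
    }
  ]
}
""")

# ASSUMPTION: a rough mapping from schema field types to PostgreSQL types.
PG_TYPES = {
    "string": "TEXT",
    "integer": "INTEGER",
    "number": "NUMERIC",
    "boolean": "BOOLEAN",
    "date": "DATE",
}


def create_table_sql(resource):
    """Generate a CREATE TABLE statement from a resource's schema fields."""
    columns = []
    for field in resource["schema"]["fields"]:
        pg_type = PG_TYPES.get(field.get("type", "string"), "TEXT")
        columns.append('  "{}" {}'.format(field["name"], pg_type))
    return 'CREATE TABLE "{}" (\n{}\n);'.format(
        resource["name"], ",\n".join(columns)
    )


def copy_sql(resource):
    """Generate a COPY statement that loads the resource's CSV file."""
    return "COPY \"{}\" FROM '{}' WITH (FORMAT csv, HEADER true);".format(
        resource["name"], resource["path"]
    )


resource = DATAPACKAGE["resources"][0]
init_sql = create_table_sql(resource) + "\n" + copy_sql(resource) + "\n"
print(init_sql)
```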
The idea is that from a single command I can pull down a datafile, ingest it into a database, fire up one or more clients that are connected to that database, and start working with the data immediately. It’s not so different to double clicking on a file on your desktop and launching it into an application to start working on it, right?!
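To make the single command idea a little more concrete, here is a sketch of how the front end of such an `fda` command might parse its arguments and emit a fig.yml style configuration. Everything here is hypothetical: no such tool exists as far as this post is concerned, the image names and ports are placeholder assumptions, and the generated file is only a guess at the sort of thing fig would want.

```python
import argparse

# ASSUMPTION: placeholder image names and ports for each supported client.
CLIENTS = {
    "rstudio": {"image": "rocker/rstudio", "port": 8787},
    "ipynb": {"image": "ipython/notebook", "port": 8888},
}
# ASSUMPTION: placeholder image name for each supported database.
DATABASES = {"postgres": {"image": "postgres"}}


def parse_args(argv):
    """Parse the hypothetical fda command line."""
    parser = argparse.ArgumentParser(prog="fda")
    parser.add_argument("--datapackage", required=True,
                        help="URL of the data package to download")
    parser.add_argument("--db", choices=sorted(DATABASES), default="postgres")
    parser.add_argument("--client", nargs="+", choices=sorted(CLIENTS),
                        default=["ipynb"])
    return parser.parse_args(argv)


def fig_config(args):
    """Render a fig.yml style config: one db container, linked clients."""
    lines = ["db:", "  image: {}".format(DATABASES[args.db]["image"])]
    for name in args.client:
        port = CLIENTS[name]["port"]
        lines += [
            "{}:".format(name),
            "  image: {}".format(CLIENTS[name]["image"]),
            "  ports:",
            '    - "{0}:{0}"'.format(port),
            "  links:",
            "    - db",
        ]
    return "\n".join(lines) + "\n"


args = parse_args([
    "--datapackage", "http://example.com/DATAPACKAGE",
    "--db", "postgres",
    "--client", "rstudio", "ipynb",
])
config = fig_config(args)
print(config)
```

The downloading, database loading and browser launching steps are left out; the point is only that the command line maps fairly directly onto a container configuration.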
Can’t be that hard to wire up, surely?!;-) But would it be useful?
PS See also a further riff on this idea: Data Analysis Packages…?
I’ve been in a ranty mood all day today, so to finish it off, here are some thoughts about how we can start to use #opendata to hold companies to account. The trigger was finding a dataset released by the Care Quality Commission (CQC) listing the locations of premises registered with the CQC, and the operating companies of those locations (early observations on that data here).
The information is useful because it provides a way of generating aggregated lists of companies that are part of the same corporate group (for example, locations operated by Virgin Care companies, or companies operated by Care UK). When we have these aggregation lists, it means we can start to run the numbers across all the companies in a corporate group, and get some data back about how the companies that are part of a group are operating in general. The aggregated lists thus provide a basis for looking at the gross behaviour of a particular company. We can then start to run league tables against these companies (folk love league tables, right? At least, they do when it comes to public sector bashing). So we can start to see how the corporate groupings compare against each other, and perhaps also against public providers. Of course, there is a chance that the private groups will be shown to be performing better than public sector bodies, but that could be a useful basis for a productive conversation about why…
So what sorts of aggregate lists can we start to construct? The CQC data allows us to get lists of locations associated with various sorts of care delivery (care home, GP services, dentistry, more specialist services) and identify locations that are part of the same corporate group. For example, I notice that filtering the CQC data to care homes, the following are significant operators (the number relates to the number of locations they operate):
Voyage 1 Limited 273
HC-One Limited 169
Barchester Healthcare Homes Limited 168
When it comes to “brands”, we have the following multiple operators:
BRAND Four Seasons Group 346
BRAND Voyage 279
BRAND BUPA Group 246
BRAND Priory Group 183
BRAND HC-One Limited 169
BRAND Barchester Healthcare 168
BRAND Care UK 130
BRAND Caretech Community Services 118
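Counts like the ones above come from filtering the locations to one type of service and then counting rows per operator or per brand. A minimal sketch of that counting follows; the rows are toy data with placeholder names, and the field names (`type`, `provider`, `brand`) are made up for illustration rather than being the actual CQC column headings.

```python
from collections import Counter

# ASSUMPTION: toy rows with placeholder names; not real CQC data, and the
# field names are not the actual CQC column headings.
LOCATIONS = [
    {"type": "Care home", "provider": "Operator A Limited", "brand": "Brand X"},
    {"type": "Care home", "provider": "Operator A Limited", "brand": "Brand X"},
    {"type": "Care home", "provider": "Operator B Limited", "brand": "Brand X"},
    {"type": "Care home", "provider": "Operator C Limited", "brand": ""},
    {"type": "Dentist", "provider": "Operator D Limited", "brand": "Brand Y"},
]


def count_locations(rows, key, location_type):
    """Count locations of one type per value of `key`, skipping blank values."""
    return Counter(
        row[key] for row in rows
        if row["type"] == location_type and row[key]
    )


by_provider = count_locations(LOCATIONS, "provider", "Care home")
by_brand = count_locations(LOCATIONS, "brand", "Care home")

# Largest operators first, in the same spirit as the lists above.
for name, n in by_provider.most_common():
    print(name, n)
for name, n in by_brand.most_common():
    print("BRAND", name, n)
```

Note how two differently named operating companies roll up under one brand in the toy data, which is the reason the brand level count can be larger than any single operator's.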
For these operators, we could start to scrape their most recent CQC reports and build up a picture of how well the group as a whole is operating. In the same way that “armchair auditors” (whatever they are?!) are supposed to be able to hold local councils to account, perhaps they can do the same for companies, and give the directors a helping hand… (I would love to see open data activists buying a share and going along to a company shareholder meeting to give some opendata powered grief ;-)
Other public quality data sites provide us with hints at ways of generating additional aggregations. For example, from the Food Standards Agency, we can search on ‘McDonalds’ as a restaurant to bootstrap a search into premises operated by that company (although we’d probably also need to add in searches across takeaways, and perhaps also look for things like ‘McDonalds Ltd’ to catch more of them?).
Note – the CQC data provides a possible steer here for how other data sets might be usefully extended in terms of the data they make available. For example, having a field for “operating company” or “brand” would make for more effective searches across branded or operated food establishments. Having company number (for limited companies and LLPs etc) provided would also be useful for disambiguation purposes.
Hmm, I wonder – would it make sense to start to identify the information that makes registers useful, and that we should start to keep tabs on? We could then perhaps start lobbying for companies to provide that data, and check that such data is being and continues to be collected? It may not be a register of beneficial ownership, but it would provide handy cribs for trying to establish what companies are part of a corporate grouping…
(By the by, picking up on Owen Boswarva’s post The UK National Information Infrastructure: It’s time for the private sector to release some open data too, these registers provide a proxy for the companies releasing certain sorts of data. For example, we can search for ‘Tesco’ as a supermarket on the FSA site. Of course, if companies were also obliged to publish information about their outlets as open data – something you could argue that as a public company they should be required to do, trading their limited liability for open information about where they might exert that right – we could start to run cross-checks (which is the sort of thing real auditors do, right?) and publish complete records of publicly accountable performance in terms of regulated quality inspections.)
The CQC and Food Standards Agency both operate quality inspection registers, so what other registers might we go to to build up a picture of how companies – particularly large corporate groupings – behave?
The Environment Agency publish several registers, including one detailing enforcement actions, which might be interesting to track, though I’m not sure how the data is licensed? The HSE (Health & Safety Executive) publish various notices by industry sector and subsector, but again, I’m not too clear on the licensing? The Chief Fire Officers Association (CFOA) publish a couple of enforcement registers which look as if they cover some of the same categories as the CQC data – though how easy it would be to reconcile the two registers, I don’t know (and again, I don’t know how the register is actually licensed). One thing to bear in mind is that where registers contain personally identifiable information, any aggregations we build that incorporate such data (if we are licensed to build such things) mean (I think) that we become data controllers for the purposes of the Data Protection Act (we are not the maintainers and publishers of the public register so we don’t benefit from the exemptions associated with that role).
Looking at the above, I’m starting to think it could be a really interesting exercise to pick some of the care home provider groups and have a go at aggregating any applicable quality scores and enforcement notices from the CQC, FSA, HSE and CFOA (and even the EA if any of their notices apply! Hmm… does any HSCIC data cover care homes at all too?) Coupled with this, a trawl of directors data to see how the separate companies in a group connect by virtue of directors (and what other companies may be indicated by common directors in a group?).
Other areas perhaps worth exploring – farms incorporated into agricultural groups? (Where would we find that data? One register that could be used to partially hold those locations to account may be the public register of pesticide enforcement notices as well as other EA notices?)
As well as registers, are there any other sources of information about companies we can add in to the mix? There’s lots: for limited companies we can pull down company registration details and lists of directors (and perhaps struck off directors) and some accounting information. Data about charities should be available from the Charity Commission. The HSCIC produces care quality indicators for a range of health providers, as well as prescribing data for individual GP practices. Data is also available about some of the medical trials that particular practices are involved in.
At a local council level, local councils maintain and publish a wide variety of registers, including registers of gaming machine licenses, licensed premises and so on. Where the premises are an outlet of a parent corporate group, we may be able to pick up the name of the parent group as the licensee. (Via @OwenBoswarva, it seems the Gambling Commission has a central list of operating license holders and licensed premises.)
Having identified influential corporate players, we might then look to see whether those same bodies are represented on lobbyist groups, such as the EU register of commission expert groups, or as benefactors of UK Parliamentary All Party groups, or as parties to meetings with Ministers etc.
We can also look across all those companies to see how much money the corporate groups are sinking from the public sector, by inspecting who payments are made to in the masses of transparency spending data that councils, government departments, and services such as the NHS publish. (For an example of this, see Spend Small Local Authority Spending Index; unfortunately, the bulk data you need to run this sort of analysis yourself is not openly available – you need to aggregate and clean it yourself.)
Once we start to get data that lists companies that are part of a group, we can start to aggregate open public data about all the companies in the group and look for patterns of behaviour within the groups, as well as across them. Lapses in one part of the group might suggest a weakness in high level management (useful for the financial analysts?), or act as a red flag for inspection and quality regimes.
Hmmm… methinks it’s time to start putting some of this open data to work; but put it to work by focussing on companies, rather than public bodies…
I think I also need to do a little bit of digging around how public registers are licensed? Should they all be licensed OGL by default? And what guidance, if any, is there around how we can make use of such data and not breach the Data Protection Act?
PS via @RDBinns, What do they know about me? Open data on how organisations use personal data, describing some of the things we can find from the data protection notifications published by the ICO [ICO data controller register].
Via Downes, I like this idea of Flipping Bloom’s Taxonomy Triangle which draws on the following inverted pyramid originally posted here: Simplified Bloom’s Taxonomy Visual and comments on a process in which “students are spending the majority of their time in the Creating and Evaluating levels of Bloom’s Taxonomy, and they go down into the lower levels to acquire the information they need when they need it” (from Jonathan Bergmann and Aaron Sams’ Flip Your Classroom: Reach Every Student In Every Class Every Day, perhaps?)
Here’s another example, from a blog post by education consultant Scott Mcleod: Do students need to learn lower-level factual and procedural knowledge before they can do higher-order thinking?, or this one by teacher Shelley Wright: Flipping Bloom’s Taxonomy.
This makes some sort of sense to me, though if you (mistakenly?) insist on reading it as a linear process it lacks the constructivist context that shows how some knowledge and understanding can be used to inform the practice of the playful creating/evaluating/analysing exploratory layer, which might in itself be directed at trying to illuminate a misunderstanding or confusion the learner has with respect to their own knowledge at the understanding level. (In fact, the more I look at any model the more issues I tend to get with it when it comes to actually picking it apart!;-)
As far as “remembering” goes, I think that also includes “making up plausible stories or examples” – i.e. constructed “rememberings” (that is, stories) of things that never happened.
[Thinkses in progress – riffing around the idea that transparency is not reporting. This is all a bit confused atm…]
UK Health Secretary Jeremy Hunt was on BBC Radio 4’s Today programme today talking about a new “open and honest reporting culture” for UK hospitals. Transparency, it seems, is about publishing open data, or at least, putting crappy league tables onto websites. I think: not….
The fact that a hospital has “a number” of mistakes may or may not be interesting. As with most statistics, there is little actual information in a single number. As the refrain on the OU/BBC co-produced numbers programme More or Less goes, ‘is it a big number or a small number?’. The information typically lies in the comparison with other numbers, either across time or across different entities (for example, comparing figures across hospitals). But comparisons may also be loaded. For a fair comparison we need to normalise numbers – that is, we need to put them on the same footing.
[A tweet from @kdnuggets comments: ‘The question to ask is not – “is it a big number or a small number?”, but how it compares with other numbers’. The sense of the above is that such a comparison is always essential. A score of 9.5 in a test is a large number when the marks are out of ten, a small one when out of one hundred. Hence the need for normalisation, or some other basis for generating a comparison.]
The above cartoon from web comic XKCD illustrates this well, with a comment about how reporting raw numbers on a map often tends to just produce a population map. If town A, with a population of 1 million, has a causal incidence [I made that phrase up: I mean, the town somehow causes the incidence of X at that rate] of some horrible X of 1% (that is, 10,000 people get it as a result of living in town A), and town B, with a population of 50,000, has a causal incidence of 10% (that is, 5,000 people get X), a simple numbers map would make you fearful of living in town A, but you’d likely be worse off moving to town B.
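The town A / town B arithmetic can be checked with a few lines of code. This is just a sketch of the normalisation step, using the made-up figures from the paragraph above; the function name is my own.

```python
# Figures from the made-up town A / town B example above.
TOWNS = {
    "A": {"population": 1_000_000, "cases": 10_000},
    "B": {"population": 50_000, "cases": 5_000},
}


def rate_per_100(cases, population):
    """Normalise a raw count to a rate per 100 people (a percentage)."""
    return 100 * cases / population


rates = {
    name: rate_per_100(town["cases"], town["population"])
    for name, town in TOWNS.items()
}

# Ranking by raw count and ranking by rate give opposite answers.
worst_by_count = max(TOWNS, key=lambda name: TOWNS[name]["cases"])
worst_by_rate = max(rates, key=rates.get)

print(rates)
print("most cases:", worst_by_count, "| highest rate:", worst_by_rate)
```

Town A has twice as many cases, but town B has ten times the rate, which is the comparison that matters if you are deciding where to live.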
Sometimes a single number may appear to be meaningful. I have £2.73 in my pocket so I have £2.73 to spend when I go to the beach. But again, there is a need for comparison here. £2.73 needs to be compared against the price of things it can purchase to inform my purchasing decisions.
In the opendata world, it seems that just publishing numbers is taken as transparency. But that’s largely meaningless. Even being able to compare numbers year on year, or month on month, or hospital on hospital, is largely meaningless, even if those comparisons can be suitably normalised. It’s largely meaningless because it doesn’t help me make sense of the “so what?” question.
Transparency comes from seeing how those numbers are used to support decision making. Transparency comes from seeing how this number was used to inform that decision, and why it influenced the decision in that way.
Transparency comes from unpacking the decisions that are “evidenced” by the opendata, or other data not open, or no data at all, just whim (or bad policy).
Suppose a local council spends £x thousands on an out-of-area placement several hundred miles away. This may or may not be expensive. We can perhaps look at other placement spends and see that the one hundreds of miles away appears to offer good value for money (it looks cheap compared to other placements; which maybe begs the question why those other placements are being used if pure cost is a factor). The transparency comes from knowing how the open data contributed to the decision. In many cases, it will be impossible to be fully transparent (i.e. to fully justify a decision based on opendata) because there will be other factors involved, such as a consideration of sensitive personal data (clinical decisions based around medical factors, for example).
So what that there are z mistakes in a hospital, for league table purposes – although one thing I might care about is how z is normalised to provide a basis of comparison with other hospitals in a league table. Because league tables, sort orders, and normalisation make the data political. On the other hand – maybe I absolutely do want to know the number z – and why is it that number? (Why is it not z/2 or 2*z? By what process did z come into being? (We have to accept, unfortunately, that systems tend to incur errors. Unless we introduce self-correcting processes. I absolutely loved the idea of error-correcting codes when I was first introduced to them!) And knowing z, how does that inform the decision making of the hospital? What happens as a result of z? Would the same response be prompted if the number was z-1, or z/2? Would a different response be in order if the number was z+1, or would nothing change until it hit z*2? In this case the “comparison” comes from comparing the different decisions that would result from the number being different, or the different decisions that can be made given a particular number. The meaning of the number then becomes aligned to the different decisions that are taken for different values of that number. The number becomes meaningful in relation to the threshold values that the variable corresponding to that number are set at when it comes to triggering decisions.)
Transparency comes not from publishing open data, but from being open about decision making processes and possibly the threshold values or rates of change in indicators that prompt decisions. In many cases the detail of the decision may not be fully open for very good reason, in which case we need to trust the process. Which means understanding the factors involved in the process. Which may in part be “evidenced” through open data.
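One way to picture a number becoming meaningful through the decisions it triggers is a toy decision rule. The thresholds and responses below are entirely hypothetical, invented to illustrate the idea, and do not describe how any real hospital or regulator works.

```python
# ASSUMPTION: hypothetical thresholds and responses, for illustration only.
# Each pair is (minimum value of z that triggers the response, response).
THRESHOLDS = [
    (0, "no action"),
    (5, "internal review"),
    (10, "external inspection"),
]


def decision_for(z):
    """Return the response triggered by a count z under the toy thresholds."""
    chosen = THRESHOLDS[0][1]
    for minimum, response in THRESHOLDS:
        if z >= minimum:
            chosen = response
    return chosen


z = 7
for label, value in [("z/2", z / 2), ("z-1", z - 1), ("z", z),
                     ("z+1", z + 1), ("z*2", z * 2)]:
    print(label, value, "->", decision_for(value))
```

With these made-up thresholds, z-1, z and z+1 all prompt the same response, while z/2 and z*2 prompt different ones. Being open about the thresholds tells you more about what z means than the value of z does on its own.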
Going back to the out of area placement – the site hundreds of miles away may have been decided on by a local consideration, such as the “spot price” of the service provision. If financial considerations play a part in the decision making process behind making that placement, that’s useful to know. It might be unpalatable, but that’s the way the system works. But it begs the question – does the cost of servicing that placement (for example, local staff having to make round trips to that location, opportunity cost associated with not servicing more local needs incurred by the loss of time in meeting that requirement) also form part of the financial consideration made during the taking of that decision? The unit cost of attending a remote location for an intervention will inevitably be higher than attending a closer one.
If financial considerations are part of a decision, how “total” is the consideration of the costs?
That is a very real part of the transparency consideration. To a certain extent, I don’t care that it costs £x for spot provision y. But I do want to know that finance plays a part in the decision. And I also want to know how the finance consideration is put together. That’s where the transparency comes in. £50 quid for an iPhone? Brilliant. Dead cheap. Contract £50 per month for two years. OK – £50 quid. Brilliant. Or maybe £400 for an iPhone and a £10 monthly contract for a year. £400? You must be joking. £1250 or £520 total cost of ownership? What do you think? £50? Bargain. #ffs
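The iPhone sums work out as follows. This is a trivial sketch of the total cost of ownership calculation, using only the figures quoted above; the function name is my own.

```python
def total_cost(upfront, monthly, months):
    """Total cost of ownership: upfront price plus all contract payments."""
    return upfront + monthly * months


# £50 handset on a £50 per month contract for two years.
cheap_phone = total_cost(upfront=50, monthly=50, months=24)

# £400 handset on a £10 per month contract for one year.
dear_phone = total_cost(upfront=400, monthly=10, months=12)

print("£50 phone:", cheap_phone)
print("£400 phone:", dear_phone)
```

The handset that looks eight times cheaper up front ends up costing well over twice as much once the contract is counted.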
Transparency comes from knowing the factors involved in a decision. Transparency comes from knowing what data is available to support those decisions, and how the data is used to inform those decisions. In certain cases, we may be able to see some opendata to work through whether or not the evidence supports the decision based on the criteria that are claimed to be used as the basis for the decision making process. That’s just marking. That’s just checking the working.
The transparency bit comes from understanding the decision making process and the extent to which the data is being used to support it. Not the publication of the number 7 or the amount £43,125.26.
Reporting is not transparency. Transparency is knowing the process by which the reporting informs and influences decision making.
I’m not sure that “openness” of throughput is a good thing either. I’m not even sure that openness of process is a Good Thing (because then it can be gamed, and turned against the public sector by private enterprise). I’m not sure at all how transparency and openness relate? Or what “openness” actually applies to? The openness agenda creeps (as I guess I am proposing here in the context of “openness” around decision making) and I’m not sure that’s a good thing. I don’t think we have thought openness through and I’m not sure that it necessarily is such a Good Thing after all…
What I do think we need is more openness within organisations. Maybe that’s where self-correction can start to kick in, when the members of an organisation have access to its internal decision making procedures. Certainly this was one reason I favoured openness of OU content (eg Innovating from the Inside, Outside) – not for it to be open, per se, but because it meant I could actually discover it and make use of it, rather than it being siloed and hidden away from me in another part of the organisation, preventing me from using it elsewhere in the organisation.