- Get in a ghost writer to write your publications for you; inspired by the old practice of taking first authorship on work done by your research assistant/postgrad etc etc…
- Use your influence – tell your research assistant/postgrad/unpublished colleague that you’ll add your name to their publication and your chums in the peer review pool will let it through;
- Comment on everything anyone sends you or tells you – where possible, make changes inline to any document that anyone sends you (rather than commenting in the margins) – and make it so difficult to disentangle what you’ve added that they’re forced to give you an author credit. Alternatively, where possible, make structural changes to the organisation of a paper early on so that other authors think you’ve contributed more than you have… Reinforce these by commenting on “the paper you’re writing with X” to everyone else so they think it actually is a joint paper;
- Give your work away because you’re too lazy to write it up – start a mentoring or academic writing scheme, write half-baked, unfinished articles and get unpublished academics, or academics playing variants of the games above, to finish them off for you.
- Identify someone who has a lot of started but not quite finished papers and offer to help bring them to completion in exchange for an authorship credit.
Note that some of the options may be complementary and allow two people to exploit each other…
Trying to clear my head of code on a dog walk after a couple of days tinkering with the nomis API and I started to ponder what an API is good for.
Chris Gutteridge and Alex Dutton’s open data excuses bingo card and Owen Boswarva’s Open Data Publishing Decision Tree both suggest that not having an API can be used as an excuse for not publishing a dataset as open data.
So what is an API good for?
I think one naive view is that this is what an API gets you…
It doesn’t of course, because folk actually want this…
Which is not necessarily that easy even with an API:
For a variety of reasons…
Even when the discovery part is done and you think you have a reasonable idea of how to call the API to get the data you want out of it, you’re still faced with the very real practical problem of how to actually get the data into the analysis environment in a form that is workable in that environment. Just because you publish standards based SDMX flavoured XML doesn’t mean anything to anybody if they haven’t got an “import from SDMX flavoured XML directly into some format I know how to work with” option.
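To make the missing “import from X” step concrete, here’s a minimal sketch of what it might look like to go from an SDMX-flavoured XML response to something workable in a pandas environment. The XML fragment below is a simplified, hypothetical SDMX-like layout (not any agency’s actual schema), just to show the shape of the chore a user is left with when no import option exists:

```python
# Sketch: flattening a hypothetical SDMX-like XML fragment into a pandas
# DataFrame. The element/attribute names here are illustrative only.
import xml.etree.ElementTree as ET
import pandas as pd

xml_doc = """<DataSet>
  <Obs AREA="E06000001" TIME="2014" OBS_VALUE="1200"/>
  <Obs AREA="E06000002" TIME="2014" OBS_VALUE="3400"/>
</DataSet>"""

def sdmx_like_to_df(xml_text):
    """Flatten <Obs> attribute records into a typed pandas DataFrame."""
    root = ET.fromstring(xml_text)
    records = [obs.attrib for obs in root.iter("Obs")]
    df = pd.DataFrame(records)
    # Everything arrives as strings; numeric columns need explicit coercion.
    df["OBS_VALUE"] = pd.to_numeric(df["OBS_VALUE"])
    return df

df = sdmx_like_to_df(xml_doc)
print(df)
```

Even this trivial case needs the user to know the element names, handle type coercion, and so on – exactly the friction an “import from X” option would hide.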
And even then, once the data is in, the problems aren’t over…
(I’m assuming the data is relatively clean and doesn’t need any significant amount of cleaning, normalising, standardising, type-casting, date parsing etc etc. Which is of course likely to be a nonsense assumption;-)
So what is an API good for, and where does it actually exist?
I’m starting to think that for many users, if there isn’t some equivalent of an “import from X” option in the tool they are using or environment they’re working in, then the API-for-X is not really doing much of a useful job for them.
Also, if there isn’t a discovery tool they can use from the tool they’re using or environment they’re working in, then finding data from service X turns into another chore that takes them out of their analysis context and essentially means that the API-for-X is not really doing much of a useful job for them.
What I tried to do in doodling the Python / pandas Remote Data Access Wrapper for the Nomis API was to create for myself some tools that would help me discover various datasets on the nomis platform from my working environment – an IPython notebook – and then fetch any data I wanted from the platform into that environment in a form in which I could immediately start working with it – which is to say, typically as a pandas dataframe.
I haven’t started trying to use it properly yet – and won’t get a chance to for a week or so at least now – but that was the idea. That is, the wrapper should support the discovery and access parts of the conversation I want to have with the nomis data from within my chosen environment. That’s what I want from an API. Maybe?!;-)
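For a flavour of the kind of interface I mean, here’s a sketch of a discovery-plus-access wrapper. The function names and the catalogue/data paths are illustrative placeholders, not the actual wrapper’s API, and the fetcher is injectable so the sketch runs against a stub rather than making live calls:

```python
# Sketch: a "conversational" data wrapper - discover datasets and pull data
# straight into pandas from the working environment. All names and endpoint
# paths below are hypothetical, for illustration only.
import io
import pandas as pd

def search_datasets(keyword, fetch):
    """Return a DataFrame of (id, name) records whose name matches keyword."""
    catalogue = fetch("dataset/def")  # fetch returns a list of dicts here
    hits = [d for d in catalogue if keyword.lower() in d["name"].lower()]
    return pd.DataFrame(hits)

def get_data(dataset_id, fetch):
    """Fetch a dataset as CSV text and load it straight into pandas."""
    csv_text = fetch(f"dataset/{dataset_id}.data.csv")
    return pd.read_csv(io.StringIO(csv_text))

# Stub fetcher standing in for HTTP calls to a data platform:
def stub_fetch(path):
    if path == "dataset/def":
        return [{"id": "NM_1_1", "name": "Jobseeker's Allowance claimants"},
                {"id": "NM_2_1", "name": "Population estimates"}]
    return "area,value\nE06000001,1200\nE06000002,3400\n"

print(search_datasets("population", stub_fetch))
print(get_data("NM_2_1", stub_fetch))
```

The point is the shape of the conversation: search from inside the notebook, get a dataframe back, start working immediately.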
And note further – this does not mean that things like a pandas Remote Data Access plugin or a CRAN package for R (such as the World Bank Development Indicators package, or any of the other data/API packages referenced from the rOpenSci packages list) should be seen as extensions of the API. At worst, they should be seen as projections of the API into user environments. At best, it is those packages that should be seen as the actual API.
APIs for users – not programmers. That’s what I want from an API.
PS See also this response from @apievangelist: The API Journey.
A couple of days ago, I came across a dataset on figshare (a data sharing site) detailing the article processing charges (APCs) paid by the University of Portsmouth to publishers in 2014. After I casually (lazily…;-) remarked on the existence of this dataset via Twitter, Owen Stephens/@ostephens referred me to a JISC project that is looking at APCs in more detail, with prototype data explorer here: All APC demonstrator [Github repository].
The project looks as if it is part of Jisc Collections’ look at the Total Cost of Ownership in the context of academic publishing, summing things like journal subscription fees alongside “article processing charges” (which I’d hope include page charges?).
If you aren’t in academia, you may not realise that what used to be referred to as ‘vanity publishing’ (paying to get your first novel or poetry collection published) is part of the everyday practice of academic publishing. But it isn’t called that, obviously, because your work also has to be peer reviewed by other academics… So it’s different. It’s “quality publishing”.
Peer review is, in part, where academics take on the ownership of the quality aspects of academic publishing, so if the Total Cost of Ownership project is trying to be relevant to institutions and not just to JISC, I wonder if there should also be columns in the costing spreadsheet relating to the work time academics spend reviewing other people’s articles, editing journals, and so on. This is separate from the presentational costs: you can’t just write a paper and submit it, you have to submit it in an appropriately formatted document and “camera ready” layout, which can also add a significant amount of time to preparing a paper for publication. So you do the copy editing and layout too. And so any total costing to an academic institution of the research publishing racket should probably include this time too. But that’s by the by.
The data that underpins the demonstrator application was sourced from a variety of universities and submitted in spreadsheet form. A useful description (again via @ostephens) of the data model can be found here: APC Aggregation: Data Model and Analytical Usage. Looking at it, it seems just to cover APCs.
APC data relating to the project can be found on figshare. I haven’t poked around in the demonstrator code or watched its http traffic to see if there are API calls onto the aggregated data that provide another way in to it.
As well as page charges, there are charges associated with subscription fees to publishers. Publishers don’t like this information getting out on grounds of commercial sensitivity, and universities don’t like publishing it presumably on grounds of bringing themselves into disrepute (you spend how much?!), but there is some information out there. Data from a set of FOI requests about journal subscriptions (summarised here), for example. If you want to wade through some of the raw FOI responses yourself, have a look on WhatDoTheyKnow: FOI requests: “journal costs”.
Tim Gowers also wrote compellingly about his FOI escapades trying to track down journal subscription costs data: Elsevier journals – some facts.
This is all very well, but is it in any way useful? I have no idea. One thing I imagined that might be quite amusing to explore was the extent to which journal subscriptions paid their way (or were “cost effective”). For example, looking at institutional logs, how often are (articles from) particular journals being accessed or downloaded either for teaching or research purposes? (Crudely: teaching – access comes from a student account; research – access from a research account.) On the other hand, for the research outputs of the institution, how many things are being published into a particular journal, and how many citations appear in those outputs to other publications.
If we take the line that use demonstrates value, then use might be captured as downloads, publications into, or references into a journal. (That’s very crude, but then I’m approaching this as a possible recreational data exercise, not a piece of formal research. And yes – I know, journals are often bundled up in subscription packages together, and just like Sky blends dross with desirable channels in its subscription deals, I suspect academic publishers do too… But then, we could start to check these based on whether particular journals in a bundle are ever accessed, ever referenced, ever published into within a particular organisation, etc. Citation analysis can also help here – for example, if 5 journals all heavily cite each other, and one publisher publishes 3 of those, it could make sense for them to bundle two of the journals into one package and the third into another, so if you’re researching topics that are reported by heavily linked articles across those journals, they can essentially force people researching that topic into subscribing to both packages. Without having a look at citation network analyses and subscription bundles, I can’t check that outlandish claim of course;-)
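The “do subscriptions pay their way” calculation above could be sketched very crudely as a cost-per-use league table. All figures below are invented for illustration, and treating downloads, publications into, and citations as interchangeable “use events” is exactly the sort of crudeness flagged above:

```python
# Sketch: a crude cost-per-use league table for journal subscriptions.
# Journal names, fees and usage counts are all made up.
import pandas as pd

journals = pd.DataFrame({
    "journal": ["J. Improbable Results", "Ann. Applied Tinkering", "Rev. Openness"],
    "subscription": [12000, 4500, 800],   # annual fee, GBP (invented)
    "downloads": [150, 900, 40],
    "published_into": [2, 10, 0],
    "cited": [30, 120, 5],
})

# Count every recorded interaction as a "use event", then rank by cost per event:
journals["uses"] = journals[["downloads", "published_into", "cited"]].sum(axis=1)
journals["cost_per_use"] = journals["subscription"] / journals["uses"]
print(journals.sort_values("cost_per_use")[["journal", "cost_per_use"]])
```

Even a toy table like this would surface the expensive-but-unused titles that bundle pricing tends to hide.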
Erm… that’s it…
PS see also Evaluating big deal journal bundles (via @kpfssport)
PPS for a view from the publishers’ side on the very real costs associated with publishing, as well as a view on how academia and business treat employment costs and “real” costs in rather contrasting ways, see Time is Money: Why Scholarly Communication Can’t Be Free.
Chasing the thought of Frictionless Data Analysis – Trying to Clarify My Thoughts, I wonder: how about if, in addition to the datapackage.json specification, there was a data analysis package or data analysis toolkit package specification? Perhaps the latter might be something that unpacks rather like the fig.yml file described in Using Docker to Build Linked Container Course VMs, and the former a combination of a datapackage and a data analysis toolkit package, that downloads a datapackage and opens it into a toolkit configuration specified by data analysis toolkit package. We’d perhaps also want to be able to define a set of data analysis scripts (data analysis script package???) relevant to working with a particular datapackage in the specified tools (for example, some baseline IPython notebooks or R/Rmd scripts?)
Prompted by a conversation with Rufus Pollock over lunch today, in part about data containerisation and the notion of “frictionless” data that can be easily discovered and is packaged along with metadata that helps you to import it into other tools or applications (such as a database), I’ve been confusing myself about what it might be like to have a frictionless data analysis working environment, where I could do something like write fda --datapackage http://example.com/DATAPACKAGE --db postgres --client rstudio ipynb and that would then:
- generate a fig script (eg as per something like Using Docker to Build Linked Container Course VMs);
- download the data package from the specified URL, unbundle it, generate an SQL init file for the specified database, fire up the database, and use the generated SQL file to configure it – creating any necessary tables and loading the data in;
- fire up any specified client applications (IPython notebook and RStudio server in this example) and ideally seed them with SQL magic or database connection statements, for example, that automatically define an appropriate data connection to the database that’s just been configured;
- launch browser tabs that contain the clients;
- it might also be handy to be able to mount local directories against directory paths in the client applications, so I could have my R scripts in one directory of my own desktop, IPython notebooks in another, and then have visibility of those analysis scripts from the actual client applications.
The idea is that from a single command I can pull down a datafile, ingest it into a database, fire up one or more clients that are connected to that database, and start working with the data immediately. It’s not so different to double clicking on a file on your desktop and launching it into an application to start working on it, right?!
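As a rough sketch of what that single command might expand into behind the scenes, here’s a toy generator for a fig-style config along the lines of the steps above. The service names, image names and config layout are all hypothetical placeholders, not a real fig or docker-compose schema:

```python
# Sketch: what an `fda` command might generate - a fig.yml-style config
# wiring client containers to a database seeded from a datapackage.
# All names/images below are illustrative placeholders.
def build_fig_config(datapackage_url, db="postgres", clients=("ipynb",)):
    """Return a fig.yml-style string linking clients to a seeded database."""
    lines = [
        f"{db}db:",
        f"  image: {db}",
        "  environment:",
        f"    - DATAPACKAGE_URL={datapackage_url}",  # init script would ingest this
    ]
    for client in clients:
        lines += [
            f"{client}:",
            f"  image: {client}-client",
            "  links:",
            f"    - {db}db:db",  # each client sees the database as 'db'
        ]
    return "\n".join(lines)

print(build_fig_config("http://example.com/DATAPACKAGE",
                       clients=("ipynb", "rstudio")))
```

The real work, of course, is in the init script that turns the datapackage into tables – but the shape of the orchestration is about this simple.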
Can’t be that hard to wire up, surely?!;-) But would it be useful?
PS See also a further riff on this idea: Data Analysis Packages…?
I’ve been in a ranty mood all day today, so to finish it off, here are some thoughts about how we can start to use #opendata to hold companies to account. The trigger was finding a dataset released by the Care Quality Commission (CQC) listing the locations of premises registered with the CQC, and the operating companies of those locations (early observations on that data here).
The information is useful because it provides a way of generating aggregated lists of companies that are part of the same corporate group (for example, locations operated by Virgin Care companies, or companies operated by Care UK). When we have these aggregation lists, it means we can start to run the numbers across all the companies in a corporate group, and get some data back about how the companies that are part of a group are operating in general. The aggregated lists thus provide a basis for looking at the gross behaviour of a particular company. We can then start to run league tables against these companies (folk love league tables, right? At least, they do when it comes to public sector bashing). So we can start to see how the corporate groupings compare against each other, and perhaps also against public providers. Of course, there is a chance that the private groups will be shown to be performing better than public sector bodies, but that could be a useful basis for a productive conversation about why…
So what sorts of aggregate lists can we start to construct? The CQC data allows us to get lists of locations associated with various sorts of care delivery (care home, GP services, dentistry, more specialist services) and identify locations that are part of the same corporate group. For example, I notice that filtering the CQC data to care homes, the following are significant operators (the number relates to the number of locations they operate):
- Voyage 1 Limited: 273
- HC-One Limited: 169
- Barchester Healthcare Homes Limited: 168
When it comes to “brands”, we have the following multiple operators:
- BRAND Four Seasons Group: 346
- BRAND Voyage: 279
- BRAND BUPA Group: 246
- BRAND Priory Group: 183
- BRAND HC-One Limited: 169
- BRAND Barchester Healthcare: 168
- BRAND Care UK: 130
- BRAND Caretech Community Services: 118
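The counts above are just a grouping operation over the register. As a toy sketch of the sort of league table I have in mind (the rows below are invented examples, not actual CQC records, and the numeric “rating” column is a hypothetical quality score):

```python
# Sketch: a brand-level league table from register-style data.
# Toy rows only; "rating" is an invented numeric quality score.
import pandas as pd

locations = pd.DataFrame({
    "location": ["Home A", "Home B", "Home C", "Home D", "Home E"],
    "brand": ["BRAND Voyage", "BRAND Voyage", "BRAND Care UK",
              "BRAND Voyage", "BRAND Care UK"],
    "rating": [3, 4, 2, 5, 3],
})

# Aggregate per brand: how many locations, and how do they score on average?
league = (locations.groupby("brand")
          .agg(locations=("location", "count"), mean_rating=("rating", "mean"))
          .sort_values("locations", ascending=False))
print(league)
```

Once real inspection scores are joined on, the same groupby gives the gross behaviour of each corporate grouping.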
For these operators, we could start to scrape their most recent CQC reports and build up a picture of how well the group as a whole is operating. In the same way that “armchair auditors” (whatever they are?!) are supposed to be able to hold local councils to account, perhaps they can do the same for companies, and give the directors a helping hand… (I would love to see open data activists buying a share and going along to a company shareholder meeting to give some opendata powered grief ;-)
Other public quality data sites provide us with hints at ways of generating additional aggregations. For example, from the Food Standards Agency, we can search on ‘McDonalds’ as a restaurant to bootstrap a search into premises operated by that company (although we’d probably also need to add in searches across takeaways, and perhaps also look for things like ‘McDonalds Ltd’ to catch more of them?).
Note – the CQC data provides a possible steer here for how other data sets might be usefully extended in terms of the data they make available. For example, having a field for “operating company” or “brand” would make for more effective searches across branded or operated food establishments. Having company number (for limited companies and LLPs etc) provided would also be useful for disambiguation purposes.
Hmm, I wonder – would it make sense to start to identify the information that makes registers useful, and that we should start to keep tabs on? We could then perhaps start lobbying for companies to provide that data, and check that such data is being and continues to be collected? It may not be a register of beneficial ownership, but it would provide handy cribs for trying to establish what companies are part of a corporate grouping…
(By the by, picking up on Owen Boswarva’s post The UK National Information Infrastructure: It’s time for the private sector to release some open data too, these registers provide a proxy for the companies releasing certain sorts of data. For example, we can search for ‘Tesco’ as a supermarket on the FSA site. Of course, if companies were also obliged to publish information about their outlets as open data – something you could argue that as a public company they should be required to do, trading their limited liability for open information about where they might exert that right – we could start to run cross-checks (which is the sort of thing real auditors do, right?) and publish complete records of publicly accountable performance in terms of regulated quality inspections.)
The CQC and Food Standards Agency both operate quality inspection registers, so what other registers might we go to to build up a picture of how companies – particularly large corporate groupings – behave?
The Environment Agency publish several registers, including one detailing enforcement actions, which might be interesting to track, though I’m not sure how the data is licensed? The HSE (Health & Safety Executive) publish various notices by industry sector and subsector, but again, I’m not too clear on the licensing? The Chief Fire Officers Association (CFOA) publish a couple of enforcement registers which look as if they cover some of the same categories as the CQC data – though how easy it would be to reconcile the two registers, I don’t know (and again, I’m not clear how the register is actually licensed). One thing to bear in mind is that where registers contain personally identifiable information, any aggregations we build that incorporate such data (if we are licensed to build such things) mean (I think) that we become data controllers for the purposes of the Data Protection Act (we are not the maintainers and publishers of the public register, so we don’t benefit from the exemptions associated with that role).
Looking at the above, I’m starting to think it could be a really interesting exercise to pick some of the care home provider groups and have a go at aggregating any applicable quality scores and enforcement notices from the CQC, FSA, HSE and CFOA (and even the EA if any of their notices apply! Hmm… does any HSCIC data cover care homes at all too?) Coupled with this, a trawl of directors data to see how the separate companies in a group connect by virtue of directors (and what other companies may be indicated by common directors in a group?).
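The directors trawl at the end of that paragraph boils down to looking for companies whose boards overlap. A minimal sketch, using invented companies and director names rather than real filings data:

```python
# Sketch: find pairs of companies connected by common directors.
# The company -> directors mapping below is invented toy data.
from itertools import combinations

directors = {
    "Acme Care Ltd": {"A Smith", "B Jones"},
    "Acme Homes Ltd": {"B Jones", "C Patel"},
    "Unrelated Ltd": {"D Li"},
}

def shared_director_pairs(directors):
    """Return (company1, company2, shared_names) for each overlapping board."""
    pairs = []
    for (c1, board1), (c2, board2) in combinations(directors.items(), 2):
        common = board1 & board2  # set intersection of the two boards
        if common:
            pairs.append((c1, c2, common))
    return pairs

print(shared_director_pairs(directors))
```

Run over real directors data, the overlapping pairs start to sketch out the corporate group structure – and hint at other companies connected to a group by common directors.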
Other areas perhaps worth exploring – farms incorporated into agricultural groups? (Where would we find that data? One register that could be used to partially hold those locations to account may be the public register of pesticide enforcement notices, as well as other EA notices?)
As well as registers, are there any other sources of information about companies we can add into the mix? There are lots: for limited companies we can pull down company registration details and lists of directors (and perhaps struck off directors) and some accounting information. Data about charities should be available from the Charities Commission. The HSCIC produces care quality indicators for a range of health providers, as well as prescribing data for individual GP practices. Data is also available about some of the medical trials that particular practices are involved in.
At a local council level, local councils maintain and publish a wide variety of registers, including registers of gaming machine licenses, licensed premises and so on. Where the premises are an outlet of a parent corporate group, we may be able to pick up the name of the parent group as the licensee. (Via @OwenBoswarva, it seems the Gambling Commission has a central list of operating license holders and licensed premises.)
Having identified influential corporate players, we might then look to see whether those same bodies are represented on lobbyist groups, such as the EU register of commission expert groups, or as benefactors of UK Parliamentary All Party groups, or as parties to meetings with Ministers etc.
We can also look across all those companies to see how much money the corporate groups are drawing from the public sector, by inspecting who payments are made to in the masses of transparency spending data that councils, government departments, and services such as the NHS publish. (For an example of this, see Spend Small Local Authority Spending Index; unfortunately, the bulk data you need to run this sort of analysis yourself is not openly available – you need to aggregate and clean it yourself.)
Once we start to get data that lists companies that are part of a group, we can start to aggregate open public data about all the companies in the group and look for patterns of behaviour within the groups, as well as across them. Lapses in one part of the group might suggest a weakness in high level management (useful for the financial analysts?), or act as a red flag for inspection and quality regimes.
Hmmm… methinks it’s time to start putting some of this open data to work; but put it to work by focussing on companies, rather than public bodies…
I think I also need to do a little bit of digging around how public registers are licensed? Should they all be licensed OGL by default? And what guidance, if any, is there around how we can make use of such data and not breach the Data Protection Act?
PS via @RDBinns, What do they know about me? Open data on how organisations use personal data, describing some of the things we can find from the data protection notifications published by the ICO [ICO data controller register].
Via Downes, I like this idea of Flipping Bloom’s Taxonomy Triangle which draws on the following inverted pyramid originally posted here: Simplified Bloom’s Taxonomy Visual and comments on a process in which “students are spending the majority of their time in the Creating and Evaluating levels of Bloom’s Taxonomy, and they go down into the lower levels to acquire the information they need when they need it” (from Jonathan Bergmann and Aaron Sams’ Flip Your Classroom: Reach Every Student In Every Class Every Day, perhaps?)
Here’s another example, from a blog post by education consultant Scott Mcleod: Do students need to learn lower-level factual and procedural knowledge before they can do higher-order thinking?, or this one by teacher Shelley Wright: Flipping Bloom’s Taxonomy.
This makes some sort of sense to me, though if you (mistakenly?) insist on reading it as a linear process it lacks the constructivist context that shows how some knowledge and understanding can be used to inform the practice of the playful creating/evaluating/analysing exploratory layer, which might in itself be directed at trying to illuminate a misunderstanding or confusion the learner has with respect to their own knowledge at the understanding level. (In fact, the more I look at any model the more issues I tend to get with it when it comes to actually picking it apart!;-)
As far as “remembering” goes, I think that also includes “making up plausible stories or examples” – i.e. constructed “rememberings” (that is, stories) of things that never happened.