A few days ago, I saw via the @HSCICOpenData Twitter feed that an annually released dataset on Written Complaints in the NHS has just been published.
The data comes in the form of a couple of spreadsheets in which each row describes a count of the written complaints received and upheld under a variety of categories for each GP and dental practice, or local NHS trust.
The practice level spreadsheet looks like this:
Each practice is identified solely by a practice code – to find the name and address of the actual practice requires looking up the code in another dataset.
The column headings supplied in the CSV document only partially identify each column (and indeed, there are duplicates, such as Total number of written complaints received, that a spreadsheet reader might disambiguate by adding a numerical suffix to) – a more complete description (showing how the columns are actually hierarchically defined) is provided in an associated metadata spreadsheet.
For a reporter wanting to know whether or not any practices in their area fared particularly badly in terms of the number of upheld complaints, the task might be broken down as follows:
- identify the practices of interest from their practice codes (which requires finding a set of practice codes of interest);
- for each of those practices, look along the row to see whether or not there are any large numbers in the complaints upheld column.
But if you have a spreadsheet with 10, 20, 30 or more columns, scanning along a row looking for items of interest can rapidly become quite a daunting task.
So an idea I’ve been working on, which I suspect harkens back to the earliest days of database reporting, is to look at ways of turning each row of data into a text based, human readable report.
Something like the following, for example:
Each record, each “Complaint Report”, is a textual rendering of a single row from the practice complaints spreadsheet, with a bit of administrative metadata enrichment in the form of the practice name, address (and in later versions, telephone number).
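The row-to-report idea is easy enough to sketch in code. The column names, practice details and report wording below are all made up for illustration – the real HSCIC sheet has many more columns and different headings:

```python
# Minimal sketch of turning spreadsheet rows into readable "Complaint Report" text.
# The column names and sample rows are illustrative, not the actual HSCIC headings;
# in practice the rows would come from the published CSV, enriched with the
# practice name and address from the administrative metadata lookup.
import pandas as pd

df = pd.DataFrame([
    {"Practice_Code": "J84003", "Practice_Name": "Example Surgery",
     "Complaints_Received": 12, "Complaints_Upheld": 3},
    {"Practice_Code": "J84999", "Practice_Name": "Another Practice",
     "Complaints_Received": 0, "Complaints_Upheld": 0},
])

def complaint_report(row):
    # Only say something substantive when there is something to report.
    if row["Complaints_Received"] == 0:
        return "{Practice_Name} ({Practice_Code}): no written complaints received.".format(**row.to_dict())
    return ("{Practice_Name} ({Practice_Code}) received {Complaints_Received} "
            "written complaints, of which {Complaints_Upheld} were upheld.").format(**row.to_dict())

for _, row in df.iterrows():
    print(complaint_report(row))
```

The reports could then be sorted on the upheld count before printing, so the practices most worth a phone call float to the top.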
These reports are quicker to scan, and could be sorted or highlighted depending on the number of upheld complaints, for example. The journalist can then quickly review the reports and identify any practices that might be worth phoning up for a comment, to ask why they appear to have received a large number of upheld complaints in a particular area, for example… Data driven press releases used to assist reporting, in other words.
FWIW, I popped up a sketch script that generates the above report from the data, and also pulls in practice administrative metadata from an epracurr spreadsheet, here: NHS complaints spreadsheet2text sketch. See also: Data Driven Press Releases From HSCIC Data – Diabetes Prescribing.
PS I’m not a Microsoft Office suite user, but I suspect you can get a fair way along this sort of process by using a mail merge? There may be other ways of generating templated reports too. Any Microsoft Office users fancy letting me know how you’d go about doing something like the above in Word and Excel? I’d guess the complicating factors are the requirements to make use of the column headers and only display the items associated with non-zero counts, which perhaps requires some macro magic? Things could perhaps be simplified by reshaping the data, putting it into a long form by melting the complaints columns – or melting them cannily to provide two value columns, one for complaints received and one for complaints upheld?
Then you could filter out the blank rows before the merge.
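For the non-Excel route, that canny melt is straightforward in pandas. Again, the column names here are invented stand-ins for the real received/upheld column pairs:

```python
# Sketch of reshaping the complaints data into long form, as suggested above.
# Column names are made up for illustration; the real sheet has many
# <category>_Received / <category>_Upheld column pairs.
import pandas as pd

wide = pd.DataFrame([
    {"Practice_Code": "J84003",
     "Clinical_Received": 5, "Clinical_Upheld": 2,
     "Staff_Received": 0, "Staff_Upheld": 0},
])

# Melt everything to long form, split the "Category_Measure" column names,
# then pivot the measure back out to give two value columns per category.
long = wide.melt(id_vars="Practice_Code", var_name="col", value_name="count")
long[["Category", "Measure"]] = long["col"].str.rsplit("_", n=1, expand=True)
long = (long.pivot_table(index=["Practice_Code", "Category"],
                         columns="Measure", values="count")
            .reset_index())

# Filter out the blank (all-zero) category rows before any mail-merge style step.
long = long[(long["Received"] > 0) | (long["Upheld"] > 0)]
print(long)
```

The result has one row per practice/category pair, with Received and Upheld columns – exactly the shape a merge template, or the text report generator, wants to iterate over.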
In the first part of this series (Notes on Robot Churnalism, Part I – Robot Writers), I reviewed some of the ways in which robot writers are able to contribute to the authoring of news content.
In this part, I will consider some of the impacts that might arise from robots entering the workplace.
Robot Journalism in the Workplace
“Robot journalists” have some competitive advantages which are hard for human journalists to compete with. The strengths of automated content generation are the low marginal costs, the speed with which articles can be written and the broad spectrum of sport events which can be covered.
Arjen van Dalen, The Algorithms Behind the Headlines, Journalism Practice, 6:5-6, 648-658, 2012, p652
One thing machines do better is create value from large amounts of data at high speed. Automation of process and content is the most under-explored territory for reducing costs of journalism and improving editorial output. Within five to 10 years, we will see cheaply produced information monitored on networks of wireless devices.
Post Industrial Journalism: Adapting to the Present, Chris Anderson, Emily Bell, Clay Shirky, Tow Center for Digital Journalism Report, December 3, 2014
Year on year, it seems, the headlines report how the robots are coming to take over a wide range of professional jobs and automate away the need to employ people to fill a wide range of currently recognised roles (see, for example, this book: The Second Machine Age [review], this Observer article: Robots are leaving the factory floor and heading for your desk – and your job, this report: The Future of Employment: How susceptible are jobs to computerisation? [PDF], this other report: AI, Robotics, and the Future of Jobs [review], and this business case: Rethink Robotics: Finding a Market).
Stories also abound fearful of a possible robotic takeover of the newsroom: ‘Robot Journalist’ writes a better story than human sports reporter (2011), The robot journalist: an apocalypse for the news industry? (2012), Can an Algorithm Write a Better News Story Than a Human Reporter? (2012), Robot Writers and the Digital Age (2013), The New Statesman could eventually be written by a computer – would you care? (2013), The journalists who never sleep (2014), Rise of the Robot Journalist (2014), Journalists, here’s how robots are going to steal your job (2014), Robot Journalist Finds New Work on Wall Street (2015).
It has to be said, though, that many of these latter “inside baseball” stories add nothing new, perhaps reflecting the contributions of another sort of robot to the journalistic process: web search engines like Google…
Looking to the academic literature, in his 2015 case study around Narrative Science, Matt Carlson describes how “public statements made by its management reported in news about the company reveal two commonly expressed beliefs about how its technology will improve journalism: automation will augment— rather than displace — human journalists, and it will greatly expand journalistic output” p420 (Matt Carlson (2015), The Robotic Reporter, Digital Journalism, 3:3, 416-431).
As with the impact of many other technological innovations within the workplace, “[a]utomated journalism’s ability to generate news accounts without intervention from humans raises questions about the future of journalistic labor” (Carlson, 2015, p422). In contrast to the pessimistic view that “jobs will be lost”, there are at least two possible positive outcomes for jobs that may result from the introduction of a new technology: firstly, that the technology helps transform the original job, in so doing making it more rewarding, or allowing the original worker to “do more”; secondly, that the introduction of the new technology creates new roles and new job opportunities.
On the pessimistic side, Carlson describes how:
many journalists … question Narrative Science’s prediction that its service would free up or augment journalists, including Mathew Ingram (GigaOm, April 25, 2012): “That’s a powerful argument, but it presumes that the journalists who are ‘freed up’ because of Narrative Science … can actually find somewhere else that will pay them to do the really valuable work that machines can’t do. If they can’t, then they will simply be unemployed journalists.” This view challenges the virtuous circle suggested above to instead argue that some degree of displacement is inevitable. (Carlson, 2015, p423)
On the other hand:
[a]ccording to the more positive scenario, machine-written news could be complementary to human journalists. The automation of routine tasks offers a variety of possibilities to improve journalistic quality. Stories which cannot be covered now due to lack of funding could be automated. Human journalists could be liberated from routine tasks, giving them more time to spend on quality, in-depth reporting, investigative reporting. (van Dalen, p653)
This view thus represents the idea of algorithms working alongside human journalists, freeing them up from the mundane tasks and allowing them to add more value to a story… If a journalist has 20 minutes to spend on a story, and that time is spent searching a database and pulling out a set of numbers that may not even be very newsworthy, how much more journalistically productive could that journalist be if a machine gave them the data and a canned summary of it for free, allowing the journalist to use the few minutes allocated to that story to take the next step – adding in some context, perhaps, or contacting a second source for comment?
A good example of the time-saving potential of automated copy production can be seen in the publication of earnings reports by AP, as reported by trade blog journalism.co.uk, who quoted vice president and managing editor Lou Ferrara’s announcement of a tenfold increase in stories from 300 per quarter produced by human journalists, to 3,700 with machine support (AP uses automation to increase story output tenfold, June, 2015).
The process AP went through during testing appears to be one that I’m currently exploring with my hyperlocal, OnTheWight, for producing monthly JobSeekers Allowance reports (here’s an example of the human produced version, which in this case was corrected after a mistake was spotted when checking that an in-testing machine generated version of the report was working correctly…! As journalism.co.uk reported about AP, “journalists were doing all their own manual calculations to produce the reports, which Ferrara said had ‘potential for error’.” Exactly the same could have been said of the OnTheWight process…)
In the AP case, “during testing, the earnings reports were produced via automation and journalists compared them to the relevant press release and figured out bugs before publishing them. A team of five reporters worked on the project, and Ferrara said they still had to check for everything a journalist would normally check for, from spelling mistakes to whether the calculations were correct.” (I wonder if they check the commas, too?!) The process I hope to explore with OnTheWight builds in the human checking route, taking the view that the machine should generate press-release style copy that does the grunt work in getting the journalist started on the story, rather than producing the complete story for them. At AP, it seems that automation “freed up staff time by one fifth”. The process I’m hoping to persuade OnTheWight to adopt is that to begin with, the same amount of time should be spent on the story each month, but month on month we automate a bit more and the journalistic time is then spent working up what the next paragraph might be, and then in turn automate the production of that…
Extending the Promise?
In addition to time-saving, there is the hope that the wider introduction of robot journalists will create new journalistic roles:
Beyond questions of augmentation or elimination, Narrative Science’s vision of automated journalism requires the transformation of journalistic labor to include such new positions as “meta-writer” or “metajournalist” to facilitate automated stories. For example, Narrative Science’s technology can only automate sports stories after journalists preprogram it with possible frames for sports stories (e.g., comeback, blowout, nail-biter, etc.) as well as appropriate descriptive language. After this initial programming, automated journalism requires ongoing data management. Beyond the newsroom, automated journalism also redefines roles for non-journalists who participate in generating data. (Carlson, 2015, p423)
In the first post of this series, I characterised the process used by Narrative Science, which included the application of rules for detecting signals and angles, and the linkage of detected “facts” to story points within a particular angle that could then be used to generate a narrative told through automatically generated natural language. Constructing angles, identifying logical processes that can identify signals and map them on to story elements, and generating turns of phrase that can help explicate narratives in a natural way are all creative acts that are likely to require human input for the near future at least, albeit tasking the human creative with the role of supporting the machine. This is not necessarily that far removed from some of the skills already employed by journalists, however. As Carlson suggests, “Scholars have long documented the formulaic nature underlying compositional forms of news exposed by the arrival of automated news. … much journalistic writing is standardized to exclude individual voice. This characteristic makes at least a portion of journalistic output susceptible to automation” (p425). What’s changing, perhaps, is that now journalists must learn to capture those standardised forms and map them onto structures that act as programme fodder for their robot helpers.
Narrative Science also see potential in increasing the size of the total potential audience by accommodating the very specific needs of a large number of niche audiences.
“While Narrative Science flaunts the transformative potential of automated journalism to alter both the landscape of available news and the work practices of journalists, its goal when it comes to compositional form is conformity with existing modes of human writing. The relationship here is telling: the more the non-human origin of its stories is undetectable, the more it promises to disrupt news production. But even in emulating human writing, the application of Narrative Science’s automation technology to news prompts reconsiderations of the core qualities underpinning news composition. The attention to the quality and character of Narrative Science’s automated news stories reflects deep concern both with existing news narratives and with how automated journalistic writing commoditizes news stories.” Carlson, 2015, p424
In the midst of this mass of stories, it’s possible that there will be some “outliers” that are of more general interest which can, with some additional contextualisation and human reporting, be made relevant to a wider audience.
There is also the possibility of searching for “meta-stories” that tell not the specifics of particular cases, but identify trends across the mass of stories as a whole. (Indeed, it is by looking for such trends and patterns that outliers may be detected.) In addition, patterns that only become relevant when looking across all the individual stories might in turn lead to additional stories. (For example, a failing school operated by a particular provider is perhaps of only local interest, but if it turns out that the majority of schools operated by that provider were turned round from excellent to failing, questions might, perhaps, be worth asking…?!)
When it comes to the case for expanding the range of content that is available, Narrative Science’s hope appears to be that:
[t]he narrativization of data through sophisticated artificial intelligence programs vastly expands the terrain of news. Automated journalism becomes a normalized component of the news experience. Moreover, Narrative Science has tailored its promotional discourse to reflect the economic uncertainty of online journalism business models by suggesting that its technology will create a virtuous circle in which increased news revenue supports more journalists (Carlson, 2015, p 421).
The alternative, fearful view, of course, is that revenues will be protected by reducing the human wage bill, using robot content creators operating at a near zero marginal cost on particular story types to replace human content creation.
Whether news organisations will use automation to extend the range of producers in the newsroom, or contribute to the reduction of human creative input to the journalistic process, is perhaps still to be seen. As Anderson, Bell & Shirky noted, “the reality is that most journalists at most newspapers do not spend most of their time conducting anything like empirically robust forms of evidence gathering.” Perhaps now is the time for them to stop churning the press releases and statistics announcements – after all, the machines can do that faster and better – and concentrate more on contextualising and explaining the machine generated stories, as well as spending more time out hunting for stories and pursuing their own investigative leads?
In Some Notes on Churnalism and a Question About Two Sided Markets, I tried to pull together a range of observations about the process of churnalism, in which journalists propagate PR copy without much, if any, critique, contextualisation or corroboration.
If that view in any way represents a fair description of how some pre-packaged content, at least, makes its way through to becoming editorial content, where might the robots fit in? To what extent might we start to see “robot churnalism“, and what form or forms might it take?
There are two particular ways in which we might consider robot churnalism:
- “robot journalists” that produce copy which acts as a third conveyor belt, complementary to PA-style wire and PR feedstocks;
- robot churnalists as ‘reverse’ gatekeepers, choosing what wire stories to publish where based on traffic stats and web analytics.
A related view is taken by Philip Napoli (“Automated media: An institutional theory perspective on algorithmic media production and consumption.” Communication Theory 24.3 (2014): 340-360; a shorter summary of the key themes can be found here) who distinguishes roles for algorithms in “(a) media consumption and (b) media production”. He further refines the contributions algorithms may make in media production by suggesting that “[t]wo of the primary functions that algorithms are performing in the media production realm at this point are: (a) serving as a demand predictor and (b) serving as content creator.”
“Automated content can be seen as one branch of what is known as algorithmic news” writes Christer Clerwall (2014, Enter the Robot Journalist, Journalism Practice, 8:5, pp519-531), a key component of automated journalism “in which a program turns data into a news narrative, made possible with limited — or even zero — human input” (Matt Carlson (2015) The Robotic Reporter, Digital Journalism, 3:3, 416-431).
In a case study based around the activities of Narrative Science, a company specialising in algorithmically created, data driven narratives, Carlson further conceptualises “automated journalism” as “algorithmic processes that convert data into narrative news texts with limited to no human intervention beyond the initial programming”. He goes on:
The term denotes a split from data analysis as a tool for reporters encompassed in writings about “computational and algorithmic journalism” (Anderson 2013) to indicate wholly computer-written news stories emulating the compositional and framing practices of human journalism (ibid, p417).
Even several years ago, Arjen van Dalen observed that “[w]ith the introduction of machine-written news computational journalism entered a new phase. Each step of the news production process can now be automated: “robot journalists” can produce thousands of articles with virtually no variable costs” (The Algorithms Behind the Headlines, Journalism Practice, 6:5-6, 648-658, 2012, p649).
Sport and financial reporting examples abound from the bots of Automated Insights and Narrative Science (for example, Notes on Narrative Science and Automated Insights or Pro Publica: How To Edit 52,000 Stories at Once, and more recently e.g. Robot-writing increased AP’s earnings stories by tenfold), with robot writers generating low-cost content to attract page views, “producing content for the long tail, in virtually no time and with low additional costs for articles which can be produced in large quantities” (ibid, p649).
Although writing back in 2012, van Dalen noted in his report on “the responses of the journalistic community to automatic content creation” that:
[t]wo main reasons are mentioned to explain why automated content generation is a trend that needs to be taken seriously. First, the journalistic profession is more and more commercialized and run on the basis of business logics. The automation of journalism tasks fits in with the trend to aim for higher profit margins and lower production costs. The second reason why automated content creation might be successful is the quality of stories with which it is competing. Computer-generated news articles may not be able to compete with high quality journalism provided by major news outlets, which pay attention to detail, analysis, background information and have more lively language or humour. But for information which is freely available on the Internet the bar is set relatively low and automatically generated content can compete (ibid, p651).
As Christer Clerwall writes in Enter the Robot Journalist, (Journalism Practice, 8:5, 2014, pp519-531):
The advent of services for automated news stories raises many questions, e.g. what are the implications for journalism and journalistic practice, can journalists be taken out of the equation of journalism, how is this type of content regarded (in terms of credibility, overall quality, overall liking, to mention a few aspects) by the readers? p520.
van Dalen puts it thus:
Automated content creation is seen as serious competition and a threat for the job security of journalists performing basic routine tasks. When routine journalistic tasks can be automated, journalists are forced to offer a better product in order to survive. Central in these reflections is the need for journalists to concentrate on their own strengths rather than compete on the strengths of automated content creation. Journalists have to become more creative in their writing, offer more in-depth coverage and context, and go beyond routine coverage, even to a larger extent than they already do today (ibid, p653).
He then goes on to produce the following SWOT analysis to explore just how the humans and the robots compare:
One possible risk associated with the automated production of copy is that it becomes published without human journalistic intervention, and as such is not necessarily “known”, or even read, by any member at all of the publishing organisation. To paraphrase Daniel Jackson and Kevin Moloney, “Inside Churnalism: PR, journalism and power relationships in flux”, Journalism Studies, 2015, this would represent an extreme example of churnalism in the sense of “the use of unchecked [robot authored] material in news”.
This is dangerous, I think, on many levels. The more we leave the setting of the news agenda and the identification of news values to machines, the more we lose any sensitivity to what’s happening in the world around us and to which stories are actually important to an audience, as opposed to merely being Like-bait titillation. (As we shall see, algorithmic gatekeepers that channel content to audiences based on various analytics tools respond to one definition of what audiences value. But it is not clear that these are necessarily the same issues that might weigh more heavily in a personal-political sense. Reviews of the notion of “hard” vs. “soft” news (e.g. Reinemann, C., Stanyer, J., Scherr, S., & Legnante, G. (2011). Hard and soft news: A review of concepts, operationalizations and key findings. Journalism, 13(2), pp221–239) may provide lenses to help think about this more deeply?)
Of course, machines can also be programmed to look for links and patterns across multiple sources of information, and at far greater scale than a human journalist could hope to cover, but we are then in danger of creating some sort of parallel news world, where events are only recognised, “discussed” and acted upon by machines, and human actors are oblivious to them. (For an example, see The Wolf of Wall Tweet: A Web-reading bot made millions on the options market. It also ate this guy’s lunch, which describes how bots read the news wires and trade off the back of them. They presumably also read wire stories created by other bots…)
So What Is It That Robot Writers Actually Do All Day?
In a review of Associated Press’ use of Automated Insight’s Wordsmith application (In the Future, Robots Will Write News That’s All About You), Wired reported that Wordsmith “essentially does two things. First, it ingests a bunch of structured data and analyzes it to find the interesting points, such as which players didn’t do as well as expected in a particular game. Then it weaves those insights into a human readable chunk of text.”
One way of getting deeper into the mind of a robot writer is to look to the patents held by the companies who develop such applications. For example, in The Anatomy of a Robot Journalist, one process used by Narrative Science is characterised as follows:
Identifying newsworthy features (or story points) is a process of identifying features and then filtering out all but the ones that are somehow notable. Angles are possibly defined in terms of sets of features that need to be present within a particular dataset for that angle to provide a possible frame for a story. The process of reconciling interesting features with angle points populates the angle with known facts, and a story engine then generates the natural language text within a narrative structure suited to an explication of the selected angle.
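The feature-detection and angle-matching steps can be caricatured in a few lines of code. To be clear, the angles, thresholds and templates below are invented toy examples, not Narrative Science’s actual rules:

```python
# Toy sketch of the angle-matching process: detect features in the data, find
# an angle whose required features are all present, then fill a canned
# narrative template. All thresholds and phrasings are my own inventions.

def detect_features(game):
    feats = {}
    margin = game["home_score"] - game["away_score"]
    feats["home_win"] = margin > 0
    feats["blowout"] = abs(margin) >= 20
    feats["nail_biter"] = abs(margin) <= 3
    return feats

# Angles are tried in order; the first whose required features are all present
# frames the story. A catch-all "routine" angle matches anything.
ANGLES = [
    {"name": "blowout", "requires": ["home_win", "blowout"],
     "template": "{home} crushed {away} {home_score}-{away_score}."},
    {"name": "nail_biter", "requires": ["nail_biter"],
     "template": "{home} edged out {away} in a {home_score}-{away_score} thriller."},
    {"name": "routine", "requires": [],
     "template": "{home} beat {away} {home_score}-{away_score}."},
]

def write_story(game):
    feats = detect_features(game)
    for angle in ANGLES:
        if all(feats.get(f) for f in angle["requires"]):
            return angle["template"].format(**game)

game = {"home": "Rovers", "away": "United", "home_score": 98, "away_score": 71}
print(write_story(game))  # → "Rovers crushed United 98-71."
```

The real systems presumably layer far richer feature detection, angle selection heuristics and language variation on top, but the reconcile-facts-with-angle-then-render shape is the same.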
(An early – 2012 – presentation by Narrative Science’s Larry Adams also reviews some of the technicalities: Using Open Data to Generate Personalized Stories.)
In actual fact, the process may be a relatively straightforward one, as demonstrated by the increasing numbers of “storybots” that populate social media. One well known class of examples are earthquake bots that tweet news of earthquakes (see also: When robots help human journalists: “This post was created by an algorithm written by the author”). (It’s easy enough to see various newsworthiness filters might work here: a geo-based one for reporting a story locally, a wider interest one for reporting an earthquake above a particular magnitude, and so on.)
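Those newsworthiness filters are simple enough to sketch. The thresholds and the distance helper below are my own assumptions, not taken from any actual earthquake bot:

```python
# Sketch of the sort of newsworthiness filters an earthquake bot might apply:
# report locally for nearby quakes, report more widely above a magnitude
# threshold. Thresholds are invented for illustration.
import math

def distance_km(lat1, lon1, lat2, lon2):
    # Haversine great-circle distance in kilometres.
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = math.radians(lat2 - lat1)
    dlmb = math.radians(lon2 - lon1)
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlmb / 2) ** 2
    return 6371 * 2 * math.asin(math.sqrt(a))

def newsworthy(quake, home, local_radius_km=100, global_magnitude=6.0):
    """Return which audience a quake report should go to, if any."""
    if quake["magnitude"] >= global_magnitude:
        return "global"
    if distance_km(quake["lat"], quake["lon"], home["lat"], home["lon"]) <= local_radius_km:
        return "local"
    return None

quake = {"lat": 50.7, "lon": -1.3, "magnitude": 3.2}
home = {"lat": 50.69, "lon": -1.29}  # Isle of Wight-ish
print(newsworthy(quake, home))  # → local
```

A small tremor nearby gets a local report, a big quake anywhere gets a wide one, and everything else is filtered out – which is most of what a storybot’s “news values” amount to.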
It’s also easy enough to create your own simple storybot (or at least, an “announcer bot”) using something like IFTT that can take in an RSS feed and make a tweet announcement about each new item. A collection of simple twitterbots produced as part of a journalism course on storybots, along with code examples, can be found here: A classroom experiment in Twitter Bots and creativity. Here’s another example, for a responsive weatherbot that tries to geolocate someone sending a message to the bot and respond to them with a weather report for their location.
Not being of a journalistic background, and never having read much on media or communications theory, I have to admit I don’t really have a good definition for what angles are, or a typology for them in different topic areas, and I’m struggling to find any good structural reviews of the idea, perhaps because it’s so foundational? For now, I’m sticking with a definition of “an angle” as being something along the lines of the thing you want to focus on and dig deeper around within the story (the thing you want to know more about or whose story you want to tell; this includes abstract things: the story of an indicator value, for example, over time). The blogpost Framing and News Angles: What is Bias? contrasts angles with the notions of framing and bias. Entman, Robert M. “Framing: Towards clarification of a fractured paradigm.” McQuail’s reader in mass communication theory (1993): 390-397 [PDF] seems foundational in terms of the framing idea; De Vreese, Claes H. “News framing: Theory and typology.” Information design journal & document design 13.1 (2005): 51-62 [PDF] offers a review (of sorts) of some related literature; and Reinemann, C., Stanyer, J., Scherr, S., & Legnante, G. (2011). Hard and soft news: A review of concepts, operationalizations and key findings. Journalism, 13(2), pp221–239 [PDF] perhaps provides another way in to related literature? Bias is presumably implicit in the selection of any particular frame or angle? Blog posts such as What makes a press release newsworthy? It’s all in the news angle look to be linkbait, perhaps even stolen content (e.g. here’s a PDF), but I can’t offhand find a credible source or inspiration for the original list? Resource packs like this one on Working with the Media, from the FAO, give a crash course in what I guess are some of the generally taught basics around story construction?
In the WordPress editor I’m currently writing in, I’m using a Text view that lets me write vanilla HTML; but there is also a WYSIWYG (what you see is what you get) view that shows how the interpreted HTML text will look when it is rendered in the browser as a web page.
Reflecting on IPython Markdown Opportunities in IPython Notebooks and Rstudio, it struck me that the Rmd (Rmarkdown) view used in RStudio, the HTML preview of “executed” Rmd documents generated from Rmd by knitr and the interactive Jupyter (IPython, as was) notebook view can be seen as standing in this sort of relation to each other:
From that, it’s not too hard to imagine RStudio offering the following sort of RStudio/IPython notebook hybrid interface – with an Rmd “text” view, and with a notebook “visual” view (eg via an R notebook kernel):
And from both, we can generate the static HTML preview view.
In terms of underlying machinery, I guess we could have something like this:
I’m looking forward to it:-)
- Get in a ghost writer to write your publications for you; inspired by the old practice of taking first authorship on work done by your research assistant/postgrad etc etc…
- Use your influence – tell your research assistant/postgrad/unpublished colleague that you’ll add your name to their publication and your chums in the peer review pool will let it through;
- Comment on everything anyone sends you or tells you – where possible, make changes inline to any document that anyone sends you (rather than commenting in the margins) – and make it so difficult to disentangle what you’ve added that they’re forced to give you an author credit. Alternatively, where possible, make structural changes to the organisation of a paper early on so that other authors think you’ve contributed more than you have… Reinforce these by mentioning “the paper you’re writing with X” to everyone else, so they think it actually is a joint paper;
- Give your work away because you’re too lazy to write it up – start a mentoring or academic writing scheme, write half baked, unfinished articles and get unpublished academics or academics playing variants of the games above to finish them off for you.
- Identify someone who has a lot of started but not quite finished papers and offer to help bring them to completion in exchange for an authorship credit.
Note that some of the options may be complementary and allow two people to exploit each other…
Trying to clear my head of code on a dog walk, after a couple of days tinkering with the nomis API, I started to ponder what an API is good for.
Chris Gutteridge and Alex Dutton’s open data excuses bingo card and Owen Boswarva’s Open Data Publishing Decision Tree both suggest that not having an API can be used as an excuse for not publishing a dataset as open data.
So what is an API good for?
I think one naive view is that this is what an API gets you…
It doesn’t of course, because folk actually want this…
Which is not necessarily that easy even with an API:
For a variety of reasons…
Even when the discovery part is done and you think you have a reasonable idea of how to call the API to get the data you want out of it, you’re still faced with the very real practical problem of how to actually get the data into the analysis environment in a form that you can work on in that environment. Just because you publish standards-based SDMX-flavoured XML doesn’t mean anything to anybody if they haven’t got an “import from SDMX-flavoured XML directly into some format I know how to work with” option.
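For example, if an API can be coaxed into returning CSV, the whole “import from X” step collapses into a single pandas call. The response below is simulated with an in-memory string rather than a real API URL, just to show the shape of the step:

```python
import io
import pandas as pd

# Simulated API response - in practice this would be the text body
# returned by the API for a request that asks for a CSV rendering
csv_response = io.StringIO(
    "area,year,value\n"
    "E12000001,2014,123\n"
    "E12000002,2014,456\n"
)

# The whole "import from X" step, collapsed into one call:
# from response straight to a workable dataframe
df = pd.read_csv(csv_response)
print(df)
```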
And even then, once the data is in, the problems aren’t over…
(I’m assuming the data is relatively clean and doesn’t need any significant amount of cleaning, normalising, standardising, type-casting, date parsing etc etc. Which is of course likely to be a nonsense assumption;-)
So what is an API good for, and where does it actually exist?
I’m starting to think that for many users, if there isn’t some equivalent of an “import from X” option in the tool they are using or environment they’re working in, then the API-for-X is not really doing much of a useful job for them.
Also, if there isn’t a discovery tool they can use from the tool they’re using or environment they’re working in, then finding data from service X turns into another chore that takes them out of their analysis context and essentially means that the API-for-X is not really doing much of a useful job for them.
What I tried to do in doodling the Python / pandas Remote Data Access Wrapper for the Nomis API was to create some tools for myself that would help me discover various datasets on the nomis platform from my working environment – an IPython notebook – and then fetch any data I wanted from the platform into that environment in a form in which I could immediately start working with it – which is to say, typically as a pandas dataframe.
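The shape of such a wrapper might be sketched like this – to be clear, the class, method names and local dataset catalogue here are illustrative assumptions, not the real nomis API surface or my actual wrapper code:

```python
import pandas as pd

class NomisWrapper:
    """Hypothetical sketch of a discovery-plus-access wrapper.

    The method names and the idea of a locally searchable dataset
    catalogue are illustrative assumptions, not the real nomis API.
    """

    def __init__(self, catalogue):
        # One row per dataset: an id and a human readable description
        self.catalogue = catalogue

    def search(self, term):
        """Discovery: find candidate datasets without leaving the notebook."""
        hits = self.catalogue["description"].str.contains(term, case=False)
        return self.catalogue[hits]

    def fetch(self, dataset_id):
        """Access: return the data ready to work with, as a dataframe."""
        # A real wrapper would call the platform's data endpoint here
        # and hand back pd.read_csv(...) on the response
        raise NotImplementedError("network access stubbed out in this sketch")

catalogue = pd.DataFrame([
    {"id": "NM_1", "description": "Claimant count"},
    {"id": "NM_2", "description": "Population estimates"},
])
api = NomisWrapper(catalogue)
print(api.search("population"))
```

The point is that both halves of the conversation – discovery and access – happen inside the notebook, and the access half ends in a dataframe rather than a blob of XML.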
I haven’t started trying to use it properly yet – and won’t get a chance to for a week or so at least now – but that was the idea. That is, the wrapper should support the discovery and access parts of the conversation I want to have with the nomis data from within my chosen environment. That’s what I want from an API. Maybe?!;-)
And note further – this does not mean that things like a pandas Remote Data Access plugin or a CRAN package for R (such as the World Bank Development Indicators package, or any of the other data/API packages referenced from the rOpenSci packages list) should be seen as extensions of the API. At worst, they should be seen as projections of the API into user environments. At best, it is those packages that should be seen as the actual API.
APIs for users – not programmers. That’s what I want from an API.
PS See also this response from @apievangelist: The API Journey.
A couple of days ago, I came across a dataset on figshare (a data sharing site) detailing the article processing charges (APCs) paid by the University of Portsmouth to publishers in 2014. After I casually (lazily…;-) remarked on the existence of this dataset via Twitter, Owen Stephens/@ostephens referred me to a JISC project that is looking at APCs in more detail, with prototype data explorer here: All APC demonstrator [Github repository].
The project looks as if it is part of Jisc Collections’ look at the Total Cost of Ownership in the context of academic publishing, summing things like journal subscription fees alongside “article processing charges” (which I’d hope include page charges?).
If you aren’t in academia, you may not realise that what used to be referred to as ‘vanity publishing’ (paying to get your first novel or poetry collection published) is part of the everyday practice of academic publishing. But it isn’t called that, obviously, because your work also has to be peer reviewed by other academics… So it’s different. It’s “quality publishing”.
Peer review is, in part, where academics take on the ownership of the quality aspects of academic publishing, so if the Total Cost of Ownership project is trying to be relevant to institutions and not just to JISC, I wonder if there should also be columns in the costing spreadsheet relating to the work time academics spend reviewing other people’s articles, editing journals, and so on. There are also the presentational costs: you can’t just write a paper and submit it, you have to submit it in an appropriately formatted document and “camera ready” layout, which can add a significant amount of time to preparing a paper for publication. So you do the copy editing and layout too. And so any total costing to an academic institution of the research publishing racket should probably include this time as well. But that’s by the by.
The data that underpins the demonstrator application was sourced from a variety of universities and submitted in spreadsheet form. A useful description (again via @ostephens) of the data model can be found here: APC Aggregation: Data Model and Analytical Usage. Looking at it, it just seems to cover APCs.
APC data relating to the project can be found on figshare. I haven’t poked around in the demonstrator code or watched its http traffic to see if there are API calls onto the aggregated data that provide another way in to it.
As well as page charges, there are charges associated with subscription fees to publishers. Publishers don’t like this information getting out on grounds of commercial sensitivity, and universities don’t like publishing it presumably on grounds of bringing themselves into disrepute (you spend how much?!), but there is some information out there. Data from a set of FOI requests about journal subscriptions (summarised here), for example. If you want to wade through some of the raw FOI responses yourself, have a look on WhatDoTheyKnow: FOI requests: “journal costs”.
Tim Gowers also wrote compellingly about his FOI escapades trying to track down journal subscription cost data: Elsevier journals – some facts.
This is all very well, but is it in any way useful? I have no idea. One thing I imagined might be quite amusing to explore was the extent to which journal subscriptions paid their way (or were “cost effective”). For example, looking at institutional logs, how often are (articles from) particular journals being accessed or downloaded, either for teaching or research purposes? (Crudely: teaching – access comes from a student account; research – access from a research account.) On the other hand, for the research outputs of the institution, how many things are being published into a particular journal, and how many citations appear in those outputs to other publications?
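With made-up numbers, that crude “does a journal pay its way?” sum might run along these lines – fee divided by some total of recorded uses, where “use” is the rough proxy described above:

```python
# Made-up numbers throughout - the point is the shape of the sum, not the values
journals = {
    # name: (annual fee in pounds, downloads, papers published into it,
    #        citations to it from the institution's own outputs)
    "Journal A": (12000, 3000, 12, 450),
    "Journal B": (9000, 150, 0, 10),
}

# Crude cost-per-use: subscription fee divided by total recorded "uses"
cost_per_use = {}
for name, (fee, downloads, pubs, cites) in journals.items():
    uses = downloads + pubs + cites
    cost_per_use[name] = fee / uses
    print(f"{name}: £{cost_per_use[name]:.2f} per use")
```

A barely touched journal with a middling fee comes out far more expensive per use than a heavily used one with a larger fee – which is exactly the sort of contrast a recreational data exercise could surface.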
If we take the line that use demonstrates value, and that use is captured as downloads, publications into, or references to a journal, then we at least have a crude way in. (That’s very crude, but then I’m approaching this as a possible recreational data exercise, not a piece of formal research. And yes – I know, journals are often bundled up in subscription packages together, and just like Sky blends dross with desirable channels in its subscription deals, I suspect academic publishers do too… But then, we could start to check these based on whether particular journals in a bundle are ever accessed, ever referenced, ever published into within a particular organisation, etc. Citation analysis can also help here – for example, if 5 journals all heavily cite each other, and one publisher publishes 3 of those, it could make sense for them to bundle two of those journals into one package and the third into another, so that if you’re researching topics that are reported by heavily linked articles across those journals, you are essentially forced into subscribing to both packages. Without having a look at citation network analyses and subscription bundles, I can’t check that outlandish claim of course;-)
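The cluster-spotting half of that outlandish claim could at least be crudely tested. With made-up citation counts, journals that cite each other above some threshold in both directions fall out as a candidate “linked” cluster:

```python
# Made-up citation counts between ordered pairs of journals
citations = {
    ("J1", "J2"): 120, ("J2", "J1"): 95,
    ("J1", "J3"): 110, ("J3", "J1"): 80,
    ("J2", "J3"): 100, ("J3", "J2"): 90,
    ("J1", "J4"): 2,   ("J4", "J5"): 60,
}

def mutual_pairs(citations, threshold=50):
    """Journal pairs that cite each other above the threshold both ways."""
    pairs = set()
    for (a, b), n in citations.items():
        if n >= threshold and citations.get((b, a), 0) >= threshold:
            pairs.add(tuple(sorted((a, b))))
    return sorted(pairs)

# J1, J2 and J3 fall out as a heavily interlinked cluster; if a publisher
# split them across two packages, readers of that topic would need both
print(mutual_pairs(citations))
```

Checking the claim properly would then mean laying real subscription bundles over a real citation network, which is where the recreational exercise would start in earnest.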
Erm… that’s it…
PS see also Evaluating big deal journal bundles (via @kpfssport)
PPS for a view from the publishers’ side on the very real costs associated with publishing, as well as a view on how academia and business treat employment costs and “real” costs in rather contrasting ways, see Time is Money: Why Scholarly Communication Can’t Be Free.