Notes on Robot Churnalism, Part I – Robot Writers

In Some Notes on Churnalism and a Question About Two Sided Markets, I tried to pull together a range of observations about the process of churnalism, in which journalists propagate PR copy without much, if any, critique, contextualisation or corroboration.

If that view in any way represents a fair description of how some pre-packaged content, at least, makes its way through to becoming editorial content, where might the robots fit in? To what extent might we start to see “robot churnalism“, and what form or forms might it take?

There are two particular ways in which we might consider robot churnalism:

  1. “robot journalists” that produce copy acts as a third conveyor belt complementary to PA-style wire and PR feedstocks;
  2. robot churnalists as ‘reverse’ gatekeepers, choosing what wire stories to publish where based on traffic stats and web analytics.

A related view is taken by Philip Napoli (“Automated media: An institutional theory perspective on algorithmic media production and consumption.” Communication Theory 24.3 (2014): 340-360; a shorter summary of the key themes can be found here) who distinguishes roles for algorithms in “(a) media consumption and (b) media production”. He further refines the contributions algorithms may make in media production by suggesting that “[t]wo of the primary functions that algorithms are performing in the media production realm at this point are: (a) serving as a demand predictor and (b) serving as content creator.”

Robot Writers

“Automated content can be seen as one branch of what is known as algorithmic news” writes Christer Clerwall (2014, Enter the Robot Journalist, Journalism Practice, 8:5, pp519-531), a key component of automated journalism “in which a program turns data into a news narrative, made possible with limited — or even zero — human input” (Matt Carlson (2015) The Robotic Reporter, Digital Journalism, 3:3, 416-431).

In a case study based around the activities of Narrative Science, a company specialising in algorithmically created, data driven narratives, Carlson further conceptualises “automated journalism” as “algorithmic processes that convert data into narrative news texts with limited to no human intervention beyond the initial programming”. He goes on:

The term denotes a split from data analysis as a tool for reporters encompassed in writings about “computational and algorithmic journalism” (Anderson 2013) to indicate wholly computer-written news stories emulating the compositional and framing practices of human journalism (ibid, p417).

Even several years ago, Arjen van Dalen observed that “[w]ith the introduction of machine-written news computational journalism entered a new phase. Each step of the news production process can now be automated: “robot journalists” can produce thousands of articles with virtually no variable costs” (The Algorithms Behind the Headlines, Journalism Practice, 6:5-6, 648-658, 2012, p649).

Sport and financial reporting examples abound from the bots of Automated Insights and Narrative Science (for example, Notes on Narrative Science and Automated Insights or Pro Publica: How To Edit 52,000 Stories at Once, and more recently e.g. Robot-writing increased AP’s earnings stories by tenfold), with robot writers generating low-cost content to attract page views, “producing content for the long tail, in virtually no time and with low additional costs for articles which can be produced in large quantities” (ibid, p649).

Although writing back in 2012, van Dalen noted in his report on “the responses of the journalistic community to automatic content creation” that:

[t]wo main reasons are mentioned to explain why automated content generation is a trend that needs to be taken seriously. First, the journalistic profession is more and more commercialized and run on the basis of business logics. The automation of journalism tasks fits in with the trend to aim for higher profit margins and lower production costs. The second reason why automated content creation might be successful is the quality of stories with which it is competing. Computer-generated news articles may not be able to compete with high quality journalism provided by major news outlets, which pay attention to detail, analysis, background information and have more lively language or humour. But for information which is freely available on the Internet the bar is set relatively low and automatically generated content can compete (ibid, p651).

As Christer Clerwall writes in Enter the Robot Journalist, (Journalism Practice, 8:5, 2014, pp519-531):

The advent of services for automated news stories raises many questions, e.g. what are the implications for journalism and journalistic practice, can journalists be taken out of the equation of journalism, how is this type of content regarded (in terms of credibility, overall quality, overall liking, to mention a few aspects) by the readers? p520.

van Dalen puts it thus:

Automated content creation is seen as serious competition and a threat for the job security of journalists performing basic routine tasks. When routine journalistic tasks can be automated, journalists are forced to offer a better product in order to survive. Central in these reflections is the need for journalists to concentrate on their own strengths rather than compete on the strengths of automated content creation. Journalists have to become more creative in their writing, offer more in-depth coverage and context, and go beyond routine coverage, even to a larger extent than they already do today (ibid, p653).

He then goes on to produce the following SWOT analysis to explore just how the humans and the robots compare:


One possible risk associated with the automated production of copy is that it becomes published without human journalistic intervention, and as such is not necessarily “known”, or even read, by any member at all of the publishing organisation. To paraphrase Daniel Jackson and Kevin Moloney, “Inside Churnalism: PR, journalism and power relationships in flux”, Journalism Studies, 2015, this would represent an extreme example of churnalism in the sense of “the use of unchecked [robot authored] material in news”.

This is dangerous, I think, on many levels. The more we leave the setting of the news agenda and the identification of news values to machines, the more we lose any sensitivity to what’s happening in the world around us and what stories are actually important to an audience as opposed to merely being Like-bait titillation. (As we shall see, algorithmic gatekeepers that channel content to audiences based on various analytics tools respond to one definition of what audiences value. But it is not clear that these are necessarily the same issues that might weigh more heavily in a personal-political sense. Reviews of the notion of “hard” vs. “soft” news (e.g. Scherr, S., & Legnante, G. (2011). Hard and soft news: A review of concepts, operationalizations and key findings. Journalism, 13(2) pp221–239)) may provide lenses to help think about this more deeply?)

Of course, machines can also be programmed to look for links and patterns across multiple sources of information and at far greater scale than a human journalist could hope to cover, but we are then in danger of creating some sort of parallel news world, where events are only recognised, “discussed” and acted upon by machines and human actors are oblivious to them. (For an example, The Wolf of Wall Tweet: A Web-reading bot made millions on the options market. It also ate this guy’s lunch that describes how bots read the news wires and trade off the back them. They presumably also read wire stories created by other bots…)

So What It Is That Robot Writers Actually Do All Day?

In a review of Associated Press’ use of Automated Insight’s Wordsmith application (In the Future, Robots Will Write News That’s All About You), Wired reported that Wordsmith “essentially does two things. First, it ingests a bunch of structured data and analyzes it to find the interesting points, such as which players didn’t do as well as expected in a particular game. Then it weaves those insights into a human readable chunk of text.”

One way of getting deeper into the mind of a robot writer is to look to the patents held by the companies who develop such applications. For example, in The Anatomy of a Robot Journalist, one process used by Narrative Science is characterised as follows:


Identifying newsworthy features (or story points) is a process of identifying features and then filtering out the ones that are somehow notable. Angles are possibly defined as in terms of sets of features that need to be present within a particular dataset for that angle to provide a possible frame for story. The process of reconciling interesting features with angle points populates the angle with known facts, and a story engine then generates the natural language text within a narrative structure suited to an explication of the selected angle.

(An early – 2012 – presentation by Narrative Science’s Larry Adams also reviews some of the technicalities: Using Open Data to Generate Personalized Stories.)

In actual fact, the process may be a relatively straightforward one, as demonstrated by the increasing numbers of “storybots” that populate social media. One well known class of examples are earthquake bots that tweet news of earthquakes (see also: When robots help human journalists: “This post was created by an algorithm written by the author”). (It’s easy enough to see various newsworthiness filters might work here: a geo-based one for reporting a story locally, a wider interest one for reporting an earthquake above a particular magnitude, and so on.)

It’s also easy enough to create your own simple storybot (or at least, an “announcer bot”) using something like IFTT that can take in an RSS feed and make a tweet announcement about each new item. A collection of simple twitterbots produced as part of a journalism course on storybots, along with code examples, can be found here: A classroom experiment in Twitter Bots and creativity. Here’s another example, for a responsive weatherbot that tries to geolocate someone sending a message to the bot and respond to them with a weather report for their location.

Not being of a journalistic background, and never having read much on media or communications theory, I have to admit I don’t really have a good definition for what angles are, or a typology for them in different topic areas, and I’m struggling to find any good structural reviews of the idea, perhaps because it’s so foundational? For now, I’m sticking with a definition of “an angle” as being something along the lines of the thing you want focus on and dig deeper around within the story (the thing you want to know more about or whose story you want to tell; this includes abstract things: the story of an indicator value for example, over time). The blogpost Framing and News Angles: What is Bias? contrasts angles with the notions of framing and bias. Entman, Robert M. “Framing: Towards clarification of a fractured paradigm.” McQuail’s reader in mass communication theory (1993): 390-397 [pdf] seems foundational in terms of the framing idea, De Vreese, Claes H. “News framing: Theory and typology.” Information design journal & document design 13.1 (2005): 51-62 [PDF] offers a review (of sorts) of some related literature, and Reinemann, C., Stanyer, J., Scherr, S., & Legnante, G. (2011). Hard and soft news: A review of concepts, operationalizations and key findings. Journalism, 13(2) pp221–239 (PDF) perhaps provides another way in to related literature? Bias is presumably implicit in the selection of any particular frame or angle? Blog posts such as What makes a press release newsworthy? It’s all in the news angle look to be linkbait, perhaps even stolen content (eg here’s a PDF), but I can’t offhand find a credible source or inspiration for the original list? Resource packs like this one on Working with the Media from the FAO gives a crash course into what I guess are some of the generally taught basics around story construction?

PS On the “related / to read” pile: Mary Lynn Young & Alfred Hermida (2015) From Mr. and Mrs. Outlier To Central Tendencies, Digital Journalism, 3:3, 381-397, DOI: 10.1080/21670811.2014.976409

See also: Notes on Robot Churnalism, Part II – Robots in the Journalism Workplace

Guardian Datastore MPs’ Expenses Spreadsheet as a Database

Continuing my exploration of what is and isn’t acceptable around the edges of doing stuff with other people’s data(?!), the Guardian datastore have just published a Google spreadsheet containing partial details of MPs’ expenses data over the period July-Decoember 2009 (MPs’ expenses: every claim from July to December 2009):

thanks to the work of Guardian developer Daniel Vydra and his team, we’ve managed to scrape the entire lot out of the Commons website for you as a downloadable spreadsheet. You cannot get this anywhere else.

In sharing the data, the Guardian folks have opted to share the spreadsheet via a link that includes an authorisation token. Which means that if you try to view the spreadsheet just using the spreadsheet key, you won’t be allowed to see it; (you also need to be logged in to a Google account to view the data, both as a spreadsheet, and in order to interrogate it via the visualisation API). Which is to say, the Guardian datastore folks are taking what steps they can to make the data public, whilst retaining some control over it (because they have invested resource in collecting the data in the form they’re re-presenting it, and reasonably want to make a return from it…)

But in sharing the link that includes the token on a public website, we can see the key – and hence use it to access the data in the spreadsheet, and do more with it… which may be seen as providing a volume add service over the data, or unreasonably freeloading off the back of the Guardian’s data scraping efforts…

So, just pasting the spreadsheet key and authorisation token into the cut down Guardian datastore explorer script I used in Using CSV Docs As a Database to generate an explorer for the expenses data.

So for example, we can run for example run a report to group expenses by category and MP:

MP expesnes explorer

Or how about claims over 5000 pounds (also viewing the information as an HTML table, for example).

Remember, on the datastore explorer page, you can click on column headings to order the data according to that column.

Here’s another example – selecting A,sum(E), where E>0 group by A and order is by sum(E) then asc and viewing as a column chart:

Datastore exploration

We can also (now!) limit the number of results returned, e.g. to show the 10 MPs with lowest claims to date (the datastore blog post explains that why the data is incomplete and to be treated warily).

Limiting results in datstore explorer

Changing the asc order to desc in the above query gives possibly a more interesting result, the MPs who have the largest claims to date (presumably because they have got round to filing their claims!;-)

Datastore exploring

Okay – enough for now; the reason I’m posting this is in part to ask the question: is the this an unfair use of the Guardian datastore data, does it detract from the work they put in that lets them claim “You cannot get this anywhere else”, and does it impact on the returns they might expect to gain?

Sbould they/could they try to assert some sort of database collection right over the collection/curation and re-presentation of the data that is otherwise publicly available that would (nominally!) prevent me from using this data? Does the publication of the data using the shared link with the authorisation token imply some sort of license with which that data is made available? E.g. by accepting the link by clicking on it, becuase it is a shared link rather than a public link, could the Datastore attach some sort of tacit click-wrap license conditions over the data that I accept when I accept the shared data by clicking through the shared link? (Does the/can the sharing come with conditions attached?)

PS It seems there was a minor “issue” with the settings of the spreadsheet, a result of recent changes to the Google sharing setup. Spreadsheets should now be fully viewable… But as I mention in a comment below, I think there are still interesting questions to be considered around the extent to which publishers of “public” data can get a return on that data?

Using Google Spreadsheets Like a Database – The QUERY Formula

In this year’s student satisfaction tables, which universities have a good teaching score but low employment prospects? How would you find out? In this post, you’ll find out…

Whether or not it was one of my resolutions, one of the things I want to do more this year is try to try to make more use of stuff that’s already out there, and come up with recipes that hopefully demonstrate to others how to make use of those resources.

So today’s trick is prompted by a request from @paulbradshaw about “how to turn a spreadsheet into a form-searchable database for users” within a Google spreadsheet (compared to querying a google spreadsheet via a URI, as described in Using Google Spreadsheets as a Database with the Google Visualisation API Query Language).

I’m not going to get as far as the form bit, but here’s how to grab details from a Google spreadsheet, such as one of the spreadsheets posted to the Guardian Datastore, and query it as if it was a database in the context of one of your own Google spreadsheets.

This trick actually relies on the original Google spreadsheet being shared in “the right way”, which for the purposes of this post we’ll take to mean – it can be viewed using a URL of the form:

(The &hl=en on the end is superfluous – it doesn’t matter if it’s not there…) The Guardian Datastore folks sometimes qualify this link with a statement of the form Link (if you have a Google Docs account).

If the link is of the form:
just change pub to ccc

So for example, take the case of the 2010-2011 Higher Education tables (described here):

The first thing to do is to grab a copy of the data into our own spreadsheet. So go to Google Docs, create a new spreadsheet, and in cell A1 enter the formula:
=ImportRange(“reBYenfrJHIRd4voZfiSmuw”,”Institutional Table!A1:K118″)

When you hit return, the spreadsheet should be populated with data from the Guardian Datastore spreadsheet.

So let’s see how that formula is put together.
=ImportRange(“reBYenfrJHIRd4voZfiSmuw”,”Institutional Table!A1:K118″)

Firstly, we use the =ImportRange() formula, which has the form:

This says that we want to import a range of cells from a sheet in another spreadsheet/workbook that we have access to (such as one we own, one that is shared with us in an appropriate way, or a public one). The KEY is the key value from the URL of the spreadsheet we want to import data from. The SHEET is the name of the sheet the data is on:

The RANGE is the range of the cells we want to copy over from the external spreadsheet.

Enter the formula into a single cell in your spreadsheet and the whole range of cells identified in the specified sheet of the original spreadsheet will be imported to your spreadsheet.

Give the sheet a name (I called mine ‘Institutional Table 2010-2011’; the default would be ‘Sheet1’).

Now we’re going to treat that imported data as if it was in a database, using the =QUERY() formula.

Create a new sheet, call it “My Queries” or something similar and in cell A1 enter the formula:

=QUERY(‘Institutional Table 2010-2011’!A1:K118,”Select A”)

What happens? Column A is pulled into the spreadsheet is what. So how does that work?

The =QUERY() formula, which has the basic form =QUERY(RANGE,DATAQUERY), allows us to run a special sort of query against the data specified in the RANGE. That is, you can think of =QUERY(RANGE,) as specifying a database; and DATAQUERY as a database query language query (sic) over that database.

So what sorts of DATAQUERY can we ask?

The simplest queries are not really queries at all, they just copy whole columns from the “database” range into our “query” spreadsheet.

So things like:

  • =QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C”) to select column C;
  • =QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C,D,G,H”) to select columns C, D, G and H;

So looking at copy of the data in our spreadsheet, import the columns relating to the Institution, Average Teaching Score, Expenditure per Student and Career Prospects, I’d select columns C, D, F and H:

like this:
=QUERY(‘Institutional Table 2010-2011’!A1:K118,“Select C,D, F,H”)
to give this:

(Remember that the column labels in the query refer to the spreadsheet we are treating as a database, not the columns in the query results sheet shown above.)

All well and good. But suppose we only want to look at institutions with a poor teaching score (column D), less than 40? Can we do that too? Well, yes, we can, with a query of the form:

“Select C,D, F,H where D < 40"

(The spaces around the less than sign are important… if you don’t include them, the query may not work.)

Here’s the result:

(Remember, column D in the query is actually the second selected column, which is placed into column B in the figure shown above.)

Note that we can order the results according to other columns to. So for example, to order the results according to increasing expenditure (column F), we can write:

“Select C,D, F,H where D < 40 order by F asc"

(For decreasing order, use desc.)

Note that we can run more complex queries too. So for example, if we want to find institutions with a high average teaching score (column D) but low career prospects (column H) we might ask:

“Select C,D, F,H where D > 70 and H < 70"

And so on…

Over the nect week or two, I’ll post a few more examples of how to write spreadsheet queries, as well as showing you a trick or two about how to build a simple form like interface within the spreadsheet for constructing queries automatically; but for now, why try having a quick play with the =QUERY() formula yourself?

My Presentation for News:Rewired – Doing the Data Mash

For once, I didn’t put links into a presentation, so here instead are the link resources for my News:Rewired presentation:

(If I get a chance over the next week or so, I may even try to make a slidecast out of the above…)

The link story for the presentation goes something like this:

If there’s something “dataflow” related you’d like see explored here, please leave a request as a comment and I’ll see what I can do :-) I’ve also started a newsrw) category (view it here) which I’ll start posting relevant content to; (see also the datajourn tag).

Tinkering With Timetric – London Datastore Borough Population Data

Earlier this week, the Greater London Authority opened up a preview/prototype of the London Datastore (news reports).

Many of the data sets I quickly looked at are being made available as CSV and XML data feeds, which is very handy :-)

Anyway, in preparation for having some new recipes to drop into conversation at News:Rewired next week, I thought I’d have a quick play with visualising some of the timeseries data in Timetric to see what sorts of “issues” it might throw up.

So how does Timetric like to import data? There are three main ways – copy’n’paste, import a spreadsheet (CSV or XLS) from your desktop, or grab the data from a URL.

Obviously, the online route appeals to me:-)

Secondly, how does Timetirc expect the data to be formatted? At the moment, quite rigidly, it seems:

To publish data in a format Timetric can understand, you should expose it in one of two formats — either CSV or Excel (.xls) format. Dates/times must be in the first column of the file, and values in the second.

For importing CSV data, the date/time should be in W3C datetime format (like 2009-09-14 17:34:00) or, optionally, Unix timestamps as floating point numbers.

Hmmm, so at the moment, I can only import on time series at a time, unless I’m a geeky hacker type and know how to “write a programme” to upload multiple sets of data from a multi-column file via the API… But isn’t about that, right?!;-)

Let’s look at some of the London datastore data, anyway. How about this – “Historic Census Population” data.

Let’s preview the data in a Google spreadsheet – use the formula:

Ok – so we have data for different London Boroughs, for every decade since 1801. But is the data in the format that Timetric wants?

– first no: the dates run across columns rather than down rows.

So we need to swap rows with columns somehow. We can do this in a Google spreadsheet with the TRANSPOSE formula. While we’re doing the transposition, we might as well drop the Area Code column and just use the area/borough names. In a new sheet, use the formula:
=TRANSPOSE( ‘Original Data’!A1:W )
(Note, I’d renamed the sheet containing the imported data as Original Data; typically it would be Sheet1, by default.)

NB It seems I could have combined the import and transpose formulae:

Now we hit the second no: the dates are in the wrong format.

Remember, for Timetric “the date/time should be in W3C datetime format (like 2009-09-14 17:34:00) or, optionally, Unix timestamps as floating point numbers.”

My fudge here was to copy all the data except the time data to a new sheet, and just add the time data by hand, using a default day/month/time of midnight first January of the appropriate year. Note that this is not good practice – the data in this sheet is now not just a representation of the original data, it’s been messed around with and the data field is not the original one, nor even derived from the original one (I don’t think Google spreadsheets has a regular expression search/replace formula that would allow me to do this?)

Anyway, that’s as may bee;-). To keep the correct number format (Google spreadsheets will try to force a different representation of the date), the format of the date cells needs to be set explicitly:

So now we have the data in rows, with the correct data format, the dates being added by hand. Remembering that Timetric can only import one time series at a time, let’s try with the first data set. We can grab the CSV for the first two columns as follows – from the Share Menu, “Publish as Web Page” option, choose the following settings:

(The ‘for timetric’ sheet is the sheet with the tidied date field.)

Here’s the CSV URI, that we can use to get the data in Timetric:

The upload took a good couple of minutes, with no reassuring user notifications (just the browser appearing to hang waiting for a new timetric page to load), but evntually it got there…

(And yes, that drop in population is what the data says – though for all the other boroughs you get a curve shaped more as you’d expect;-)

To import other data sets, we need to insert a new Date column, along with dat data (I copied it from the first Dat column I’d created) and then grab the CSV URI for the appropriate columns:

Anyway, there we have it – a recipe (albeit a slightly messy one) for getting CSV data out of the London datastore, into a Google spreadsheet, transposing its rows and columns, and then generating date information formatted just how Timetric likes it, before grabbing a new CSV data feed out of the spreadsheet and using it to import data into Timetric.