Category: Thinkses

Fragments – Should Algorithms, Deep Learning AI Models and/or Robots be Treated as Employees?

Ever ones for pushing their luck when it comes to respecting codes, regulations, and maybe even the law, Uber hit the news again when it turned out that at least one of its automated vehicles was caught running a red light (Uber blames humans for self-driving car traffic offenses as California orders halt).

My first thought on this was to wonder who’s nominally in control of an automated vehicle if the vehicle itself detects a likely accident and quickly hands control over to a human driver (“Sh********t… you-drive”), particularly if reaction times are such that even the most attentive human operator doesn’t have time to respond properly.

My second was to start pondering the agency associated with an algorithm, particularly a statistical one where the mapping from inputs to outputs is not necessarily known in advance but is based on an expectation that the model used by the algorithm will give an “appropriate” response based on the training and testing regime.

[This is a v quick and poorly researched post, so some of the references are first stabs as I go fishing… They could undoubtedly be improved upon… If you can point me to better readings (though many appear to be stuck in books), please add them to the comments…]

In the UK, companies can be registered as legal entities; as such, they can act as an employer and become “responsible” for the behaviour of their employees through vicarious liability ([In Brief] Vicarious Liability: The Liability of an Employer).

According to solicitor Liam Lane, writing for HR Magazine (Everything you need to know about vicarious liability):

Vicarious liability does not apply to all staff. As a general rule, a business can be vicariously liable for actions of employees but not actions of independent contractors such as consultants.

This distinction can become blurred for secondees and agency workers. In these situations, there are often two ‘candidates’ for vicarious liability: the business that provided the employee, and the business that received him or her. To resolve this, courts usually ask:

(i) which business controls how the employee carries out his or her work; and

(ii) which business the employee is more integrated into.

Similarly, vicarious liability does not apply to every wrongful act that an employee carries out. A business is only vicariously liable for actions that are sufficiently close to what the employee was employed to do.

The CPS guidance on corporate prosecutions suggests that “A corporate employer is vicariously liable for the acts of its employees and agents where a natural person would be similarly liable (Mousell Bros Ltd v London and North Western Railway Co [1917] 2 KB 836).”

Findlaw (and others) also explore limitations, or otherwise, to an employer’s liability by noting the distinction between an employee’s “frolics and detours”: A detour is a deviation from explicit instructions, but so related to the original instructions that the employer will still be held liable. A frolic on the other hand, is simply the employee acting in his or her own capacity rather than at the instruction of an employer.

Also on limitations to employer liability, a briefing note from Gaby Hardwicke Solicitors (Briefing Note: Vicarious Liability and Discrimination) brings all sorts of issues to mind. Firstly, on scope:

Vicarious liability may arise not just where an employment contract exists but also where there is temporary deemed employment, provided either the employer has an element of control over how the “employee” carries out the work or where the “employee” is integrated into the host’s business. The employer can be vicariously liable for those seconded to it and for temporary workers supplied to it by an employment business. In Hawley v Luminar Leisure Ltd, a nightclub was found vicariously liable for serious injuries as a result of an assault by a doorman, which it engaged via a third party security company.

The Equality Act 2010 widens the definition of employment for the purposes of discrimination claims so that the “employer” is liable for anything done by its agent, under its authority, whether within its knowledge or not. This would, therefore, confer liability for certain acts carried out by agents such as solicitors, HR consultants, accountants, etc.

In addition, the briefing describes how “Although vicarious liability is predominantly a common law concept, for the purposes of anti-discrimination law, it is enshrined in statute under section 109 Equality Act 2010. This states that anything that is done by an individual in the course of his employment, must be treated as also done by the employer, unless the employer can show that it took all reasonable steps to prevent the employee from doing that thing or from doing anything of that description.”

So, I’m wondering… Companies act through their employees and agents. To what extent might we start to see “algorithms” and/or “computational models” (eg trained “Deep Learning”/neural network models) starting to be treated as legal entities in their own right, at least in so far as they may be identified qua employees or agents when it comes to acting on behalf of a company? When one company licenses an algorithm/model to another, how will any liability be managed? Will algorithms and models start to have their own (employment) agents? Or are statistical models and algorithms (with parameters set) actually just patentable inventions, albeit with very specifically prescribed dimensions and attribute values?

In terms of liability, companies currently seem keen to try wriggling out of accountability by waving their hands in the air when an algorithm goes wrong. But where does the accountability lie, and where does the agency lie (in the sense that algorithms and models make (automated) decisions on behalf of their operators)? Are there any precedents around ruling decision making algorithms as something akin to “employees” when it comes to liability? Or companies arguing for (or against) such claims? Can an algorithm be defined as a company, with its articles and objects enshrined in code, and if so, how is liability then limited as far as its directors are concerned?

I guess what I’m wondering is: are we going to see algorithms/models/robots becoming entities in law? Whether defined as their own class of legal entity, companies, employees, agents, or some other designation?

PS pointers to case law and examples much appreciated. Eg this sort of thing is maybe relevant? Search engine liability for autocomplete suggestions: personality, privacy and the power of the algorithm in a discussion of how an algorithm operator is liable for the actions of the algorithm?

From Linked Data to Linked Applications?

Pondering how to put together some Docker IPython magic for running arbitrary command line functions in arbitrary docker containers (this is as far as I’ve got so far), I think the commands must include a couple of things:

  1. the name of the container (perhaps rooted in a particular repository): psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example;
  2. the actual command to be called: for example, one of the contentmine commands: getpapers -q {QUERY} -o {OUTPUTDIR} -x

We might also optionally specify mount directories with the calling and called containers, using a conventional default otherwise.
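As a rough sketch of what such a magic might do under the hood, the following builds a `docker run` command line from an image name, a command string and some optional mounts. The function name `build_docker_command`, the `default_mount` convention and the example directories are all assumptions made for illustration; only the image name and the getpapers command come from the list above.

```python
import shlex


def build_docker_command(image, command, mounts=None, default_mount=None):
    """Return the argv list for running `command` in a throwaway container.

    image         -- container image name, e.g. 'psychemedia/contentmine'
    command       -- command line to run inside the container, as one string
    mounts        -- optional {host_dir: container_dir} mapping
    default_mount -- (host_dir, container_dir) used when no mounts are given
    """
    argv = ["docker", "run", "--rm"]  # --rm: dispose of the container afterwards
    if not mounts and default_mount:
        mounts = {default_mount[0]: default_mount[1]}
    for host_dir, container_dir in (mounts or {}).items():
        argv += ["-v", "{}:{}".format(host_dir, container_dir)]
    argv.append(image)
    argv += shlex.split(command)
    return argv


# Hypothetical query and directories, filled into the command template
cmd = build_docker_command(
    "psychemedia/contentmine",
    "getpapers -q {QUERY} -o {OUTPUTDIR} -x".format(QUERY="aardvark", OUTPUTDIR="/out"),
    mounts={"/tmp/contentmine": "/out"},
)
print(" ".join(shlex.quote(part) for part in cmd))
```

The resulting list could be handed to `subprocess.run` on a machine with Docker installed; building it as a list rather than a single string avoids shell quoting problems with the query.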

This got me thinking that the called functions might be viewed as operating in a namespace (psychemedia/contentmine or dockerhub::psychemedia/contentmine, for example). And this in turn got me thinking about “big-L, big-A” Linked Applications.

According to Tim Berners-Lee’s four rules of Linked Data, the web of data should:

  1. Use URIs as names for things
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information, using the standards (RDF*, SPARQL)
  4. Include links to other URIs, so that they can discover more things.

So how about a web of containerised applications, that would:

  1. Use URIs as names for container images
  2. Use HTTP URIs so that people can look up those names.
  3. When someone looks up a URI, provide useful information (in the minimal case, this corresponds to a Dockerhub page for example; in a user-centric world, this could just return a help file identifying the commands available in the container, along with help for individual commands)
  4. Include a Dockerfile, so that they can discover what the application is built from (also may link to other Dockerfiles).
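To make the first two rules a little more concrete, here is a minimal sketch that maps a namespaced image name onto an HTTP URI that can be looked up. The `registry::path` notation follows the examples earlier in this post; the function name `image_uri` and the idea of a registry lookup table are assumptions for illustration, with Docker Hub repository pages as the only registry covered.

```python
# Base URIs for known registries; only Docker Hub repository pages are covered here
REGISTRY_BASES = {"dockerhub": "https://hub.docker.com/r/"}


def image_uri(name, default_registry="dockerhub"):
    """Map 'registry::user/image' (or plain 'user/image') to an HTTP URI."""
    registry, sep, path = name.partition("::")
    if not sep:
        # No explicit registry given, so fall back on the default one
        registry, path = default_registry, name
    return REGISTRY_BASES[registry] + path


print(image_uri("dockerhub::psychemedia/contentmine"))
print(image_uri("psychemedia/contentmine"))
```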

Compared with Linked Data, where the idea is about relating data items one to another, the identifying HTTP URI actually represents the ability to make a call into a functional, execution space. Linkage into the world of linked web resources might be provided through Linked Data relations that specify that a particular resource was generated from an instance of a Linked Application or that the resource can be manipulated by an instance of a particular application.

So for example, files linked to on the web might have a relation that identifies the filetype, and the filetype is linked by another relation that says it can be opened in a particular linked application. Another file might link to a description of the workflow that created it, and the individual steps in the workflow might link to function/command identifiers that are linked to linked applications through relations that associate particular functions with a particular linked application.

Workflows may be defined generically, and then instantiated within a particular experiment. So for example: load file with particular properties, run FFT on particular columns, save output file becomes instantiated within a particular run of an experiment as load file with this URI, run the FFT command from this linked application on particular columns, save output file with this URI.
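One way of picturing the generic/instantiated distinction is as a template workflow whose placeholders are filled in for a particular run. The following is a sketch only: the step vocabulary, the placeholder names and the example URIs (including the `dockerhub::example/fft` application) are all made up for illustration.

```python
# A generic workflow: each step is described with named placeholders
GENERIC_WORKFLOW = [
    {"step": "load", "file": "{input_uri}"},
    {"step": "fft", "application": "{fft_app}", "columns": "{columns}"},
    {"step": "save", "file": "{output_uri}"},
]


def instantiate(workflow, bindings):
    """Fill the placeholders in every step with values for a particular run."""
    return [
        {key: value.format(**bindings) for key, value in step.items()}
        for step in workflow
    ]


# Hypothetical bindings for one run of an experiment
run = instantiate(
    GENERIC_WORKFLOW,
    {
        "input_uri": "http://example.org/data/run1.csv",
        "fft_app": "dockerhub::example/fft",
        "columns": "x,y",
        "output_uri": "http://example.org/results/run1-fft.csv",
    },
)
for step in run:
    print(step)
```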

Hmm… thinks.. there is a huge amount of work already done in the area of automated workflows and workflow execution frameworks/environments for scientific computing. So this is presumably already largely solved? For example, Integrating Containers into Workflows: A Case Study Using Makeflow, Work Queue, and Docker, C. Zheng & D. Thain, 2015 [PDF]?

A handful of other quick points:

  • the model I’m exploring in the Docker magic context is essentially a stateless/serverless computing approach, where a commandline container is created on demand and treated in a disposable way to just run a particular function before being destroyed; (see also the OpenAPI approach).
  • The Linked Application notion extends to other containerised applications, such as ones that expose an HTML user interface over http that can be accessed via a browser. In such cases, things like WSDL (or WADL; remember WADL?) provided a machine readable formalised way of describing functional resource availability.
  • In the sense that commandline containerised Linked Applications are actually services, we can also think about web services publishing an http API in a similar way?
  • services such as Sandstorm, which have the notion of self-running containerised documents, have the potential to actually bind a specific document within an interactive execution environment for that document.

Hmmm… so how much nonsense is all of the above, then?

Some Idle Thoughts on Managing Temporal Posts in WordPress

Now that I’ve got a couple of my own WordPress blogs running off the back of my Reclaim Hosting account, I’ve started to look again at possible ways of tinkering with WordPress.

The first thing I had a look at was posting a draft WordPress post from a script.

Using a WordPress role editor plugin (e.g. along the lines of this User Role Editor) it’s easy enough to create a new role with edit and upload permissions only [WordPress roles and capabilities], and create a new ‘autoposter’ user with that role. Code like the following then makes it easy enough to upload an image to WordPress, grab the URL, insert it into a post, and then submit the post – where it will, by default, appear as a draft post:

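#Ish Via:
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.compat import xmlrpc_client
from wordpress_xmlrpc.methods import media, posts
from wordpress_xmlrpc.methods.posts import NewPost

# Placeholder connection details (assumed, not from the original snippet):
# use your own blog's xmlrpc.php endpoint and the 'autoposter' credentials.
wp = Client('https://example.com/xmlrpc.php', 'autoposter', 'PASSWORD')

def wp_simplePost(client,title='ping',content='pong, <em>pong</em>'):
    post = WordPressPost()
    post.title = title
    post.content = content
    # Submit the post (call reconstructed from the library's NewPost method)
    response = client.call(NewPost(post))
    return response

def wp_uploadImageFile(client,filename):

    mimes={'png':'image/png', 'jpg':'image/jpeg'}
    # Look up the mimetype from the file extension (assumed reconstruction)
    mimetype = mimes[filename.rsplit('.', 1)[-1].lower()]
    # prepare metadata
    data = {
            'name': filename,
            'type': mimetype,  # mimetype
    }

    # read the binary file and let the XMLRPC library encode it into base64
    with open(filename, 'rb') as img:
            data['bits'] = xmlrpc_client.Binary(img.read())

    # Upload the file; the response includes the 'url' of the uploaded image
    response = client.call(media.UploadFile(data))
    return response

def quickTest():
    txt = "Hello World"
    txt=txt+'<img src="{}"/><br/>'.format(wp_uploadImageFile(wp,'hello2world.png')['url'])
    return txt
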

Dabbling with this then got me thinking about the different sorts of things that WordPress allows you to publish in general. It seems to me that there are essentially three main types of thing you can publish:

  1. posts: the timestamped elements that appear in a reverse chronological order in a WordPress blog. Posts can also be tagged and categorised and viewed via a tag or category page. Posts can be ‘persisted’ at the top of the posts page by setting them as a “sticky” post.
  2. pages: static content pages typically used to contain persistent, unchanging content. For example, an “About” page. Pages can also be organised hierarchically, with child subpages defined relative to a specified ‘parent’ page.
  3. sidebar elements and widgets: these can contain static or dynamic content.

(By the by, a range of third party plugins appear to support the conversion of posts to pages, for example Post Type Switcher [untested] or the bulk converter Convert Post Types [untested].)

Within a page or a post, we can also include a shortcode element that can be used to include a small piece of templated text, or text generated from the execution of some custom code (which it seems could be python: running a python script from a WordPress shortcode). Shortcodes run each time a page is loaded, although you can use the WordPress Transients database API to implement a simple cache for them to improve performance (eg as described here and here).

Within a post, page or widget, we can also embed dynamic content. For example, we could embed a map that displays dynamically created markers that are essentially out of the control of the page or post publisher. Note that by default WordPress strips iframes from content (and it also seems reluctant to allow the upload of html files to the media gallery, at least by default). The preferred way to include custom embedded content seems to be to define a shortcode to embed the required content, although there are plugins around that allow you to embed iframes. (I didn’t spot one that let you inline the content of the iframe using srcdoc though?)

When we put together the Isle of Wight planning applications : Mapped page, one of the issues related to how updates to the map should be posted over time.


That is, should the map be uploaded to a fixed page and show only the most recent data, should it be posted as a timestamped post, to provide archival copies of the page, or should it be posted to a page and support a timeslider/history function?

Thinking about this again, the distinction seems to rely on what sort of (re)discovery we want to encourage or support. For example, if the page is a destination page, then we should probably use a page with a fixed URL for the most recent map. Older maps could be accessed via archive links, or perhaps subpages, if a time-filter wasn’t available on a single map view. Alternatively, we might want to alert readers to the map, in which case it might make more sense to use a timestamped post. (We could of course use a post to announce an update to the page, perhaps including a screenshot of the latest map in the post.)

It also strikes me that we need to consider publication schedules by a news outlet compared to the publication schedules associated with a particular dataset.

For example, Land Registry House Prices Paid data is published on a monthly basis, a few weeks after the end of the month for which the data was collected. In this case, it probably makes sense to publish on a monthly basis.

But what about care home or food outlet inspection data? The CQC publish data as it becomes available, although searches support the retrieval of data for a particular area published over the last week or last month relative to the time the search is made. The Food Standards Agency produce updates to data download files on a daily basis, but the file for any particular area is only updated when it contains new data. (So on any given day, you don’t know which, if any, area files will be updated.)

In this case, it may well be that a news outlet may want to do a couple of things:

  • publish summaries of reports over the last week or last month, on a weekly or monthly schedule – “The CQC published reports for N care homes in the region over the last month, of which X were positive and Y were negative”, etc.
  • engage in a more immediate or responsive publication of stories around particular reports as they are published by the responsible agency. In this case, the journalist needs to find a way of discovering stories in a timely fashion, either through signing up to alerts or inspecting the agency site on a regular basis.
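For data like the food outlet files, where you don’t know in advance which area files have changed, one simple way of discovering updates is to keep a digest of each file from the previous check and compare it with a digest of the file as downloaded today. The following is a minimal sketch of that idea; the function name, the area names and the file contents are invented for illustration, and the downloading step is left out.

```python
import hashlib


def changed_areas(previous_digests, current_files):
    """Compare today's files against the digests stored from the last check.

    previous_digests -- {area: hex digest} saved from the previous check
    current_files    -- {area: file content as bytes} downloaded today
    Returns (sorted list of changed or new areas, digests to save for next time).
    """
    changed = []
    new_digests = {}
    for area, content in current_files.items():
        digest = hashlib.sha256(content).hexdigest()
        new_digests[area] = digest
        if previous_digests.get(area) != digest:
            changed.append(area)
    return sorted(changed), new_digests


# Invented file contents standing in for two days of downloads
_, yesterday = changed_areas({}, {"Isle of Wight": b"v1", "Milton Keynes": b"v1"})
updated, today = changed_areas(yesterday, {"Isle of Wight": b"v2", "Milton Keynes": b"v1"})
print(updated)
```

Only the areas listed in `updated` would then need to be looked at for possible stories.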

Again, it might be that we can use posts and pages in a complementary way: pages that act as fixed destination sites with a fixed URL, and perhaps links off to archived historical sub-pages, as well as related news stories, that contain the latest summary; and posts that announce timely reports as well as ‘page updated’ announcements when the slower-changing page is updated.

More abstractly, it probably makes sense to consider the relative frequencies with which data is originally published (also considering whether the data is published according to a fixed schedule, or in a more responsive way as and when data becomes available), the frequency with which journalists check the data site, and the frequency with which journalists actually publish data related stories.

Robot Journalists or Robot Press Secretaries? Why Automated Reporting Doesn’t Need to be That Clever

Automated content generators, aka robot journalists, are turning up everywhere at the moment, it seems: the latest to cross my radar is a mention of “Dreamwriter” from Chinese publisher Tencent (End of the road for journalists? Tencent’s Robot reporter ‘Dreamwriter’ churns out perfect 1,000-word news story – in 60 seconds) to add to the other named narrative language generating bots I’m aware of, Automated Insights’ Wordsmith and Narrative Science’s Quill, for example.

Although I’m not sure of the detail, I assume that all of these platforms make use of quite sophisticated NLG (natural language generation) algorithms, to construct phrases, sentences, paragraphs and stories from atomic facts, identified story points, journalistic tropes and encoded linguistic theories.

One way of trying to unpick the algorithms is to critique, or even try to reverse engineer, stories known to be generated by the automatic content generators, looking for clues as to how they’re put together. See for example this recent BBC News story on Robo-journalism: How a computer describes a sports match.

Chatting to media academic Konstantin Dörr/@kndoerr in advance of the Future of Journalism conference in Cardiff last week (I didn’t attend the conference, just took the opportunity to grab a chat with Konstantin a couple of hours before his presentation on the ethical challenges of algorithmic journalism) I kept coming back to thoughts raised by my presentation at the Community Journalism event the day before [unannotated slides] about the differences between what I’m trying to explore and these rather more hyped-up initiatives.

In the first place, as part of the process, I’m trying to stay true to posting relatively simple – and complete – recipes that describe the work I’m doing so that others can play along. Secondly, in terms of the output, I’m not trying to do the NLG thing. Rather, I’m taking a template based approach – not much more than a form letter mail merge approach – to putting data into a textual form. Thirdly, the audience for the output is not the ultimate reader of a journalistic piece; rather, the intended audience is an intermediary, a journalist or researcher who needs an on-ramp providing them with useable access to data relevant to them that they can then use as the possible basis for a story.

In other words, the space I’m exploring is in-part supporting end-user development / end user programming (for journalist end-users, for example), in part automated or robotic press secretaries (not even robot reporters; see for example Data Reporting, not Data Journalism?) – engines that produce customised press releases from a national dataset at a local level that report a set of facts in a human readable way, perhaps along with supporting assets such as simple charts and very basic observational analysis (this month’s figures were more than last month’s figures, for example).

This model is one that supports a simple templated approach for a variety of reasons:

  1. each localised report has the same form as any other localised report (eg a report on jobseeker’s allowance figures for the Isle of Wight can take the same form as a report for Milton Keynes);
  2. it doesn’t matter so much if the report reads a little strangely, as long as the facts and claims are correct, because the output is not intended for final publication, as is, to the public – rather, it could be argued that it’s more like a badly written, fact based press statement that at least needs to go through a copy editor! In other words, we can start out scruffy…
  3. the similarity in form of one report to another is not likely to be a distraction to the journalist in the way that it would be to a general public reader presented with several such stories and expecting an interesting – and distinct – narrative in each one. Indeed, the consistent presentation might well aid the journalist in quickly spotting the facts and deciding on a storyline and what contextualisation may be required to add further interpretative value to it.
  4. targeting intermediary users rather than end users: the intermediary users all get to add their own value or style to the piece before the wider publication of the material, or use the data in support of other stories. That is, the final published form is not decided by the operator of the automatic content generator; rather, the automatically generated content is there to be customised, augmented, or used as supporting material, by an intermediary, or simply act as a “conversational” representation of a particular set of data provided to an intermediary.


The generation of the local datasets from the national dataset is trivial – having generated code to slice out one dataset (by postcode or local authority, for example), we can slice out any other. The generation of the press releases from the local datasets can make use of the same template. This can be applied locally (a hyperlocal using its own template, for example) or centrally created and managed as part of a datawire service.
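By way of illustration, the slice-then-template idea can be sketched in a few lines. This is not the code used for the published stories: the column names, the template wording and the figures are all invented, and the “analysis” is no more than a comparison with the previous month’s figure.

```python
TEMPLATE = (
    "{area}: {count} Jobseeker's Allowance claimants in {month}, "
    "{comparison} the {previous} recorded the previous month."
)


def comparison_phrase(current, previous):
    """Very basic observational analysis: compare this month with last month."""
    if current > previous:
        return "more than"
    if current < previous:
        return "fewer than"
    return "the same as"


def slice_area(rows, area):
    """Slice a local dataset out of the national one."""
    return [row for row in rows if row["area"] == area]


def press_release(rows, area, month, previous_month):
    """Fill the shared template with one area's figures."""
    local = {row["month"]: row["count"] for row in slice_area(rows, area)}
    current, previous = local[month], local[previous_month]
    return TEMPLATE.format(
        area=area, count=current, month=month,
        comparison=comparison_phrase(current, previous), previous=previous,
    )


# Invented figures standing in for the national dataset
NATIONAL = [
    {"area": "Isle of Wight", "month": "September", "count": 1150},
    {"area": "Isle of Wight", "month": "October", "count": 1200},
    {"area": "Milton Keynes", "month": "September", "count": 2100},
    {"area": "Milton Keynes", "month": "October", "count": 2050},
]
for area in ("Isle of Wight", "Milton Keynes"):
    print(press_release(NATIONAL, area, "October", "September"))
```

The same template serves every area, which is the point: the output reads like a form letter, and that is fine for an intermediary audience.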

At the moment, the couple of automatically generated stories published with OnTheWight have been simple fact reporting, albeit via a human editor, rather than acting as the starting point for a more elaborate, contextualised, narrative report. But how might we extend this approach?

In the case of Jobseeker’s Allowance figures, contextualising paragraphs such as the recent closure of a local business, or the opening of another, as possible contributory factors to any month on month changes to the figures, could add colour or contextualisation to a monthly report.

Or we might invert the use of the figures, adding them as context to workforce, employment or labour related stories. For example, in the event of a company closure, we might contextualise the number of jobs lost relative to local unemployment figures. (This fact augmented reporting is more likely to happen if the figures are readily available/to hand, as they are via autoresponder channels such as a Slackbot Data Wire.)

But I guess we have to start somewhere! And that somewhere is the simple (automatically produced, human copy edited) reporting of the facts.

PS in passing, I note via Full Fact that the Department of Health “will provide press officers [with an internal ‘data document’] with links to sources for each factual claim made in a speech, as well as contact details for the official or analyst who provided the information”, Department of Health to speed up responses to media and Full Fact. Which gets me thinking: what form might a press office publishing “data supported press releases” take, cf. a University Expert Press Room or Social Media Releases and the University Press Office, for example?

Writing Each Row of a Spreadsheet as a Press Release?

A few days ago, I saw via the @HSCICOpenData Twitter feed that an annually released dataset on Written Complaints in the NHS has just been published.

The data comes in the form of a couple of spreadsheets in which each row describes a count of the written complaints received and upheld under a variety of categories for each GP and dental practice, or local NHS trust.

The practice level spreadsheet looks like this:


Each practice is identified solely by a practice code – to find the name and address of the actual practice requires looking up the code in another dataset.

The column headings supplied in the CSV document only partially identify each column (and indeed, there are duplicates, such as Total number of written complaints received, that a spreadsheet reader might disambiguate by adding a numerical suffix) – a more complete description (that shows how the columns are actually hierarchically defined) is provided in an associated metadata spreadsheet.


For a reporter wanting to know whether or not any practices in their area fared particularly badly in terms of the number of upheld complaints, the task might be broken down as follows:

  1. identify the practices of interest from their practice codes (which requires finding a set of practice codes of interest);
  2. for each of those practices, look along the row to see whether or not there are any large numbers in the complaints upheld column.

But if you have a spreadsheet with 10, 20, 30 or more columns, scanning along a row looking for items of interest can rapidly become quite a daunting task.

So an idea I’ve been working on, which I suspect harkens back to the earliest days of database reporting, is to look at ways of turning each row of data into a text based, human readable report.

Something like the following, for example:


Each record, each “Complaint Report”, is a textual rendering of a single row from the practice complaints spreadsheet, with a bit of administrative metadata enrichment in the form of the practice name, address (and in later versions, telephone number).

These reports are quicker to scan, and could be sorted or highlighted depending on the number of upheld complaints, for example. The journalist can then quickly review the reports, and identify any practices that might be worth phoning up for a comment to ask why they appear to have received a large number of upheld complaints in a particular area, for example… Data driven press releases used to assist reporting, in other words.
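The general shape of the row-to-text step might look something like the following. This is an illustration only, not the script used to generate the report: the column names, practice code and practice details are invented, and the practice lookup is just a dictionary keyed by practice code.

```python
def complaint_report(row, practices):
    """Render one spreadsheet row as a short, human readable text report.

    row       -- dict of column name -> value for a single practice
    practices -- {practice code: {"name": ..., "address": ...}} lookup
    Only columns with non-zero counts are included in the report.
    """
    code = row["practice_code"]
    details = practices.get(code, {"name": "Unknown practice", "address": ""})
    lines = ["Complaint Report: {} ({})".format(details["name"], code)]
    if details["address"]:
        lines.append(details["address"])
    for column, value in row.items():
        if column == "practice_code":
            continue
        if int(value) > 0:
            lines.append("- {}: {}".format(column, value))
    return "\n".join(lines)


# Invented row and lookup data for illustration
row = {
    "practice_code": "X00001",
    "Clinical complaints received": "4",
    "Clinical complaints upheld": "2",
    "Admin complaints received": "0",
    "Admin complaints upheld": "0",
}
practices = {"X00001": {"name": "Example Surgery", "address": "1 Example Road"}}
print(complaint_report(row, practices))
```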

FWIW, I popped up a sketch script that generates the above report from the data, and also pulls in practice administrative metadata from an epracurr spreadsheet, here: NHS complaints spreadsheet2text sketch. See also: Data Driven Press Releases From HSCIC Data – Diabetes Prescribing.

PS I’m not a Microsoft Office suite user, but I suspect you can get a fair way along this sort of process by using a mail merge? There may be other ways of generating templated reports too. Any Microsoft Office users fancy letting me know how you’d go about doing something like the above in Word and Excel? I’d guess complicating factors are the requirements to make use of the column headers and only display the items associated with non-zero counts, which perhaps requires some macro magic? Things could perhaps be simplified by reshaping the data, perhaps putting it into a long form by melting the complaints columns, or melting the complaints columns cannily to provide two value columns, one for complaints received and one for complaints upheld?


Then you could filter out the blank rows before the merge.
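The reshaping idea can be sketched in plain Python as follows. It assumes, purely for illustration, that the wide column names take the form "Category: received" and "Category: upheld"; the real headings would need mapping onto something like that first. Each wide row becomes one long row per complaint category, with two value columns, and categories with no complaints at all are filtered out.

```python
def melt_complaints(row, id_column="practice_code", sep=": "):
    """Melt one wide row into long rows with 'received' and 'upheld' columns."""
    by_category = {}
    for column, value in row.items():
        if column == id_column:
            continue
        # Split e.g. 'Clinical: upheld' into the category and the measure
        category, measure = column.rsplit(sep, 1)
        counts = by_category.setdefault(category, {"received": 0, "upheld": 0})
        counts[measure] = int(value)
    # Keep only categories that have at least one non-zero count
    return [
        {id_column: row[id_column], "category": category, **counts}
        for category, counts in by_category.items()
        if counts["received"] or counts["upheld"]
    ]


# Invented wide row for illustration
wide_row = {
    "practice_code": "X00001",
    "Clinical: received": "3",
    "Clinical: upheld": "1",
    "Admin: received": "0",
    "Admin: upheld": "0",
}
for long_row in melt_complaints(wide_row):
    print(long_row)
```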

Notes on Robot Churnalism, Part II – Robots in the Journalism Workplace

In the first part of this series (Notes on Robot Churnalism, Part I – Robot Writers), I reviewed some of the ways in which robot writers are able to contribute to the authoring of news content.

In this part, I will consider some of the impacts that might arise from robots entering the workplace.

Robot Journalism in the Workplace

“Robot journalists” have some competitive advantages which are hard for human journalists to compete with. The strengths of automated content generation are the low marginal costs, the speed with which articles can be written and the broad spectrum of sport events which can be covered.
Arjen van Dalen, The Algorithms Behind the Headlines, Journalism Practice, 6:5-6, 648-658, 2012, p652

One thing machines do better is create value from large amounts of data at high speed. Automation of process and content is the most under-explored territory for reducing costs of journalism and improving editorial output. Within five to 10 years, we will see cheaply produced information monitored on networks of wireless devices.
Post Industrial Journalism: Adapting to the Present, Chris Anderson, Emily Bell, Clay Shirky, Tow Center for Digital Journalism Report, December 3, 2014

Year on year, it seems, the headlines report how the robots are coming to take over a wide range of professional jobs and automate away the need to employ people to fill a wide range of currently recognised roles (see, for example, this book: The Second Machine Age [review], this Observer article: Robots are leaving the factory floor and heading for your desk – and your job, this report: The Future of Employment: How susceptible are jobs to computerisation? [PDF], this other report: AI, Robotics, and the Future of Jobs [review], and this business case: Rethink Robotics: Finding a Market).

Stories also abound fearful of a possible robotic takeover of the newsroom: ‘Robot Journalist’ writes a better story than human sports reporter (2011), The robot journalist: an apocalypse for the news industry? (2012), Can an Algorithm Write a Better News Story Than a Human Reporter? (2012), Robot Writers and the Digital Age (2013), The New Statesman could eventually be written by a computer – would you care? (2013), The journalists who never sleep (2014), Rise of the Robot Journalist (2014), Journalists, here’s how robots are going to steal your job (2014), Robot Journalist Finds New Work on Wall Street (2015).

It has to be said, though, that many of these latter “inside baseball” stories add nothing new, perhaps reflecting the contributions of another sort of robot to the journalistic process: web search engines like Google…

Looking to the academic literature, in his 2015 case study around Narrative Science, Matt Carlson describes how “public statements made by its management reported in news about the company reveal two commonly expressed beliefs about how its technology will improve journalism: automation will augment— rather than displace — human journalists, and it will greatly expand journalistic output” p420 (Matt Carlson (2015), The Robotic Reporter, Digital Journalism, 3:3, 416-431).

As with the impact of many other technological innovations within the workplace, “[a]utomated journalism’s ability to generate news accounts without intervention from humans raises questions about the future of journalistic labor” (Carlson, 2015, p422). In contrast to the pessimistic view that “jobs will be lost”, there are at least two possible positive outcomes for jobs that may result from the introduction of a new technology: firstly, that the technology helps transform the original job and in so doing helps make it more rewarding, or allows the original worker to “do more”; secondly, that the introduction of the new technology creates new roles and new job opportunities.

On the pessimistic side, Carlson describes how:

many journalists … question Narrative Science’s prediction that its service would free up or augment journalists, including Mathew Ingram (GigaOm, April 25, 2012): “That’s a powerful argument, but it presumes that the journalists who are ‘freed up’ because of Narrative Science … can actually find somewhere else that will pay them to do the really valuable work that machines can’t do. If they can’t, then they will simply be unemployed journalists.” This view challenges the virtuous circle suggested above to instead argue that some degree of displacement is inevitable. (Carlson, 2015, p423)

On the other hand:

[a]ccording to the more positive scenario, machine-written news could be complementary to human journalists. The automation of routine tasks offers a variety of possibilities to improve journalistic quality. Stories which cannot be covered now due to lack of funding could be automated. Human journalists could be liberated from routine tasks, giving them more time to spend on quality, in-depth reporting, investigative reporting. (van Dalen, p653)

This view thus represents the idea of algorithms working alongside the human journalists, freeing them up from the mundane tasks and allowing them to add more value to a story… If a journalist has 20 minutes to spend on a story, and that time is spent searching a database and pulling out a set of numbers that may not even be very newsworthy, how much more journalistically productive could that journalist be if a machine gave them the data and a canned summary of it for free, allowing the journalist to use the few minutes allocated to that story to take the next step – adding in some context, perhaps, or contacting a second source for comment?

A good example of the time-saving potential of automated copy production can be seen in the publication of earnings reports by AP, as reported by a trade blog that quoted vice president and managing editor Lou Ferrara’s announcement of a tenfold increase in stories from 300 per quarter produced by human journalists, to 3,700 with machine support (AP uses automation to increase story output tenfold, June, 2015).

The process AP went through during testing appears to be one that I’m currently exploring with my hyperlocal, OnTheWight, for producing monthly JobSeekers Allowance reports (here’s an example of the human produced version, which in this case was corrected after a mistake was spotted when checking that an in-testing machine generated version of the report was working correctly..! As reported about AP, “journalists were doing all their own manual calculations to produce the reports, which Ferrara said had ‘potential for error’.” Exactly the same could have been said of the OnTheWight process…)

In the AP case, “during testing, the earnings reports were produced via automation and journalists compared them to the relevant press release and figured out bugs before publishing them. A team of five reporters worked on the project, and Ferrara said they still had to check for everything a journalist would normally check for, from spelling mistakes to whether the calculations were correct.” (I wonder if they check the commas, too?!) The process I hope to explore with OnTheWight builds in the human checking route, taking the view that the machine should generate press-release style copy that does the grunt work in getting the journalist started on the story, rather than producing the complete story for them. At AP, it seems that automation “freed up staff time by one fifth”. The process I’m hoping to persuade OnTheWight to adopt is that to begin with, the same amount of time should be spent on the story each month, but month on month we automate a bit more and the journalistic time is then spent working up what the next paragraph might be, and then in turn automate the production of that…
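As a rough illustration of what “press-release style copy that does the grunt work” might look like, here is a minimal sketch of a template that turns two monthly claimant counts into an opening sentence. The function names (`describe_change`, `jsa_summary`), the sentence wording and the figures are all assumptions made for illustration; this is not the actual OnTheWight or AP process.

```python
def describe_change(current: int, previous: int) -> str:
    """Return a short phrase describing the month-on-month change."""
    diff = current - previous
    if diff == 0:
        return "unchanged from"
    direction = "up" if diff > 0 else "down"
    if previous == 0:
        # A percentage change is undefined against a zero baseline.
        return f"{direction} {abs(diff):,} on"
    pct = abs(diff) / previous * 100
    return f"{direction} {abs(diff):,} ({pct:.1f}%) on"


def jsa_summary(area: str, month: str, current: int,
                prev_month: str, previous: int) -> str:
    """Generate one press-release style opening sentence from two counts."""
    change = describe_change(current, previous)
    return (f"There were {current:,} Jobseeker's Allowance claimants in {area} "
            f"in {month}, {change} the {previous:,} recorded in {prev_month}.")


# Illustrative figures only, not real claimant counts.
print(jsa_summary("the Isle of Wight", "May", 1500, "April", 1560))
```

The point of the sketch is that the calculation and the boilerplate sentence are done by the machine, leaving the journalist to check the figures and write the next paragraph.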

Extending the Promise?

In addition to time-saving, there is the hope that the wider introduction of robot journalists will create new journalistic roles:

Beyond questions of augmentation or elimination, Narrative Science’s vision of automated journalism requires the transformation of journalistic labor to include such new positions as “meta-writer” or “metajournalist” to facilitate automated stories. For example, Narrative Science’s technology can only automate sports stories after journalists preprogram it with possible frames for sports stories (e.g., comeback, blowout, nail-biter, etc.) as well as appropriate descriptive language. After this initial programming, automated journalism requires ongoing data management. Beyond the newsroom, automated journalism also redefines roles for non-journalists who participate in generating data. (Carlson, 2015, p423)

In the first post of this series, I characterised the process used by Narrative Science, which included the application of rules for detecting signals and angles, and the linkage of detected “facts” to story points within a particular angle that could then be used to generate a narrative told through automatically generated natural language. Constructing angles, identifying logical processes that can identify signals and map them on to story elements, and generating turns of phrase that can help explicate narratives in a natural way are all creative acts that are likely to require human input for the near future at least, albeit tasking the human creative with the role of supporting the machine. This is not necessarily that far removed from some of the skills already employed by journalists, however. As Carlson suggests, “Scholars have long documented the formulaic nature underlying compositional forms of news exposed by the arrival of automated news. … much journalistic writing is standardized to exclude individual voice. This characteristic makes at least a portion of journalistic output susceptible to automation” (p425). What’s changing, perhaps, is that now the journalists must learn to capture those standardised forms and map them onto structures that act as programme fodder for their robot helpers.

Audience Development

Narrative Science also see potential in increasing the size of the total potential audience by accommodating the very specific needs of a large number of niche audiences.

“While Narrative Science flaunts the transformative potential of automated journalism to alter both the landscape of available news and the work practices of journalists, its goal when it comes to compositional form is conformity with existing modes of human writing. The relationship here is telling: the more the non-human origin of its stories is undetectable, the more it promises to disrupt news production. But even in emulating human writing, the application of Narrative Science’s automation technology to news prompts reconsiderations of the core qualities underpinning news composition. The attention to the quality and character of Narrative Science’s automated news stories reflects deep concern both with existing news narratives and with how automated journalistic writing commoditizes news stories.” Carlson, 2015, p424

In the midst of this mass of stories, it’s possible that there will be some “outliers” that are of more general interest which can, with some additional contextualisation and human reporting, be made relevant to a wider audience.

There is also the possibility of searching for “meta-stories” that tell not the specifics of particular cases, but identify trends across the mass of stories as a whole. (Indeed, it is by looking for such trends and patterns that outliers may be detected.) In addition, patterns that only become relevant when looking across all the individual stories might in turn lead to additional stories. (For example, a failing school operated by a particular provider is perhaps of only local interest, but if it turns out that the majority of schools operated by a particular provider were turned round from excellent to failing by that provider, questions might, perhaps, be worth asking…?!)
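The school example can be sketched as a simple aggregation across many individually “local” records. This is a minimal sketch under stated assumptions: the record fields (`provider`, `before`, `after`), the `provider_trends` name and the `min_schools` cut-off are all hypothetical, and the data is made up.

```python
from collections import defaultdict


def provider_trends(schools, min_schools=3):
    """Return providers where a majority of schools went from excellent to failing.

    `schools` is a list of dicts with hypothetical keys: provider, before, after.
    Providers with fewer than `min_schools` schools are ignored as too small to judge.
    """
    counts = defaultdict(lambda: [0, 0])  # provider -> [declined, total]
    for school in schools:
        tally = counts[school["provider"]]
        tally[1] += 1
        if school["before"] == "excellent" and school["after"] == "failing":
            tally[0] += 1
    return {
        provider: declined / total
        for provider, (declined, total) in counts.items()
        if total >= min_schools and declined / total > 0.5
    }


# Made-up records for illustration only.
records = [
    {"provider": "A", "before": "excellent", "after": "failing"},
    {"provider": "A", "before": "excellent", "after": "failing"},
    {"provider": "A", "before": "good", "after": "good"},
    {"provider": "B", "before": "excellent", "after": "excellent"},
    {"provider": "B", "before": "excellent", "after": "failing"},
    {"provider": "B", "before": "good", "after": "good"},
]
print(provider_trends(records))
```

Each record on its own is a local story; the flagged provider is the candidate “meta-story” that would still need human reporting to stand up.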

When it comes to the case for expanding the range of content that is available, Narrative Science’s hope appears to be that:

[t]he narrativization of data through sophisticated artificial intelligence programs vastly expands the terrain of news. Automated journalism becomes a normalized component of the news experience. Moreover, Narrative Science has tailored its promotional discourse to reflect the economic uncertainty of online journalism business models by suggesting that its technology will create a virtuous circle in which increased news revenue supports more journalists (Carlson, 2015, p 421).

The alternative, fearful view, of course, is that revenues will be protected by reducing the human wage bill, using robot content creators operating at a near zero marginal cost on particular story types to replace human content creation.

Whether news organisations will use automation to extend the range of producers in the newsroom, or contribute to the reduction of human creative input to the journalistic process, is perhaps still to be seen. As Anderson, Bell & Shirky noted, “the reality is that most journalists at most newspapers do not spend most of their time conducting anything like empirically robust forms of evidence gathering.” Perhaps now is the time for them to stop churning the press releases and statistics announcements – after all, the machines can do that faster and better – and concentrate more on contextualising and explaining the machine generated stories, as well as spending more time out hunting for stories and pursuing their own investigative leads?

Notes on Robot Churnalism, Part I – Robot Writers

In Some Notes on Churnalism and a Question About Two Sided Markets, I tried to pull together a range of observations about the process of churnalism, in which journalists propagate PR copy without much, if any, critique, contextualisation or corroboration.

If that view in any way represents a fair description of how some pre-packaged content, at least, makes its way through to becoming editorial content, where might the robots fit in? To what extent might we start to see “robot churnalism”, and what form or forms might it take?

There are two particular ways in which we might consider robot churnalism:

  1. “robot journalists” that produce copy that acts as a third conveyor belt complementary to PA-style wire and PR feedstocks;
  2. robot churnalists as ‘reverse’ gatekeepers, choosing what wire stories to publish where based on traffic stats and web analytics.

A related view is taken by Philip Napoli (“Automated media: An institutional theory perspective on algorithmic media production and consumption.” Communication Theory 24.3 (2014): 340-360; a shorter summary of the key themes can be found here) who distinguishes roles for algorithms in “(a) media consumption and (b) media production”. He further refines the contributions algorithms may make in media production by suggesting that “[t]wo of the primary functions that algorithms are performing in the media production realm at this point are: (a) serving as a demand predictor and (b) serving as content creator.”

Robot Writers

“Automated content can be seen as one branch of what is known as algorithmic news” writes Christer Clerwall (2014, Enter the Robot Journalist, Journalism Practice, 8:5, pp519-531), a key component of automated journalism “in which a program turns data into a news narrative, made possible with limited — or even zero — human input” (Matt Carlson (2015) The Robotic Reporter, Digital Journalism, 3:3, 416-431).

In a case study based around the activities of Narrative Science, a company specialising in algorithmically created, data driven narratives, Carlson further conceptualises “automated journalism” as “algorithmic processes that convert data into narrative news texts with limited to no human intervention beyond the initial programming”. He goes on:

The term denotes a split from data analysis as a tool for reporters encompassed in writings about “computational and algorithmic journalism” (Anderson 2013) to indicate wholly computer-written news stories emulating the compositional and framing practices of human journalism (ibid, p417).

Even several years ago, Arjen van Dalen observed that “[w]ith the introduction of machine-written news computational journalism entered a new phase. Each step of the news production process can now be automated: “robot journalists” can produce thousands of articles with virtually no variable costs” (The Algorithms Behind the Headlines, Journalism Practice, 6:5-6, 648-658, 2012, p649).

Sport and financial reporting examples abound from the bots of Automated Insights and Narrative Science (for example, Notes on Narrative Science and Automated Insights or Pro Publica: How To Edit 52,000 Stories at Once, and more recently e.g. Robot-writing increased AP’s earnings stories by tenfold), with robot writers generating low-cost content to attract page views, “producing content for the long tail, in virtually no time and with low additional costs for articles which can be produced in large quantities” (ibid, p649).

Although writing back in 2012, van Dalen noted in his report on “the responses of the journalistic community to automatic content creation” that:

[t]wo main reasons are mentioned to explain why automated content generation is a trend that needs to be taken seriously. First, the journalistic profession is more and more commercialized and run on the basis of business logics. The automation of journalism tasks fits in with the trend to aim for higher profit margins and lower production costs. The second reason why automated content creation might be successful is the quality of stories with which it is competing. Computer-generated news articles may not be able to compete with high quality journalism provided by major news outlets, which pay attention to detail, analysis, background information and have more lively language or humour. But for information which is freely available on the Internet the bar is set relatively low and automatically generated content can compete (ibid, p651).

As Christer Clerwall writes in Enter the Robot Journalist, (Journalism Practice, 8:5, 2014, pp519-531):

The advent of services for automated news stories raises many questions, e.g. what are the implications for journalism and journalistic practice, can journalists be taken out of the equation of journalism, how is this type of content regarded (in terms of credibility, overall quality, overall liking, to mention a few aspects) by the readers? p520.

van Dalen puts it thus:

Automated content creation is seen as serious competition and a threat for the job security of journalists performing basic routine tasks. When routine journalistic tasks can be automated, journalists are forced to offer a better product in order to survive. Central in these reflections is the need for journalists to concentrate on their own strengths rather than compete on the strengths of automated content creation. Journalists have to become more creative in their writing, offer more in-depth coverage and context, and go beyond routine coverage, even to a larger extent than they already do today (ibid, p653).

He then goes on to produce the following SWOT analysis to explore just how the humans and the robots compare:


One possible risk associated with the automated production of copy is that it becomes published without human journalistic intervention, and as such is not necessarily “known”, or even read, by any member at all of the publishing organisation. To paraphrase Daniel Jackson and Kevin Moloney, “Inside Churnalism: PR, journalism and power relationships in flux”, Journalism Studies, 2015, this would represent an extreme example of churnalism in the sense of “the use of unchecked [robot authored] material in news”.

This is dangerous, I think, on many levels. The more we leave the setting of the news agenda and the identification of news values to machines, the more we lose any sensitivity to what’s happening in the world around us and what stories are actually important to an audience as opposed to merely being Like-bait titillation. (As we shall see, algorithmic gatekeepers that channel content to audiences based on various analytics tools respond to one definition of what audiences value. But it is not clear that these are necessarily the same issues that might weigh more heavily in a personal-political sense. Reviews of the notion of “hard” vs. “soft” news (e.g. Reinemann, C., Stanyer, J., Scherr, S., & Legnante, G. (2011). Hard and soft news: A review of concepts, operationalizations and key findings. Journalism, 13(2) pp221–239) may provide lenses to help think about this more deeply?)

Of course, machines can also be programmed to look for links and patterns across multiple sources of information and at far greater scale than a human journalist could hope to cover, but we are then in danger of creating some sort of parallel news world, where events are only recognised, “discussed” and acted upon by machines and human actors are oblivious to them. (For an example, see The Wolf of Wall Tweet: A Web-reading bot made millions on the options market. It also ate this guy’s lunch, which describes how bots read the news wires and trade off the back of them. They presumably also read wire stories created by other bots…)

So What Is It That Robot Writers Actually Do All Day?

In a review of Associated Press’ use of Automated Insights’ Wordsmith application (In the Future, Robots Will Write News That’s All About You), Wired reported that Wordsmith “essentially does two things. First, it ingests a bunch of structured data and analyzes it to find the interesting points, such as which players didn’t do as well as expected in a particular game. Then it weaves those insights into a human readable chunk of text.”

One way of getting deeper into the mind of a robot writer is to look to the patents held by the companies who develop such applications. For example, in The Anatomy of a Robot Journalist, one process used by Narrative Science is characterised as follows:


Identifying newsworthy features (or story points) is a process of identifying features and then filtering out the ones that are somehow notable. Angles are possibly defined in terms of sets of features that need to be present within a particular dataset for that angle to provide a possible frame for a story. The process of reconciling interesting features with angle points populates the angle with known facts, and a story engine then generates the natural language text within a narrative structure suited to an explication of the selected angle.
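To make that pipeline concrete, here is a toy sketch of the three steps for a sports result: derive story points, match them against angles, and fill in a template. It is emphatically not Narrative Science’s implementation; the `Angle` class, the thresholds, the field names in the `game` dict and the template wording are all assumptions made for illustration, and draws are not handled.

```python
from dataclasses import dataclass


@dataclass
class Angle:
    name: str
    required: frozenset  # story points that must be present for this angle
    template: str


def story_points(game: dict) -> dict:
    """Derive features from the data, keeping flags only for the notable ones."""
    home_won = game["home_score"] > game["away_score"]  # assumes no draws
    winner = game["home"] if home_won else game["away"]
    loser = game["away"] if home_won else game["home"]
    margin = abs(game["home_score"] - game["away_score"])
    points = {"winner": winner, "loser": loser, "margin": margin}
    if margin >= 20:  # hypothetical threshold
        points["blowout"] = True
    if margin <= 3:  # hypothetical threshold
        points["close"] = True
    if game["half_time_leader"] == loser:
        points["trailed_at_half"] = True
    return points


# Angles in priority order; the last one always matches.
ANGLES = [
    Angle("comeback", frozenset({"trailed_at_half"}),
          "{winner} came from behind to beat {loser} by {margin}."),
    Angle("blowout", frozenset({"blowout"}),
          "{winner} thrashed {loser} by {margin} points."),
    Angle("nail-biter", frozenset({"close"}),
          "{winner} edged {loser} by {margin} in a nail-biter."),
    Angle("default", frozenset(), "{winner} beat {loser} by {margin}."),
]


def write_story(game: dict) -> str:
    """Pick the first angle whose required points are present and render it."""
    points = story_points(game)
    for angle in ANGLES:
        if angle.required.issubset(points):
            return angle.template.format(**points)
    raise ValueError("no angle matched")


game = {"home": "Rovers", "away": "United", "home_score": 30,
        "away_score": 5, "half_time_leader": "Rovers"}
print(write_story(game))
```

Writing the angle definitions and the turns of phrase in the templates is where the human “meta-writer” effort goes.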

(An early – 2012 – presentation by Narrative Science’s Larry Adams also reviews some of the technicalities: Using Open Data to Generate Personalized Stories.)

In actual fact, the process may be a relatively straightforward one, as demonstrated by the increasing numbers of “storybots” that populate social media. One well known class of examples are earthquake bots that tweet news of earthquakes (see also: When robots help human journalists: “This post was created by an algorithm written by the author”). (It’s easy enough to see how various newsworthiness filters might work here: a geo-based one for reporting a story locally, a wider interest one for reporting an earthquake above a particular magnitude, and so on.)
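The two filters mentioned above might be sketched as follows. The thresholds, the `is_newsworthy` name and the shape of the `quake` record are assumptions for illustration rather than a description of how any particular earthquake bot works; the distance calculation is the standard haversine formula.

```python
from math import asin, cos, radians, sin, sqrt


def distance_km(lat1, lon1, lat2, lon2):
    """Great-circle distance between two points using the haversine formula."""
    lat1, lon1, lat2, lon2 = map(radians, (lat1, lon1, lat2, lon2))
    a = (sin((lat2 - lat1) / 2) ** 2
         + cos(lat1) * cos(lat2) * sin((lon2 - lon1) / 2) ** 2)
    return 2 * 6371 * asin(sqrt(a))  # 6371 km is the mean Earth radius


def is_newsworthy(quake, home, local_radius_km=100,
                  local_min_mag=2.5, wide_min_mag=6.0):
    """Classify a quake as 'wide', 'local' or None (not reported).

    All thresholds are hypothetical defaults chosen for illustration.
    """
    if quake["mag"] >= wide_min_mag:
        return "wide"
    near = distance_km(quake["lat"], quake["lon"], *home) <= local_radius_km
    if near and quake["mag"] >= local_min_mag:
        return "local"
    return None


newsroom = (34.05, -118.24)  # roughly Los Angeles
print(is_newsworthy({"mag": 3.1, "lat": 34.10, "lon": -118.30}, newsroom))
print(is_newsworthy({"mag": 7.0, "lat": 35.68, "lon": 139.69}, newsroom))
```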

It’s also easy enough to create your own simple storybot (or at least, an “announcer bot”) using something like IFTTT that can take in an RSS feed and make a tweet announcement about each new item. A collection of simple twitterbots produced as part of a journalism course on storybots, along with code examples, can be found here: A classroom experiment in Twitter Bots and creativity. Here’s another example, for a responsive weatherbot that tries to geolocate someone sending a message to the bot and respond to them with a weather report for their location.
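For anyone who would rather see the logic than wire it up in a hosted service, the core of an announcer bot is small. The sketch below only composes the announcement text from an RSS document using the Python standard library; it does not fetch a live feed or post anywhere. The sample feed, the `announcements` function and the 280 character limit are assumptions for illustration.

```python
import xml.etree.ElementTree as ET

# A made-up RSS 2.0 fragment standing in for a real feed.
SAMPLE_RSS = """<rss version="2.0"><channel>
<title>Example feed</title>
<item><title>Council approves new ferry timetable</title>
<link>https://example.com/ferry</link></item>
<item><title>Library opening hours to change</title>
<link>https://example.com/library</link></item>
</channel></rss>"""


def announcements(rss_text, seen, limit=280):
    """Return one announcement string per feed item not already in `seen`.

    `seen` is a set of links already announced; it is updated in place.
    Assumes links are much shorter than `limit`.
    """
    root = ET.fromstring(rss_text)
    messages = []
    for item in root.iter("item"):
        title = item.findtext("title", default="").strip()
        link = item.findtext("link", default="").strip()
        if not link or link in seen:
            continue
        seen.add(link)
        room = limit - len(link) - 1  # space left for the title
        if len(title) > room:
            title = title[:room - 1].rstrip() + "…"
        messages.append(f"{title} {link}")
    return messages


already_announced = set()
for message in announcements(SAMPLE_RSS, already_announced):
    print(message)
```

Running the function a second time with the same `already_announced` set yields nothing, which is the behaviour that stops a bot re-announcing old items each time it polls the feed.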

Not being of a journalistic background, and never having read much on media or communications theory, I have to admit I don’t really have a good definition for what angles are, or a typology for them in different topic areas, and I’m struggling to find any good structural reviews of the idea, perhaps because it’s so foundational? For now, I’m sticking with a definition of “an angle” as being something along the lines of the thing you want to focus on and dig deeper around within the story (the thing you want to know more about or whose story you want to tell; this includes abstract things: the story of an indicator value for example, over time). The blogpost Framing and News Angles: What is Bias? contrasts angles with the notions of framing and bias. Entman, Robert M. “Framing: Towards clarification of a fractured paradigm.” McQuail’s reader in mass communication theory (1993): 390-397 [pdf] seems foundational in terms of the framing idea, De Vreese, Claes H. “News framing: Theory and typology.” Information design journal & document design 13.1 (2005): 51-62 [PDF] offers a review (of sorts) of some related literature, and Reinemann, C., Stanyer, J., Scherr, S., & Legnante, G. (2011). Hard and soft news: A review of concepts, operationalizations and key findings. Journalism, 13(2) pp221–239 (PDF) perhaps provides another way in to related literature? Bias is presumably implicit in the selection of any particular frame or angle? Blog posts such as What makes a press release newsworthy? It’s all in the news angle look to be linkbait, perhaps even stolen content (eg here’s a PDF), but I can’t offhand find a credible source or inspiration for the original list? Resource packs like this one on Working with the Media from the FAO give a crash course in what I guess are some of the generally taught basics around story construction?