Automated content generators, aka robot journalists, seem to be turning up everywhere at the moment. The latest to cross my radar is a mention of “Dreamwriter” from Chinese publisher Tencent (End of the road for journalists? Tencent’s Robot reporter ‘Dreamwriter’ churns out perfect 1,000-word news story – in 60 seconds), which joins the other named narrative language generating bots I’m aware of, such as Automated Insights’ Wordsmith and Narrative Science’s Quill.
Although I’m not sure of the detail, I assume that all of these platforms make use of quite sophisticated NLG (natural language generation) algorithms, to construct phrases, sentences, paragraphs and stories from atomic facts, identified story points, journalistic tropes and encoded linguistic theories.
One way of trying to unpick the algorithms is to critique, or even try to reverse engineer, stories known to be generated by the automatic content generators, looking for clues as to how they’re put together. See for example this recent BBC News story on Robo-journalism: How a computer describes a sports match.
Chatting to media academic Konstantin Dörr/@kndoerr in advance of the Future of Journalism conference in Cardiff last week (I didn’t attend the conference, just took the opportunity to grab a chat with Konstantin a couple of hours before his presentation on the ethical challenges of algorithmic journalism) I kept coming back to thoughts raised by my presentation at the Community Journalism event the day before [unannotated slides] about the differences between what I’m trying to explore and these rather more hyped-up initiatives.
In the first place, as part of the process, I’m trying to stay true to posting relatively simple – and complete – recipes that describe the work I’m doing so that others can play along. Secondly, in terms of the output, I’m not trying to do the NLG thing. Rather, I’m taking a template based approach – not much more than a form letter mail merge approach – to putting data into a textual form. Thirdly, the audience for the output is not the ultimate reader of a journalistic piece; rather, the intended audience is an intermediary, a journalist or researcher who needs an on-ramp providing them with useable access to data relevant to them that they can then use as the possible basis for a story.
In other words, the space I’m exploring is in part supporting end-user development / end-user programming (for journalist end-users, for example), and in part automated or robotic press secretaries (not even robot reporters; see for example Data Reporting, not Data Journalism?). By that I mean engines that produce customised press releases from a national dataset at a local level, reporting a set of facts in a human readable way, perhaps along with supporting assets such as simple charts and very basic observational analysis (this month’s figures were more than last month’s figures, for example).
This model is one that supports a simple templated approach for a variety of reasons:
- each localised report has the same form as any other localised report (eg a report on jobseeker’s allowance figures for the Isle of Wight can take the same form as a report for Milton Keynes);
- it doesn’t matter so much if the report reads a little strangely, as long as the facts and claims are correct, because the output is not intended for final publication, as is, to the public – rather, it could be argued that it’s more like a badly written, fact based press statement that at least needs to go through a copy editor! In other words, we can start out scruffy…
- the similarity in form of one report to another is not likely to be a distraction to the journalist in the way that it would be to a general public reader presented with several such stories and expecting an interesting – and distinct – narrative in each one. Indeed, the consistent presentation might well aid the journalist in quickly spotting the facts and deciding on a storyline and what contextualisation may be required to add further interpretative value to it.
- targeting intermediary users rather than end users: the intermediary users all get to add their own value or style to the piece before the wider publication of the material, or to use the data in support of other stories. That is, the final published form is not decided by the operator of the automatic content generator. Rather, the automatically generated content is there to be customised, augmented, or used as supporting material by an intermediary, or simply to act as a “conversational” representation of a particular set of data provided to an intermediary.
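By way of illustration, here is a minimal sketch of the form letter style of templating I have in mind, including the sort of very basic observational comparison mentioned above. The template wording, the function names and the figures are all made up for the example; this is not the code behind any published story.

```python
from string import Template

# Hypothetical template text; the wording is illustrative only.
REPORT = Template(
    "In $month, $count people claimed Jobseeker's Allowance in $area, "
    "$comparison the previous month's figure of $previous."
)


def describe_change(current: int, previous: int) -> str:
    """Return a basic observational phrase comparing two monthly figures."""
    if current > previous:
        return f"{current - previous} more than"
    if current < previous:
        return f"{previous - current} fewer than"
    return "the same as"


def render_report(area: str, month: str, current: int, previous: int) -> str:
    """Fill the template with the facts for one area (a simple mail merge)."""
    return REPORT.substitute(
        area=area,
        month=month,
        count=current,
        previous=previous,
        comparison=describe_change(current, previous),
    )


# Invented figures, purely to show the shape of the output.
print(render_report("the Isle of Wight", "October", 1250, 1300))
```

The output may read a little stiffly, but that is rather the point: the facts are in place and a copy editor can tidy up the rest.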
The generation of the local datasets from the national dataset is trivial – having generated code to slice out one dataset (by postcode or local authority, for example), we can slice out any other. The generation of the press releases from the local datasets can make use of the same template. This can be applied locally (a hyperlocal using its own template, for example) or centrally created and managed as part of a datawire service.
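A rough sketch of the slicing step might look something like the following. The column names, the tiny inline "national" dataset and the template wording are all assumptions made for the sake of the example; a real dataset would be loaded from a file or an API instead.

```python
import csv
import io

# Made-up rows standing in for a national dataset; column names are assumptions.
NATIONAL_CSV = """area,month,claimants
Isle of Wight,September,1300
Isle of Wight,October,1250
Milton Keynes,September,2100
Milton Keynes,October,2150
"""

# One template, reused unchanged for every local area.
TEMPLATE = "{area}: {claimants} Jobseeker's Allowance claimants in {month}."


def load_rows(text):
    """Parse CSV text into a list of dicts, one per row."""
    return list(csv.DictReader(io.StringIO(text)))


def slice_by_area(rows, area):
    """Slice the local dataset for one area out of the national rows."""
    return [row for row in rows if row["area"] == area]


def local_release(rows, area):
    """Render one line per row of the local slice using the shared template."""
    return "\n".join(TEMPLATE.format(**row) for row in slice_by_area(rows, area))


rows = load_rows(NATIONAL_CSV)
for area in sorted({row["area"] for row in rows}):
    print(local_release(rows, area))
```

Once the slice works for one area, the loop at the end is all it takes to produce a release for every other area in the dataset.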
At the moment, the couple of automatically generated stories published with OnTheWight have been simple fact reporting, albeit via a human editor, rather than acting as the starting point for a more elaborate, contextualised, narrative report. But how might we extend this approach?
In the case of Jobseeker’s Allowance figures, contextualising paragraphs that mention, say, the recent closure of a local business, or the opening of another, as possible contributory factors to any month-on-month changes in the figures could add colour to a monthly report.
Or we might invert the use of the figures, adding them as context to workforce, employment or labour related stories. For example, in the event of a company closure, we might contextualise the number of jobs lost relative to local unemployment figures. (This fact augmented reporting is more likely to happen if the figures are readily available/to hand, as they are via autoresponder channels such as a Slackbot Data Wire.)
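As a sketch of that inverted use, the following generates a single contextualising sentence that sets a number of jobs lost against the local claimant count. The function name, the wording and the figures are invented for illustration, and whether claimant count is the right denominator is an editorial call rather than something the code can settle.

```python
def closure_context(jobs_lost: int, local_claimants: int, area: str) -> str:
    """Express a number of job losses relative to the local claimant count."""
    if local_claimants <= 0:
        raise ValueError("local_claimants must be a positive number")
    share = 100 * jobs_lost / local_claimants
    return (
        f"The {jobs_lost} jobs at risk are equivalent to {share:.1f}% of the "
        f"{local_claimants} people currently claiming Jobseeker's Allowance in {area}."
    )


# Invented figures, purely to show the shape of the sentence.
print(closure_context(50, 1250, "the Isle of Wight"))
```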
But I guess we have to start somewhere! And that somewhere is the simple (automatically produced, human copy edited) reporting of the facts.
PS in passing, I note via Full Fact (Department of Health to speed up responses to media and Full Fact) that the Department of Health “will provide press officers [with an internal ‘data document’] with links to sources for each factual claim made in a speech, as well as contact details for the official or analyst who provided the information”. Which gets me thinking: what form might a press office publishing “data supported press releases” take, cf. a University Expert Press Room or Social Media Releases and the University Press Office, for example?
By chance, I came across a short post by uber-ddj developer Lorenz Matzat (@lorz) on robot journalism over the weekend: Robot journalism: Revving the writing engines. Along with a mention of Narrative Science, it namechecked another company that was new to me: [b]ased in Berlin, Retresco offers a “text engine” that is now used by the German football portal “FussiFreunde”.
A quick scout around brought up this Retresco post on Publishing Automation: An opportunity for profitable online journalism [translated] and their robot journalism pitch, which includes “weekly automatic Game Previews to all amateur and professional football leagues and with the start of the new season for every Game and detailed follow-up reports with analyses and evaluations” [translated], as well as finance and weather reporting.
I asked Lorenz if he was dabbling with such things and he pointed me to AX Semantics (an Aexea GmbH project). It seems their robot football reporting product has been around for getting on for a year or so (Robot Journalism: Application areas and potential [translated]), which makes me wonder how siloed my reading has been in this area.
Anyway, it seems as if AX Semantics have big dreams. Like heralding Media 4.0: The Future of News Produced by Man and Machine:
The starting point for Media 4.0 is a whole host of data sources. They share structured information such as weather data, sports results, stock prices and trading figures. AX Semantics then sorts this data and filters it. The automated systems inside the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion. By pooling pertinent information, the system automatically pulls together an article. Editors tell the system which layout and text design to use so that the length and structure of the final output matches the required media format – with the right headers, subheaders, the right number and length of paragraphs, etc. Re-enter homo sapiens: journalists carefully craft the information into linguistically appropriate wording and liven things up with their own sugar and spice. Using these methods, the AX Semantics system is currently able to produce texts in 11 languages. The finishing touches are added by the final editor, if necessary livening up the text with extra content, images and diagrams. Finally, the text is proofread and prepared for publication.
A key technology bit is the analysis part: “the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion”. Spotting patterns and events in datasets is an area where automated journalism can help navigate the data beat and highlight things of interest to the journalist (see for example Notes on Robot Churnalism, Part I – Robot Writers for other takes on the robot journalism process). If notable features take the form of possible story points, narrative content can then be generated from them.
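I have no idea how AX Semantics actually implement their detection step, but as a toy illustration of what rule-based spotting of possible story points might look like, here is a sketch with two rules of my own invention: flag large period-on-period changes, and flag a latest figure that is the highest or lowest in the series. The function name, the threshold and the data are all assumptions.

```python
def story_points(series, threshold_pct=5.0):
    """Apply two simple rules to a chronological list of (label, value) pairs.

    Assumes at least one pair and strictly positive values.
    """
    points = []
    # Rule 1: flag any period-on-period change at or above the threshold.
    for (prev_label, prev), (label, cur) in zip(series, series[1:]):
        change = 100 * (cur - prev) / prev
        if abs(change) >= threshold_pct:
            direction = "rose" if change > 0 else "fell"
            points.append(
                f"{label}: figure {direction} {abs(change):.1f}% on {prev_label}"
            )
    # Rule 2: flag the latest figure if it is the unique high or low of the series.
    values = [value for _, value in series]
    last_label, last = series[-1]
    if len(values) > 1 and values.count(last) == 1:
        if last == max(values):
            points.append(f"{last_label}: highest figure in the period covered")
        elif last == min(values):
            points.append(f"{last_label}: lowest figure in the period covered")
    return points


# Invented monthly figures.
monthly = [("July", 1000), ("August", 1010), ("September", 1100), ("October", 1020)]
for point in story_points(monthly):
    print(point)
```

Each string returned is a candidate story point that could then be handed to a template, or simply to a journalist as a prompt for where to look.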
To support the process, it seems as if AX Semantics have been working on a markup language: ATML3 (I’m not sure what it stands for? I’d hazard a guess at something like “Automated Text ML” but could be very wrong…) A private beta seems to be in operation around it, but some hints at tooling are starting to appear in the form of ATML3 plugins for the Atom editor.
One to watch, I think…