By chance, I came across a short post by uber-ddj developer Lorenz Matzat (@lorz) on robot journalism over the weekend: Robot journalism: Revving the writing engines. Along with a mention of Narrative Science, it namechecked another company that was new to me: [b]ased in Berlin, Retresco offers a “text engine” that is now used by the German football portal “FussiFreunde”.
A quick scout around brought up this Retresco post on Publishing Automation: An opportunity for profitable online journalism [translated] and their robot journalism pitch, which includes “weekly automatic Game Previews to all amateur and professional football leagues and with the start of the new season for every Game and detailed follow-up reports with analyses and evaluations” [translated], as well as finance and weather reporting.
I asked Lorenz if he was dabbling with such things and he pointed me to AX Semantics (an Aexea GmbH project). It seems their robot football reporting product has been around for getting on for a year or so (Robot Journalism: Application areas and potential [translated]), which makes me wonder how siloed my reading has been in this area.
Anyway, it seems as if AX Semantics have big dreams. Like heralding Media 4.0: The Future of News Produced by Man and Machine:
The starting point for Media 4.0 is a whole host of data sources. They share structured information such as weather data, sports results, stock prices and trading figures. AX Semantics then sorts this data and filters it. The automated systems inside the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion. By pooling pertinent information, the system automatically pulls together an article. Editors tell the system which layout and text design to use so that the length and structure of the final output matches the required media format – with the right headers, subheaders, the right number and length of paragraphs, etc. Re-enter homo sapiens: journalists carefully craft the information into linguistically appropriate wording and liven things up with their own sugar and spice. Using these methods, the AX Semantics system is currently able to produce texts in 11 languages. The finishing touches are added by the final editor, if necessary livening up the text with extra content, images and diagrams. Finally, the text is proofread and prepared for publication.
A key technology bit is the analysis part: “the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion”. Spotting patterns and events in datasets is an area where automated journalism can help navigate the data beat and highlight things of interest to the journalist (see for example Notes on Robot Churnalism, Part I – Robot Writers for other takes on the robot journalism process). If notable features take the form of possible story points, narrative content can then be generated from them.
To support the process, it seems as if AX Semantics have been working on a markup language: ATML3 (I’m not sure what it stands for? I’d hazard a guess at something like “Automated Text ML” but could be very wrong…) A private beta seems to be in operation around it, but some hints at tooling are starting to appear in the form of ATML3 plugins for the Atom editor.
One to watch, I think…
In The Re-Birth of the “Beat”: A hyperlocal online newsgathering model (Journalism Practice 6.5-6 (2012): 754-765), Murray Dick cites various others to suggest that routine sources are responsible for generating a significant percentage of local news reports:
Schlesinger [Schlesinger, Philip (1987) Putting ‘Reality’ Together: BBC News. Taylor & Francis: London] found that BBC news was dependent on routine sources for up to 80 per cent of its output, while later [Franklin, Bob and Murphy, David (1991) Making the Local News: Local Journalism in Context. Routledge: London] established that local press relied upon local government, courts, police, business and voluntary organisations for 67 per cent of their stories (in [Keeble, Richard (2009) Ethics for Journalists, 2nd Edition. Routledge: London], p114-15).
As well as human sources, news gatherers may also look to data sources at either a local level, such as local council transparency (that is, spending data), or national data sources with a local scope as part of a regular beat. For example, the NHS publish accident and emergency statistics at the provider organisation level on a weekly basis, and nomis, the official labour market statistics publisher, publish unemployment figures at a local council level on a monthly basis. Ratings agencies such as the Care Quality Commission (CQC) and the Food Standards Agency (FSA) publish inspections data for local establishments as it becomes available, and other national agencies publish data annually that can be broken down to a local level: if you want to track car MOT failures at the postcode region level, the DVLA have the data that will help you do it.
To a certain extent, adding data sources to a regular beat, or making a beat purely from data sources enables the automatic generation of data driven press releases that can be used to shorten the production process of news reports about a particular class of routine stories that are essentially reports about “the latest figures” (see, for example, my nomis Labour Market Statistics textualisation sketch).
Data sources can also be used to support the newsgathering process by processing the data in order to raise alerts or bring attention to particular facts that might otherwise go unnoticed. Where the data has a numerical basis, this might relate to sorting a national dataset on the basis of some indicator value or other and highlighting to a particular local news outlet that their local X is in the top M or bottom N of similar establishments in the rest of the country, and that there may be a story there. Where the data has a text basis, looking for keywords might pull out paragraphs or records that are of particular interest, or running a text through an entity recognition engine such as Thomson Reuters’ OpenCalais might automatically help identify individuals or organisations of interest.
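By way of illustration, here is a minimal Python sketch of the two alerting ideas just described: ranking a national dataset on an indicator and flagging when a local establishment falls in the top M or bottom N, and pulling out paragraphs that mention keywords of interest. The function names, field names and sample records are all made up for the example; this is a sketch of the general idea, not any particular newsroom tool.

```python
def rank_alerts(records, indicator, local_name, top_m=3, bottom_n=3):
    """Flag a local establishment if it ranks in the top M or bottom N
    of a national dataset, sorted (highest first) on an indicator value."""
    ranked = sorted(records, key=lambda r: r[indicator], reverse=True)
    names = [r["name"] for r in ranked]
    if local_name not in names:
        return []
    position = names.index(local_name) + 1  # 1-based rank
    total = len(names)
    alerts = []
    if position <= top_m:
        alerts.append(f"{local_name} is ranked {position} of {total} "
                      f"on {indicator} (top {top_m}).")
    if position > total - bottom_n:
        alerts.append(f"{local_name} is ranked {position} of {total} "
                      f"on {indicator} (bottom {bottom_n}).")
    return alerts


def keyword_hits(paragraphs, keywords):
    """Return the paragraphs that mention any keyword (case-insensitive)."""
    lowered = [k.lower() for k in keywords]
    return [p for p in paragraphs if any(k in p.lower() for k in lowered)]


# Hypothetical data, invented purely for the example
records = [
    {"name": "Home A", "score": 10},
    {"name": "Home B", "score": 7},
    {"name": "Home C", "score": 5},
    {"name": "Home D", "score": 2},
]
print(rank_alerts(records, "score", "Home D", top_m=1, bottom_n=1))
print(keyword_hits(["The inspection found no issues.",
                    "Enforcement action was taken."], ["enforcement"]))
```

A rule this simple only says "there may be a story here"; deciding whether there actually is one is still a job for the journalist.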
In the context of this post, I will be considering the role that metadata about court cases that is contained within court lists and court registers might have to play in helping news media identify possibly newsworthy stories arising from court proceedings. I will also explore the extent to which the metadata may be processed, both in order to help identify court proceedings that may be worth reporting on, as well as to produce statistical summaries that may in themselves be newsworthy and provide a more balanced view over the activity of the courts than the impression one might get about their behaviour simply from the balance of coverage provided by the media.
The front page of this week’s Isle of Wight County Press describes a tragic incident relating to a particular care home on the Island earlier this year:
(Unfortunately, the story doesn’t seem to appear on the County Press’ website? Must be part of a “divide the news into print and online, and never the twain shall meet” strategy?!)
As I recently started pottering around the CQC website and various datasets they publish, I thought I’d jot down a few notes about what I could find. The clues from the IWCP article were the name of the care home – Waxham House, High Park Road, Ryde – and the proprietor – Sanjay Ramdany.
Using the “CQC Care Directory – With Filters” from the CQC data and information page, I found a couple of homes registered to that provider.
1-120578256, 19/01/2011, Waxham House, 1 High Park Road, Ryde, Isle of Wight, PO33 1BP
1-120578313, 19/01/2011, Cornelia Heights, 93 George Street, Ryde, Isle of Wight, PO33 2JE
1-101701588, Mr Sanjay Prakashsingh Ramdany & Mrs Sandhya Kumari Ramdany
Looking up “Waxham House” on the CQC website gives us a copy of the latest report outcome:
Looking at the breadcrumb navigation, it seems we can directly get a list of other homes operated by the same proprietors:
I wonder if we can search the site by proprietor name too?
Looks like it…
So how did their other home fare?
By the by, according to the Food Standards Agency, how’s the food?
And how much money is the local council paying these homes?
[Click through on the image to see the app – hit Search to remove the error message and load the data!]
Why the refunds?
A check on OpenCorporates for director names turned up nothing.
I’m not trying to offer any story here about the actual case reported by the County Press, more a partial story about how we can start to look for data around a story to see if there may be more to the story we can find from open data sources.
Via @simonperry, news that AP will use robots to write some business stories (Automated Insights are one of several companies I’ve been tracking over the years who are involved in such activities, eg Notes on Narrative Science and Automated Insights).
The claim is that using algorithms to do the procedural writing opens up time for the journalists to do more of the sensemaking. One way I see this is that we can use data2text techniques to produce human readable press releases of things like statistical releases, which has a couple of advantages at least.
Firstly, the grunt – and error prone – work of running the numbers (calculating month on month or year on year changes, handling seasonal adjustments etc) can be handled by machines using transparent and reproducible algorithms. Secondly, churning numbers into simple words (“x went up month on month from Sept 2013 to Oct 2013 and down year on year from 2012”) makes them searchable using words, rather than having to write our own database or spreadsheet queries with lots of inequalities in them.
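As a rough sketch of what that might look like in practice, the following Python snippet calculates month on month and year on year changes and turns them into a simple sentence. The function names, the series label and the figures are all invented for the example, and it makes no attempt at seasonal adjustment.

```python
def direction(change):
    """Turn a numerical change into a simple phrase."""
    if change > 0:
        return "went up"
    if change < 0:
        return "went down"
    return "was unchanged"


def textualise(series, label, current, previous_month, previous_year):
    """Describe month on month and year on year changes in words.

    `series` maps period labels to values; the period names are passed in
    explicitly so the arithmetic stays transparent and reproducible."""
    now = series[current]
    mom = now - series[previous_month]  # month on month change
    yoy = now - series[previous_year]   # year on year change
    return (f"{label} {direction(mom)} month on month from {previous_month} "
            f"to {current} ({series[previous_month]:,} to {now:,}) and "
            f"{direction(yoy)} year on year from {previous_year} "
            f"({series[previous_year]:,}).")


# Made-up figures, purely for illustration
figures = {"Oct 2012": 1500, "Sept 2013": 1200, "Oct 2013": 1250}
print(textualise(figures, "The claimant count", "Oct 2013", "Sept 2013", "Oct 2012"))
```

Because the output is plain text, a search for "went up month on month" across a pile of generated releases does the job that would otherwise need a query full of inequalities.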
In this respect, something that’s been on my to do list for way too long is to produce some simple “press release” generators based on ONS releases (something I touched on in Data Textualisation – Making Human Readable Sense of Data).
Matt Waite’s upcoming course on “automated story bots” looks like it might produce some handy resources in this regard (code repo). In the meantime, he already shared the code described in How to write 261 leads in a fraction of a second here: ucr-story-bot.
For the longer term, on my “to ponder” list is what might something like “The Grammar of Graphics” be for data textualisation? (For background, see A Simple Introduction to the Graphing Philosophy of ggplot2.)
For example, what might a ggplot2 inspired gtplot library look like for converting data tables not into chart elements, but textual elements? Does it even make sense to try to construct such a grammar? What would the corollaries to aesthetics, geoms and scales be?
I think I perhaps need to mock up some examples to see if anything comes to mind about what the function names, as well as the outputs, might look like, let alone the code to implement them! Or maybe code first is the way, to get a feel for how to build up the grammar from sensible looking implementation elements? Or more likely, perhaps a bit of iteration may be required?!
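As a very first mock-up (in Python rather than R, and with every name other than gtplot invented for the purpose), here is one guess at how the ggplot2 ideas might carry over: an "aesthetic" mapping from table columns to textual roles, a "scale" that maps numbers onto words rather than positions, and a "geom" that renders each row as a sentence. It is a toy to think with, not a proposal for how such a grammar should actually be designed.

```python
def scale_magnitude(value, small=1.0, large=5.0):
    """A 'scale': map the size of a change onto a word, not a position.
    The thresholds are arbitrary defaults chosen for the example."""
    size = abs(value)
    if size < small:
        return "slightly"
    if size < large:
        return "moderately"
    return "sharply"


def text_sentence(row, aes):
    """A 'geom': render one row as a sentence via the aesthetic mapping."""
    subject = row[aes["subject"]]
    change = row[aes["change"]]
    if change == 0:
        return f"{subject} was unchanged."
    verb = "rose" if change > 0 else "fell"
    return f"{subject} {verb} {scale_magnitude(change)} ({change:+.1f})."


def gtplot(rows, aes, geom=text_sentence):
    """Apply a text 'geom' to each row of a table and join the results."""
    return " ".join(geom(row, aes) for row in rows)


# Illustrative table with made-up values
rows = [
    {"area": "Area A", "pct_change": 6.2},
    {"area": "Area B", "pct_change": -0.4},
]
print(gtplot(rows, aes={"subject": "area", "change": "pct_change"}))
# prints: Area A rose sharply (+6.2). Area B fell slightly (-0.4).
```

Swapping in a different geom function (a ranked list, say, or a comparison between two rows) would be the textual equivalent of swapping geom_point for geom_bar.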
Annotated slides from my opening talk at the University of Lincoln Journalism dept. research day – Data Journalism – Having Conversations with Data:
@digiphile’s been doing some digging around current popular usage of the phrase data journalism – here are my recollections…
My personal recollection of the current vogue is that “data driven journalism” was the phrase that dominated the discussions/community I was witness to around early 2009, though for some reason my blog doesn’t give any evidence for that (must take better contemporaneous notes of first noticings of evocative phrases;-). My route in was via “mashups”, mashup barcamps, and the like, where folk were experimenting with building services on newly emerging (and reverse engineered) APIs; things like crime mapping and CraigsList maps were in the air – putting stuff on maps was very popular I seem to recall! Yahoo were one of the big API providers at the time.
I noted the launch of the Guardian datablog and datastore in my personal blog/notebook here – http://blog.ouseful.info/2009/03/10/using-many-eyes-wikified-to-visualise-guardian-data-store-data-on-google-docs/ – though for some reason don’t appear to have linked to a launch post. With the arrival of the datastore it looked like there were to be “trusted” sources of data we could start to play with in a convenient way, accessed through Google docs APIs:-) Some notes on the trust thing here: http://blog.ouseful.info/2009/06/08/the-guardian-openplatform-datastore-just-a-toy-or-a-trusted-resource/
NESTA did an event on News Innovation London in July 2009, a review of which by @kevglobal mentions “discussions about data-driven journalism” (sic on the hyphen). I seem to recall that Journalism.co.uk (@JTownend) were also posting quite a few noticings around the use of data in the news at the time.
At some point, I did a lunchtime at the Guardian for their developers – there was a lot about Yahoo Pipes, I seem to remember! (I also remember pitching the Guardian Platform API to developers in the OU as a way of possibly getting fresh news content into courses. No-one got it…) I recall haranguing Simon Rogers on a regular basis about their lack of data normalisation (which I think in part led to the production of the Rosetta Stone spreadsheet) and their lack of use (at the time) of fusion tables. Twitter archives may turn something up there. Maybe Simon could go digging in the Twitter archives…?;-)
There was a session on related matters at the first(?) news:rewired event in early 2010, but I don’t recall the exact title of the session (I was in a session with Francis Irving/@frabcus from the then nascent Scraperwiki): http://blog.ouseful.info/2010/01/14/my-presentation-for-newsrewired-doing-the-data-mash/ Looking at the content of that presentation, it’s heavily dominated by notions of data flow; the data driven journalism (hence #ddj) phrase seemed to fit this well.
Later that year, in the summer, there was a roundtable event hosted by the EJC on “data driven journalism” – I recall meeting Mirko Lorenz there (who maybe had a background in business data? and since helped launch datawrapper.de) and Jonathan Gray – who then went on to help edit the Data Journalism handbook – among others.
For me the focus at the time was very much on using technology to help flow data into useable content, (eg in a similar but perhaps slightly weaker sense than the more polished content generation services that Narrative Science/Automated Insights have since come to work on, or other data driven visualisations or what I guess we might term local information services; more about data driven applications with a weak local news/specific theme or issue general news relevance, perhaps). I don’t remember where the sense of the journalist was in all this – maybe as someone who would be able to take the flowed data, or use tools that were being developed to get the stories out of data with tech support?
My “data driven journalism” phrase notebook timeline
My “data journalist” phrase notebook timeline
My first blogged use of the data journalism phrase, in quotes, as it happens, so it must have been a relatively new sounding phrase to me, was here: http://blog.ouseful.info/2009/05/20/making-it-a-little-easier-to-use-google-spreadsheets-as-a-database-hopefully/ (h/t @paulbradshaw)
Seems like my first use of the “data journalist” phrase was in picking up on a job ad – so certainly the phrase was common to me by then.
As a practice and a commonplace, things still seemed to be developing in 2011 enough for me to comment on a situation where the Guardian and Telegraph teams were co-opetitively bootstrapping each other’s ideas: http://blog.ouseful.info/2011/09/13/data-journalists-engaging-in-co-innovation/
I guess the deeper history of CAR, database journalism, precision journalism may throw off trace references, though maybe not representing situations that led to the phrase gaining traction in “popular” usage?
Certainly, now I’m wondering what the relative rise in popularity of “data journalist” versus “data journalism” was? Certainly, for me, “data driven journalism” was a phrase I was familiar with way before the other two, though I do recall a sense of unease about its applicability to news stories that were perhaps “driven” by data more in the sense of being motivated or inspired by it, or whose origins lay in a data set, rather than “driven” in a live, active sense of someone using an interface that was powered by flowing data.
I was pleased to be invited back to the University of Lincoln again yesterday to give a talk on data journalism to a couple of dozen or so journalism students…
I was hoping to generate a copy of the slides (as images) embedded in a markdown version of the notes but couldn’t come up with a quick recipe for achieving that…
When I get a chance, it looks as if the easiest way will be to learn some VBA/Visual Basic for Applications macro scripting… So for example:
If anyone beats me to it, I’m actually on a Mac, so from the looks of things on Stack Overflow, hacks will be required to get the VBA to actually work properly?