Tagged: ddj

“Local stories of national interest” – New Johnston Press (Data Journalism) Investigations Unit

Complementing the approach of Trinity Mirror, who launched a cross-group data journalism unit back in 2013, Johnston Press has pulled together a (virtual?) Investigations Unit made up from several investigative and data skilled reporters from across the Johnston Press regional titles (press release).

The unit’s first campaign is focussed on sentences awarded for causing death by dangerous driving. The campaign allows the unit to report on national datasets, as such, as well as developing local stories based on examples taken from the national dataset, bubbling up local stories to wider national interest as campaign hooks. From the press release announcing the launch of the unit, it seems as if this campaigning style of national/local investigative reporting will underpin the unit’s activities.

“As well as carrying out investigations, and telling powerful human interest stories, the unit has a campaigning and lobbying role at its heart” – Johnston Press press release.

The use of campaigns means the same theme can be kept alive and repeatedly reported on as an ongoing series over an extended period of time: tracked nationally but reported in a local context on the one hand, promoting local campaigns and then reporting them widely on the other.

The national/local model is one that I’ve long thought makes sense, though I’ve not really considered it in terms of the local to national twist. Instead, I’ve been framing it as an opportunity to address centrally common pain points that may be experienced trying to produce a story from data at a local level, as discussed in these thoughts on a locally targeted, nationally scoped datawire.

National dataset local story

One advantage of this approach is scale: graphics communicating national level statistics can be produced centrally and reused across local titles, perhaps with local customisation; local stories can be used to provide relevance to generic “national context” inserts reused across titles; and story templates can be customised to generate local reports from the same national dataset.

Another advantage with looking at national datasets is that they can help flag the newsworthiness of a local story given its national context (for example, national rankings generate story points for the top M, bottom N rankings).
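
As a minimal sketch of the ranking idea, assuming the national dataset has already been reduced to a single figure per area: the area names and values below are made up for illustration, and flag_rankings is a hypothetical helper rather than part of any existing tool.

```python
def flag_rankings(figures, top_m=3, bottom_n=3):
    """Return story-point flags for areas ranked in the top M or bottom N.

    figures: dict mapping area name -> value (a higher value ranks higher).
    """
    ranked = sorted(figures, key=figures.get, reverse=True)
    total = len(ranked)
    flags = {}
    for position, area in enumerate(ranked, start=1):
        if position <= top_m:
            flags[area] = "ranked {} of {} (top {})".format(position, total, top_m)
        elif position > total - bottom_n:
            flags[area] = "ranked {} of {} (bottom {})".format(position, total, bottom_n)
    return flags


# Illustrative figures only, not real data
national = {
    "Area A": 12.1, "Area B": 3.4, "Area C": 8.8, "Area D": 15.0,
    "Area E": 6.2, "Area F": 1.9, "Area G": 9.5, "Area H": 4.7,
}

story_points = flag_rankings(national, top_m=2, bottom_n=2)
for area, note in sorted(story_points.items()):
    print(area, "-", note)
```

Areas in the middle of the table get no flag at all, which is the point: the national context decides whether the local figure is worth a second look.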

I haven’t spent much time thinking about the campaign aspect, but on quick reflection I think that campaigns can act as nice wrappers for a wider range of templated activities and outputs.

For example, I’ve written a couple of times about the notion of story templates, noting how these have been rolled out in previous years by at least the Johnston Press and Trinity Mirror (Local News Templates – A Business Opportunity for Data Journalists?).

And eighteen months or so ago, I was fortunate enough to spend a couple of days seeing how Ruby Kitchen, then of the Harrogate Advertiser, now of the Yorkshire Post / Yorkshire Evening Post and the Johnston Press Investigations Unit, worked on a Food Standards Agency story (Data Journalism in Practice). One of the takeaways for me from that was what was involved in actually making use of leads thrown up from a data trawl and then chasing down people for comment. The work involved in putting together an investigation at a single local level may need to be repeated for other locales, but the process can be reused – the investigatory process can be templated.

On the way back home from Harrogate, I’d started fantasising about putting together a training pack based on the Food Standards Agency food hygiene ratings data (h/t Andy Dickinson for tangentially reminding me of this a couple of days ago :-), with a dual objective in mind: firstly, to produce a training pack for demonstrating various aspects of how to practically work with national datasets at a local level; secondly, to template a data journalism investigation that could be worked through by local or hyperlocal journalists, or journalism students, to produce a feature on local food hygiene ratings. (It’s still sitting on the to do pile… Maybe I should have tried kickstarter!)

(Note that it’s not just news organisations that can scale templated systems, or reuse locally developed solutions for national benefit. For example, see the post Putting Public Open Data to Work…? for several examples of online services developed by local councils and used to publish local data that can also be scaled across other council areas.)

Whilst newspaper groups such as Trinity Mirror or Johnston Press have the scale in terms of the number of local outlets to merit a co-ordinated centre reducing the pain once for working with national datasets and then scaling out the benefits across the regional and local titles, independent hyperlocals are often more resource bound when it comes to pursuing investigations (though The Bristol Cable among others repeatedly shows how hyperlocal led investigations are possible).

Whilst I keep not starting to properly scope a hyperlocal datawire service, Will Perrin’s Local News Engine seems to have gained some traction in its development recently (Early proof of concept for Local News Engine [code]). This service “is testing the theory that story leads can be found in local data where a newsworthy person or place is engaged in a newsworthy activity”, searching local datasources (license applications, planning applications) for notable names (see for example What data are we using in Local News Engine? and Who, what and where is newsworthy for Local News Engine?). The approach taken – named entity extraction cross-referenced with the names of local notables – complements an alternative approach that I favour for the datawire that would flag local stories from national datasets based on things like top N, bottom M rankings, outliers, notable trends or dramatic change in statistics for a local area from a national dataset based on a comparison with previous data releases, other locales and national averages.
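
A rough sketch of the sort of flagging a datawire might do when comparing a release with the previous one and with the national average. The thresholds, area names and figures are all invented for illustration, and flag_changes is a hypothetical helper; a real service would need better statistics than a simple z-score.

```python
from statistics import mean, pstdev


def flag_changes(current, previous, change_threshold=0.25, z_threshold=1.5):
    """Flag areas with a large change since the previous release, or with
    a value far from the national average (simple z-score test)."""
    avg = mean(current.values())
    spread = pstdev(current.values())
    flags = {}
    for area, value in current.items():
        notes = []
        old = previous.get(area)
        if old:  # skip areas with no (or a zero) previous figure
            change = (value - old) / old
            if abs(change) >= change_threshold:
                notes.append("{:+.0%} on previous release".format(change))
        if spread and abs(value - avg) / spread >= z_threshold:
            notes.append("outlier against national average of {:.1f}".format(avg))
        if notes:
            flags[area] = notes
    return flags


# Invented figures for two successive data releases
previous = {"Area A": 100, "Area B": 80, "Area C": 95, "Area D": 90, "Area E": 105}
current = {"Area A": 102, "Area B": 120, "Area C": 97, "Area D": 88, "Area E": 300}

flags = flag_changes(current, previous)
for area, notes in flags.items():
    print(area, "-", "; ".join(notes))
```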

PS you can tell this is a personal blog post, not a piece of journalism – I didn’t reach out to anyone from the Johnston Press, or Trinity Mirror, or get in touch with Will Perrin to check facts or ask for comment. It’s all just my personal comment, bias, interpretation and opinion….

PPS See also Archant’s Investigations Unit (2015 announcement) – h/t Andy Dickinson.

Charting Terrorism Related Arrest Flows Through The Criminal Justice System

One of my daily read feeds is a list of the day’s government statistical releases. Today, I spotted a release on the Operation of police powers under the Terrorism Act 2000, quarterly update to September 2015, which included an annex on Arrests and outcomes, year ending September 2015:


I tweeted a link to the doc, and Michael/@fantasticlife replied with a comment that it might look interesting as a Sankey diagram…


So here’s a quick sketch generated using SankeyMATIC:


I took the liberty of adding an extra “InSystem” step into the chart to account for the feedback loop of the bailed arrests.

Here’s the data I used:

Arrested [192] InSystem
Arrested [115] Released without charge
Arrested [8] Alternative action
InSystem [124] Charged
InSystem [68] Released on bail
Charged [111] Terrorism Related
Charged [13] Non-terrorism related
Terrorism Related [36] Prosecuted.t
Terrorism Related [1] Not proceeded against
Terrorism Related [74] Awaiting prosecution
Non-terrorism related [6] Prosecuted.n
Non-terrorism related [2] Not proceeded against
Non-terrorism related [5] Awaiting prosecution
Prosecuted.t [33] Convicted (terrorism related)
Prosecuted.t [2] Convicted (non-terrorism related)
Prosecuted.t [1] Acquitted
Prosecuted.n [5] Convicted (non-terrorism related)
Prosecuted.n [1] Acquitted
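
Before drawing anything, it can be worth checking that the flows add up, since a typo in one number will quietly distort the diagram. The following is a quick sketch of such a check, assuming the simple "Source [amount] Target" line format used above; check_flows is a hypothetical helper of my own, not part of SankeyMATIC.

```python
import re
from collections import defaultdict

# One flow per line: Source [amount] Target
FLOW = re.compile(r"^(.+?)\s*\[(\d+)\]\s*(.+?)\s*$")


def check_flows(text):
    """Return {node: (inflow, outflow)} for intermediate nodes that don't balance."""
    inflow, outflow = defaultdict(int), defaultdict(int)
    for line in text.strip().splitlines():
        match = FLOW.match(line.strip())
        if not match:
            continue  # ignore blank or non-flow lines
        source, amount, target = match.group(1), int(match.group(2)), match.group(3)
        outflow[source] += amount
        inflow[target] += amount
    # Only nodes with flows both in and out can be checked for balance
    return {node: (inflow[node], outflow[node])
            for node in inflow
            if node in outflow and inflow[node] != outflow[node]}


data = """
Arrested [192] InSystem
Arrested [115] Released without charge
Arrested [8] Alternative action
InSystem [124] Charged
InSystem [68] Released on bail
Charged [111] Terrorism Related
Charged [13] Non-terrorism related
Terrorism Related [36] Prosecuted.t
Terrorism Related [1] Not proceeded against
Terrorism Related [74] Awaiting prosecution
Non-terrorism related [6] Prosecuted.n
Non-terrorism related [2] Not proceeded against
Non-terrorism related [5] Awaiting prosecution
Prosecuted.t [33] Convicted (terrorism related)
Prosecuted.t [2] Convicted (non-terrorism related)
Prosecuted.t [1] Acquitted
Prosecuted.n [5] Convicted (non-terrorism related)
Prosecuted.n [1] Acquitted
"""

unbalanced = check_flows(data)
print(unbalanced or "All intermediate nodes balance")
```

Source-only nodes (Arrested) and sink-only nodes (the final outcomes) are skipped, because there is nothing to compare them against.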

Looking at the diagram, I find the placement of the labels quite confusing and I’m not really sure what relates to what. (The numbers, for example…) It would also be neater if we could capture flows still “in-the system”, for example by stopping the Released on bail element at the same depth as the Charged elements, and also keeping the Awaiting prosecution element short of the right hand side. (Perhaps bail and awaiting elements could be added into a “limbo” field?)

So – nice idea, but even a quick look at a trivial sketch immediately identifies all sorts of other issues that you need to take into account to make the diagram informatively glanceable…


Thinks.. SankeyMATIC is a d3.js app… it would be nice if I could drag the elements in the generator to make the diagram a bit clearer… maybe I can?

[Image: sankeymatic_1000x800 (1)]

Only that’s wrong too… because the InSystem label applies to the boundary to the left, and the Bail label to the right… So we need to tweak it a bit more…

[Image: sankeymatic_1200x800 (1)]

In fact, you may notice that the labels seem to be applied left and right justified according to different rules? Hmmm… Not so simple again…

How about if I take out the interstitial value I added?

[Image: sankeymatic_1200x800 (2)]

That’s perhaps a bit clearer? And all goes some way to showing how constructing a graphic is generally an iterative process, scaffolding the evolution of the diagram as you go, as you learn to see it/read it from different perspectives and tweak it to try to clarify particular communicative messages? (Which in this case, for me, was to try to tease out how far through the process various flows had got, as well as clearly identify final outcomes…)

Other things we could do to try to improve the graphic include experimenting a bit more with the colour schemes. But that’s left as an exercise for the reader…;-)

Some Idle Thoughts on Managing Temporal Posts in WordPress

Now that I’ve got a couple of my own WordPress blogs running off the back of my Reclaim Hosting account, I’ve started to look again at possible ways of tinkering with WordPress.

The first thing I had a look at was posting a draft WordPress post from a script.

Using a WordPress role editor plugin (e.g. along the lines of this User Role Editor) it’s easy enough to create a new role with edit and upload permissions only [WordPress roles and capabilities], and create a new ‘autoposter’ user with that role. Code like the following then makes it easy enough to upload an image to WordPress, grab the URL, insert it into a post, and then submit the post – where it will, by default, appear as a draft post:

#Ish Via: http://python-wordpress-xmlrpc.readthedocs.org/en/latest/examples/media.html
from wordpress_xmlrpc import Client, WordPressPost
from wordpress_xmlrpc.compat import xmlrpc_client
from wordpress_xmlrpc.methods import media, posts
from wordpress_xmlrpc.methods.posts import NewPost

wp = Client('http://blog.example.org/xmlrpc.php', ACCOUNTNAME, ACCOUNT_PASSWORD)

def wp_simplePost(client,title='ping',content='pong, <em>pong</em>'):
    post = WordPressPost()
    post.title = title
    post.content = content
    response = client.call(NewPost(post))
    return response

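def wp_uploadImageFile(client,filename):
    # map file extensions onto the MIME type declared in the upload
    mimes={'png':'image/png', 'jpg':'image/jpeg'}
    mimetype = mimes[filename.rsplit('.',1)[-1].lower()]

    # prepare metadata
    data = {
            'name': filename,
            'type': mimetype,  # mimetype
    }

    # read the binary file and let the XMLRPC library encode it into base64
    with open(filename, 'rb') as img:
            data['bits'] = xmlrpc_client.Binary(img.read())

    # the response is a dict that includes the 'url' of the uploaded file
    response = client.call(media.UploadFile(data))
    return response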

def quickTest():
    txt = "Hello World"
    txt=txt+'<img src="{}"/><br/>'.format(wp_uploadImageFile(wp,'hello2world.png')['url'])
    return txt


Dabbling with this then got me thinking about the different sorts of things that WordPress allows you to publish in general. It seems to me that there are essentially three main types of thing you can publish:

  1. posts: the timestamped elements that appear in a reverse chronological order in a WordPress blog. Posts can also be tagged and categorised and viewed via a tag or category page. Posts can be ‘persisted’ at the top of the posts page by setting them as a “sticky” post.
  2. pages: static content pages typically used to contain persistent, unchanging content. For example, an “About” page. Pages can also be organised hierarchically, with child subpages defined relative to a specified ‘parent’ page.
  3. sidebar elements and widgets: these can contain static or dynamic content.

(By the by, a range of third party plugins appear to support the conversion of posts to pages, for example Post Type Switcher [untested] or the bulk converter Convert Post Types [untested].)

Within a page or a post, we can also include a shortcode element that can be used to include a small piece of templated text, or text generated from the execution of some custom code (which it seems could be python: running a python script from a WordPress shortcode). Shortcodes run each time a page is loaded, although you can use the WordPress Transients database API to implement a simple cache for them to improve performance (eg as described here and here).

Within a post, page or widget, we can also embed dynamic content. For example, we could embed a map that displays dynamically created markers that are essentially out of the control of the page or post publisher. Note that by default WordPress strips iframes from content (and it also seems reluctant to allow the upload of html files to the media gallery, at least by default). The preferred way to include custom embedded content seems to be to define a shortcode to embed the required content, although there are plugins around that allow you to embed iframes. (I didn’t spot one that let you inline the content of the iframe using srcdoc though?)

When we put together the Isle of Wight planning applications : Mapped page, one of the issues related to how updates to the map should be posted over time.


That is, should the map be uploaded to a fixed page and show only the most recent data, should it be posted as a timestamped post, to provide archival copies of the page, or should it be posted to a page and support a timeslider/history function?

Thinking about this again, the distinction seems to rely on what sort of (re)discovery we want to encourage or support. For example, if the page is a destination page, then we should probably use a page with a fixed URL for the most recent map. Older maps could be accessed via archive links, or perhaps subpages, if a time-filter wasn’t available on a single map view. Alternatively, we might want to alert readers to the map, in which case it might make more sense to use a timestamped post. (We could of course use a post to announce an update to the page, perhaps including a screenshot of the latest map in the post.)

It also strikes me that we need to consider publication schedules by a news outlet compared to the publication schedules associated with a particular dataset.

For example, Land Registry House Prices Paid data is published on a monthly basis a few weeks after each month the data has been collected for. In this case, it probably makes sense to publish on a monthly basis.

But what about care home or food outlet inspection data? The CQC publish data as it becomes available, although searches support the retrieval of data for a particular area published over the last week or last month relative to the time the search is made. The Food Standards Agency produce updates to data download files on a daily basis, but the file for any particular area is only updated when it contains new data. (So on any given day, you don’t know which, if any, area files will be updated.)

In this case, it may well be that a news outlet may want to do a couple of things:

  • publish summaries of reports over the last week or last month, on a weekly or monthly schedule – “The CQC published reports for N care homes in the region over the last month, of which X were positive and Y were negative”, etc.
  • engage in a more immediate or responsive publication of stories around particular reports as they are published by the responsible agency. In this case, the journalist needs to find a way of discovering stories in a timely fashion, either through signing up to alerts or inspecting the agency site on a regular basis.
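
The first of those options, the scheduled summary, is essentially a filter on publication date followed by a count. A minimal sketch, assuming each report has already been reduced to a publication date and a crude positive/negative outcome: the records and field names below are made up, and real CQC data uses its own fields and rating labels.

```python
from datetime import date, timedelta


def summarise_reports(reports, today, days=30):
    """Count reports published in the last `days` days and fill a sentence template."""
    cutoff = today - timedelta(days=days)
    recent = [r for r in reports if cutoff < r["published"] <= today]
    positive = sum(1 for r in recent if r["outcome"] == "positive")
    negative = sum(1 for r in recent if r["outcome"] == "negative")
    return ("The CQC published reports for {n} care homes in the region over "
            "the last {days} days, of which {x} were positive and {y} were "
            "negative.").format(n=len(recent), days=days, x=positive, y=negative)


# Made-up records for illustration only
reports = [
    {"home": "Home 1", "published": date(2015, 11, 2), "outcome": "positive"},
    {"home": "Home 2", "published": date(2015, 11, 20), "outcome": "negative"},
    {"home": "Home 3", "published": date(2015, 9, 14), "outcome": "positive"},
    {"home": "Home 4", "published": date(2015, 11, 28), "outcome": "positive"},
]

summary = summarise_reports(reports, today=date(2015, 11, 30))
print(summary)
```

Running the same function with days=7 on a weekly schedule gives the weekly version of the same summary.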

Again, it might be that we can use posts and pages in a complementary way: pages that act as fixed destination sites with a fixed URL, and perhaps links off to archived historical sub-pages, as well as related news stories, that contain the latest summary; and posts that announce timely reports as well as ‘page updated’ announcements when the slower-changing page is updated.

More abstractly, it probably makes sense to consider the relative frequencies with which data is originally published (also considering whether the data is published according to a fixed schedule, or in a more responsive way as and when data becomes available), the frequency with which journalists check the data site, and the frequency with which journalists actually publish data related stories.

Robot Journalists or Robot Press Secretaries? Why Automated Reporting Doesn’t Need to be That Clever

Automated content generators, aka robot journalists, are turning up everywhere at the moment, it seems: the latest to cross my radar is a mention of “Dreamwriter” from Chinese publisher Tencent (End of the road for journalists? Tencent’s Robot reporter ‘Dreamwriter’ churns out perfect 1,000-word news story – in 60 seconds) to add to the other named narrative language generating bots I’m aware of, such as Automated Insights’ Wordsmith and Narrative Science’s Quill.

Although I’m not sure of the detail, I assume that all of these platforms make use of quite sophisticated NLG (natural language generation) algorithms, to construct phrases, sentences, paragraphs and stories from atomic facts, identified story points, journalistic tropes and encoded linguistic theories.

One way of trying to unpick the algorithms is to critique, or even try to reverse engineer, stories known to be generated by the automatic content generators, looking for clues as to how they’re put together. See for example this recent BBC News story on Robo-journalism: How a computer describes a sports match.

Chatting to media academic Konstantin Dörr/@kndoerr in advance of the Future of Journalism conference in Cardiff last week (I didn’t attend the conference, just took the opportunity to grab a chat with Konstantin a couple of hours before his presentation on the ethical challenges of algorithmic journalism) I kept coming back to thoughts raised by my presentation at the Community Journalism event the day before [unannotated slides] about the differences between what I’m trying to explore and these rather more hyped-up initiatives.

In the first place, as part of the process, I’m trying to stay true to posting relatively simple – and complete – recipes that describe the work I’m doing so that others can play along. Secondly, in terms of the output, I’m not trying to do the NLG thing. Rather, I’m taking a template based approach – not much more than a form letter mail merge approach – to putting data into a textual form. Thirdly, the audience for the output is not the ultimate reader of a journalistic piece; rather, the intended audience is an intermediary, a journalist or researcher who needs an on-ramp providing them with useable access to data relevant to them that they can then use as the possible basis for a story.

In other words, the space I’m exploring is in part supporting end-user development / end user programming (for journalist end-users, for example), in part automated or robotic press secretaries (not even robot reporters; see for example Data Reporting, not Data Journalism?) – engines that produce customised press releases from a national dataset at a local level that report a set of facts in a human readable way, perhaps along with supporting assets such as simple charts and very basic observational analysis (this month’s figures were more than last month’s figures, for example).

This model is one that supports a simple templated approach for a variety of reasons:

  1. each localised report has the same form as any other localised report (eg a report on jobseeker’s allowance figures for the Isle of Wight can take the same form as a report for Milton Keynes);
  2. it doesn’t matter so much if the report reads a little strangely, as long as the facts and claims are correct, because the output is not intended for final publication, as is, to the public – rather, it could be argued that it’s more like a badly written, fact based press statement that at least needs to go through a copy editor! In other words, we can start out scruffy…
  3. the similarity in form of one report to another is not likely to be a distraction to the journalist in the way that it would be to a general public reader presented with several such stories and expecting an interesting – and distinct – narrative in each one. Indeed, the consistent presentation might well aid the journalist in quickly spotting the facts and deciding on a storyline and what contextualisation may be required to add further interpretative value to it.
  4. targeting intermediary users rather than end users: the intermediary users all get to add their own value or style to the piece before the wider publication of the material, or use the data in support of other stories. That is, the final published form is not decided by the operator of the automatic content generator; rather, the automatically generated content is there to be customised, augmented, or used as supporting material, by an intermediary, or simply act as a “conversational” representation of a particular set of data provided to an intermediary.


The generation of the local datasets from the national dataset is trivial – having generated code to slice out one dataset (by postcode or local authority, for example), we can slice out any other. The generation of the press releases from the local datasets can make use of the same template. This can be applied locally (a hyperlocal using its own template, for example) or centrally created and managed as part of a datawire service.
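
To show how little is involved, here is a sketch of the slice-then-template step, not much more than a mail merge. It assumes the national dataset is a list of rows with area, month and claimant count; the field names, the template wording and the figures are all invented for illustration.

```python
TEMPLATE = ("In {area}, {count} people claimed Jobseeker's Allowance in {month}, "
            "{direction} {previous} the month before.")


def slice_area(rows, area):
    """Pull one area's figures out of the national dataset, keyed by month."""
    return {r["month"]: r["claimants"] for r in rows if r["area"] == area}


def press_release(rows, area, month, previous_month):
    """Fill the shared template with one area's figures."""
    local = slice_area(rows, area)
    count, previous = local[month], local[previous_month]
    if count > previous:
        direction = "up from"
    elif count < previous:
        direction = "down from"
    else:
        direction = "unchanged from"
    return TEMPLATE.format(area=area, count=count, month=month,
                           direction=direction, previous=previous)


# Invented figures, not real claimant counts
national = [
    {"area": "Isle of Wight", "month": "2015-09", "claimants": 1500},
    {"area": "Isle of Wight", "month": "2015-10", "claimants": 1560},
    {"area": "Milton Keynes", "month": "2015-09", "claimants": 2300},
    {"area": "Milton Keynes", "month": "2015-10", "claimants": 2210},
]

releases = [press_release(national, area, "2015-10", "2015-09")
            for area in ("Isle of Wight", "Milton Keynes")]
for text in releases:
    print(text)
```

The output reads a little stiffly ("In Isle of Wight…"), which is fine given that it is meant for a journalist or copy editor rather than for publication as is.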

At the moment, the couple of automatically generated stories published with OnTheWight have been simple fact reporting, albeit via a human editor, rather than acting as the starting point for a more elaborate, contextualised, narrative report. But how might we extend this approach?

In the case of Jobseeker’s Allowance figures, contextualising paragraphs such as the recent closure of a local business, or the opening of another, as possible contributory factors to any month on month changes to the figures, could add colour or contextualisation to a monthly report.

Or we might invert the use of the figures, adding them as context to workforce, employment or labour related stories. For example, in the event of a company closure, we could contextualise the number of jobs lost relative to local unemployment figures. (This fact augmented reporting is more likely to happen if the figures are readily available/to hand, as they are via autoresponder channels such as a Slackbot Data Wire.)

But I guess we have to start somewhere! And that somewhere is the simple (automatically produced, human copy edited) reporting of the facts.

PS in passing, I note via Full Fact that the Department of Health “will provide press officers [with an internal ‘data document’] with links to sources for each factual claim made in a speech, as well as contact details for the official or analyst who provided the information”, Department of Health to speed up responses to media and Full Fact. Which gets me thinking: what form might a press office publishing “data supported press releases” take, cf. a University Expert Press Room or Social Media Releases and the University Press Office, for example?

Fragment – Data Journalism or Data Processing?

A triptych to read and reflect on in the same breath…

String of Rulings Bodes Ill for the Future of Journalism in Europe:

On July 21, 2015, the European Court of Human Rights ruled that making a database of public tax records accessible digitally was illegal because it violated the right to privacy [1]. The judges wrote that publishing an individual’s (already public) data on an online service could not be considered journalism, since no journalistic comment was written alongside it.

This ruling is part of a wider trend of judges limiting what we can do with data online. A few days later, a court of Cologne, Germany, addressed data dumps. In this case, the German state sued a local newspaper that published leaked documents from the ministry of Defense related to the war in Afghanistan. The documents had been published in full so that users could highlight the most interesting lines. The ministry sued on copyright grounds and the judges agreed, arguing that the journalists should have selected some excerpts from the documents to make their point and that publishing the data in its entirety was not necessary [2].

These two rulings assume that journalism must take the form of a person collecting information then writing an article from it. It was true in the previous century but fails to account for current journalistic practices.

ICO: Samaritans Radar failed to comply with Data Protection Act:

It is our view that if organisations collect information from the internet and use it in a way that’s unfair, they could still breach the data protection principles even though the information was obtained from a publicly available source. It is particularly important that organisations should consider the data protection implications if they are planning to use analytics to make automated decisions that could have a direct effect on individuals.

The Labour Party “purge” and social media privacy:

[A news article suggests] that the party has been scouring the internet to find social media profiles of people who have registered. Secondly, it seems to suggest that for people not to have clearly identifiable social media profiles is suspicious.

The first idea, that it’s ‘OK’ to scour the net for social media profiles, then analyse them in detail is one that is all too common. ‘It’s in the public, so it’s fair game’ is the essential argument – but it relies on a fundamental misunderstanding of privacy, and of the way that people behave.

Collecting “public” data and processing or analysing it may bring the actions of the processor into the scope of the Data Protection Act. Currently, the Act affords protections to journalists. But if these protections are eroded, it weakens the ability of journalists to use these powerful investigatory tools.

Robot Journalism in Germany

By chance, I came across a short post by uber-ddj developer Lorenz Matzat (@lorz) on robot journalism over the weekend: Robot journalism: Revving the writing engines. Along with a mention of Narrative Science, it namechecked another company that was new to me: [b]ased in Berlin, Retresco offers a “text engine” that is now used by the German football portal “FussiFreunde”.

A quick scout around brought up this Retresco post on Publishing Automation: An opportunity for profitable online journalism [translated] and their robot journalism pitch, which includes “weekly automatic Game Previews to all amateur and professional football leagues and with the start of the new season for every Game and detailed follow-up reports with analyses and evaluations” [translated], as well as finance and weather reporting.

I asked Lorenz if he was dabbling with such things and he pointed me to AX Semantics (an Aexea GmbH project). It seems their robot football reporting product has been around for getting on for a year or so (Robot Journalism: Application areas and potential [translated]), which makes me wonder how siloed my reading has been in this area.

Anyway, it seems as if AX Semantics have big dreams. Like heralding Media 4.0: The Future of News Produced by Man and Machine:

The starting point for Media 4.0 is a whole host of data sources. They share structured information such as weather data, sports results, stock prices and trading figures. AX Semantics then sorts this data and filters it. The automated systems inside the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion. By pooling pertinent information, the system automatically pulls together an article. Editors tell the system which layout and text design to use so that the length and structure of the final output matches the required media format – with the right headers, subheaders, the right number and length of paragraphs, etc. Re-enter homo sapiens: journalists carefully craft the information into linguistically appropriate wording and liven things up with their own sugar and spice. Using these methods, the AX Semantics system is currently able to produce texts in 11 languages. The finishing touches are added by the final editor, if necessary livening up the text with extra content, images and diagrams. Finally, the text is proofread and prepared for publication.

A key technology bit is the analysis part: “the software then spot patterns in the information using detection techniques that revolve around rule-based semantic conclusion”. Spotting patterns and events in datasets is an area where automated journalism can help navigate the data beat and highlight things of interest to the journalist (see for example Notes on Robot Churnalism, Part I – Robot Writers for other takes on the robot journalism process). If notable features take the form of possible story points, narrative content can then be generated from them.
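
I have no idea how AX Semantics actually implement this, but the general shape of rule-based story point detection is easy enough to sketch: a list of named rules, each pairing a test on the data with a sentence template. Everything below (the rules, the thresholds, the match record and its field names) is my own invention for illustration.

```python
# Each rule: (story point name, test on the match record, sentence template)
RULES = [
    ("big win",
     lambda m: abs(m["home_goals"] - m["away_goals"]) >= 4,
     "{winner} ran out comfortable winners, {hi}-{lo}."),
    ("comeback",
     lambda m: m["winner"] is not None
     and m["half_time_leader"] not in (None, m["winner"]),
     "{winner} came from behind at half time to win {hi}-{lo}."),
    ("goalless draw",
     lambda m: m["home_goals"] == 0 and m["away_goals"] == 0,
     "{home} and {away} played out a goalless draw."),
]


def story_points(match):
    """Return (name, sentence) pairs for every rule the match triggers."""
    m = dict(match)
    if m["home_goals"] > m["away_goals"]:
        m["winner"] = m["home"]
    elif m["away_goals"] > m["home_goals"]:
        m["winner"] = m["away"]
    else:
        m["winner"] = None  # a draw
    m["hi"] = max(m["home_goals"], m["away_goals"])
    m["lo"] = min(m["home_goals"], m["away_goals"])
    return [(name, template.format(**m))
            for name, test, template in RULES if test(m)]


# A made-up result
match = {"home": "Team A", "away": "Team B",
         "home_goals": 1, "away_goals": 5, "half_time_leader": "Team A"}

points = story_points(match)
for name, sentence in points:
    print(name, "-", sentence)
```

Adding a new kind of story point means adding a rule to the list; the detection and the text generation stay separate, so the same flags could just as easily be sent to a journalist as an alert.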

To support the process, it seems as if AX Semantics have been working on a markup language: ATML3 (I’m not sure what it stands for? I’d hazard a guess at something like “Automated Text ML” but could be very wrong…) A private beta seems to be in operation around it, but some hints at tooling are starting to appear in the form of ATML3 plugins for the Atom editor.

One to watch, I think…