Category: Infoskills

Finding Common Phrases or Sentences Across Different Documents

As mentioned in the previous post, I picked up on a nice little challenge from my colleague Ray Corrigan a couple days ago to find common sentences across different documents.

My first, rather naive, thought was to segment each of the docs into sentences and then compare sentences using a variety of fuzzy matching techniques, retaining the ones that sort-of matched. That approach was a bit ropey (I’ll describe it in another post), but whilst pondering it over a dog walk a much neater idea suggested itself – compare n-grams of various lengths over the two documents. At its heart, all we need to do is find the intersection of the n-grams that occur in each document.

So here’s a recipe to do that…

First, we need to get documents into a text form. I started off with PDF docs, but it was easy enough to extract the text using textract.

!pip install textract

import textract
#textract returns bytes, so decode to get a text string
txt = textract.process('ukpga_19840012_en.pdf').decode('utf8')

The next step is to compare docs for a particular size n-gram – the following bit of code finds the common n-grams of a particular size and returns them as a set:

import nltk
from nltk.util import ngrams as nltk_ngrams

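def common_ngram_txt(tokens1,tokens2,size=15):
    print('Checking ngram length {}'.format(size))
    #Generate the set of distinct ngrams of the required size for each doc
    ng1=set(nltk_ngrams(tokens1, size))
    ng2=set(nltk_ngrams(tokens2, size))

    #The common ngrams are the ones that appear in both sets
    match=ng1.intersection(ng2)
    print('..found {}'.format(len(match)))

    return match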
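As a minimal sketch of the same intersection idea without NLTK (the ngrams() helper and the toy sentences are illustrative assumptions, not part of the recipe above), we can build the n-grams by zipping together shifted copies of the token list:

```python
def ngrams(tokens, size):
    """Return the set of all runs of `size` consecutive tokens."""
    return set(zip(*(tokens[i:] for i in range(size))))

# Toy documents, tokenised naively on whitespace
doc1 = "the quick brown fox jumps over the lazy dog".split()
doc2 = "a quick brown fox sat next to the lazy dog".split()

# The common 3-grams are simply the set intersection
common = ngrams(doc1, 3) & ngrams(doc2, 3)
print(sorted(common))
# [('quick', 'brown', 'fox'), ('the', 'lazy', 'dog')]
```

Using sets means repeated n-grams within a document are only counted once, which is all we need if we just want to know which phrases are shared.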

I want to be able to find common ngrams of various lengths, so I started to put together the first fumblings of an n-gram sweeper.

The core idea was really simple – starting with the largest common n-gram, detect increasingly smaller n-grams; then do a concordance report on each of the common ngrams to show how that ngram appeared in the context of each document. (See n-gram / Multi-Word / Phrase Based Concordances in NLTK.)

Rather than generate lots of redundant reports – if I detected the common 3gram “quick brown fox”, I would also find the common ngrams “quick brown” and “brown fox” – I started off with the following heuristic: if a common n-gram is part of a longer common n-gram, ignore it. But this immediately turned up a problem. Consider the following case:

Document 1: the quick brown fox
Document 2: the quick brown fox and the quick brown cat and the quick brown dog

Here, there is a common 4-tuple: the quick brown fox. There is also a common 3-tuple: the quick brown, which a concordance plot would reveal as being found in the context of a cat and a dog as well as a fox. What I really need to do is keep the locations of a common n-gram in the second document that fall outside any longer common n-gram already found there, but drop the locations where it is subsumed by one.

Indexing on token number within the second doc, I need to return something like this:

([('the', 'quick', 'brown', 'fox'),
  ('the', 'quick', 'brown'),
  ('the', 'quick', 'brown')],
 [[0, 3], [10, 12], [5, 7]])

which shows the shorter common n-gram only in the places where it is not part of the longer common n-gram.
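To check the indexing on the fox/cat/dog example, here is a rough pure-Python sketch of the heuristic. The function names are assumptions made for illustration, and whitespace tokenising stands in for a proper tokeniser; like the heuristic described above, it only tests whether the start of a shorter match falls inside a longer match already recorded.

```python
def ngram_set(tokens, size):
    return set(zip(*(tokens[i:] for i in range(size))))

def find_offsets(tokens, phrase):
    """Start positions at which the tuple `phrase` occurs in `tokens`."""
    n = len(phrase)
    return [i for i in range(len(tokens) - n + 1) if tuple(tokens[i:i + n]) == phrase]

def sweep(tokens1, tokens2, ngram_min, ngram_max):
    found, ranges = [], []
    for size in range(ngram_max, ngram_min - 1, -1):  # longest first
        for phrase in sorted(ngram_set(tokens1, size) & ngram_set(tokens2, size)):
            for start in find_offsets(tokens2, phrase):
                # Skip occurrences that start inside a longer match already recorded
                if any(lo <= start <= hi for lo, hi in ranges):
                    continue
                found.append(phrase)
                ranges.append([start, start + size - 1])
    return found, ranges

doc1 = "the quick brown fox".split()
doc2 = "the quick brown fox and the quick brown cat and the quick brown dog".split()
found, ranges = sweep(doc1, doc2, ngram_min=3, ngram_max=4)
print(ranges)
# [[0, 3], [5, 7], [10, 12]]
```

The ranges are the same ones as in the target output, just listed in document order.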

In the following, n_concordance_offset() finds the location of a phrase token list within a document token list. The ngram_sweep_txt() function scans down a range of n-gram lengths, starting with the longest, trying to identify locations that are not contained within an already discovered longer n-gram.

def n_concordance_offset(text,phraseList):
    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())
    #Find the offset for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]
    offsets_norm=[]
    #For each token in the phraselist, find the offsets and rebase them to the start of the phrase
    for i in range(len(phraseList)):
        offsets_norm.append([x-i for x in offsets[i]])
    #We have found the offset of a phrase if the rebased values intersect
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])
    return intersects

def ngram_sweep_txt(txt1,txt2,ngram_min=8,ngram_max=50):
    tokens1 = nltk.word_tokenize(txt1)
    tokens2 = nltk.word_tokenize(txt2)

    text1 = nltk.Text( tokens1 )
    text2 = nltk.Text( tokens2 )

    ngrams=[]
    ranges=[]
    for i in range(ngram_max,ngram_min-1,-1):
        #Find long ngrams first
        newsweep=common_ngram_txt(tokens1,tokens2,size=i)
        for m in newsweep:
            #Find where the common ngram starts in the second doc
            localoffsets=n_concordance_offset(text2,m)

            #We need to avoid the problem of masking shorter ngrams by already found longer ones
            #eg if there is a common 3gram in a doc2 4gram, but the 4gram is not in doc1
            #so we need to see if the current ngram is contained within the doc index of longer ones already found
            for o in localoffsets:
                fnd=False
                for r in ranges:
                    if o>=r[0] and o<=r[1]:
                        fnd=True
                if not fnd:
                    #Record the ngram along with its first and last token index in doc2
                    ranges.append([o,o+i-1])
                    ngrams.append(m)
    return ngrams,ranges,txt1,txt2

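def ngram_sweep(fn1,fn2,ngram_min=8,ngram_max=50):
    #Extract the text from each doc, then sweep the text for common ngrams
    txt1 = textract.process(fn1).decode('utf8')
    txt2 = textract.process(fn2).decode('utf8')
    ngrams,ranges,txt1,txt2=ngram_sweep_txt(txt1,txt2,ngram_min,ngram_max)
    return ngrams,ranges,txt1,txt2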

What I really need to do is automatically detect the largest n-gram and work back from there, perhaps using a binary search starting with an n-gram the size of the number of tokens in the shortest doc… But that’s for another day…
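A binary search should work here because the test is monotone: if two docs share an n-gram of length n, they necessarily share one of length n-1 (just drop the last token). The following is only a sketch of that idea, with hypothetical helper names and naive whitespace tokenising; I haven't tried it on the real documents.

```python
def ngram_set(tokens, size):
    return set(zip(*(tokens[i:] for i in range(size))))

def has_common_ngram(tokens1, tokens2, size):
    return bool(ngram_set(tokens1, size) & ngram_set(tokens2, size))

def longest_common_ngram_size(tokens1, tokens2):
    """Largest n for which the docs share an n-gram (0 if they share no token)."""
    lo, hi = 0, min(len(tokens1), len(tokens2))
    while lo < hi:
        mid = (lo + hi + 1) // 2
        if has_common_ngram(tokens1, tokens2, mid):
            lo = mid      # a common n-gram this long exists, so try longer
        else:
            hi = mid - 1  # none this long, so none longer either
    return lo

doc1 = "the quick brown fox".split()
doc2 = "the quick brown fox and the quick brown cat and the quick brown dog".split()
print(longest_common_ngram_size(doc1, doc2))
# 4
```

The value returned could then be used as ngram_max in the sweep, rather than guessing an upper limit.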

Having discovered common phrases, we need to report them. The following n_concordance() function (based on this) does just that; the concordance_reporter() function manages the outputs.

import textract

def n_concordance(txt,phrase,left_margin=5,right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)

    #Assumes the phrase is a space separated string of tokens, as produced by ' '.join(ngram)
    phraseList=phrase.split(' ')

    intersects= n_concordance_offset(text,phraseList)
    #Grab the phrase plus margin tokens either side, without running off the start of the doc
    concordance_txt = ([text.tokens[max(offset-left_margin,0):offset+len(phraseList)+right_margin]
                        for offset in intersects])
    outputs=[''.join([x+' ' for x in con_sub]) for con_sub in concordance_txt]
    return outputs

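def concordance_reporter(fn1='Draft_Equipment_Interference_Code_of_Practice.pdf',
                         fn2='ukpga_19970050_en.pdf', fo='ngram_report.txt',
                         ngram_min=8, ngram_max=50, left_margin=5, right_margin=5):
    #Assumed defaults: fn2 matches the sample report below, fo is a placeholder filename,
    #the ngram range and margins reuse the defaults of ngram_sweep() and n_concordance()
    #Create (or empty) the report file
    f=open(fo, 'w+')
    f.close()
    print('Handling {}'.format(fo))
    ngrams,strings, txt1,txt2=ngram_sweep(fn1,fn2,ngram_min,ngram_max)
    #Remove any redundancy in the ngrams...
    ngrams=set(ngrams)
    with open(fo, 'a') as outfile:
        outfile.write('REPORT FOR ({} and {}\n\n'.format(fn1,fn2))
        print('found {} ngrams in that range...'.format(len(ngrams)))
        for m in ngrams:
            mt=' '.join(m)
            #The common phrase, then its context in doc 1 (>>>>>) and in doc 2 (<<<<<)
            outfile.write('{}\n\n'.format(mt))
            for c in n_concordance(txt1,mt,left_margin,right_margin):
                outfile.write('>>>>>{}\n\n'.format(c))
            for c in n_concordance(txt2,mt,left_margin,right_margin):
                outfile.write('<<<<<{}\n\n'.format(c))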

Finally, the following function makes it easier to compare a document of interest with several other documents:

for f in ['Draft_Investigatory_Powers_Bill.pdf','ukpga_19840012_en.pdf',

Here’s an example of the sort of report it produces:

REPORT FOR (Draft_Equipment_Interference_Code_of_Practice.pdf and ukpga_19970050_en.pdf

concerning an individual ( whether living or dead ) who can be identified from it

>>>>>personal information is information held in confidence concerning an individual ( whether living or dead ) who can be identified from it , and the material in question relates 

<<<<<section `` personal information '' means information concerning an individual ( whether living or dead ) who can be identified from it and relating—- ( a ) to his 

satisfied that the action authorised by it is no longer necessary .

>>>>>must cancel a warrant if he is satisfied that the action authorised by it is no longer necessary . 4.13 The person who made the application 

<<<<<an authorisation given in his absence if satisfied that the action authorised by it is no longer necessary . ( 6 ) If the authorising officer 

<<<<<cancel an authorisation given by him if satisfied that the action authorised by it is no longer necessary . ( 5 ) An authorising officer shall 

involves the use of violence , results in substantial financial gain or is conduct by a large number of persons in pursuit of a common purpose

>>>>>one or more offences and : It involves the use of violence , results in substantial financial gain or is conduct by a large number of persons in pursuit of a common purpose ; or a person aged twenty-one or 

<<<<<if , — ( a ) it involves the use of violence , results in substantial financial gain or is conduct by a large number of persons in pursuit of a common purpose , or ( b ) the offence 

to an express or implied undertaking to hold it in confidence

>>>>>in confidence if it is held subject to an express or implied undertaking to hold it in confidence or is subject to a restriction on 

>>>>>in confidence if it is held subject to an express or implied undertaking to hold it in confidence or it is subject to a restriction 

<<<<<he holds it subject— ( a ) to an express or implied undertaking to hold it in confidence , or ( b ) to a 

no previous convictions could reasonably be expected to be sentenced to

>>>>>a person aged twenty-one or over with no previous convictions could reasonably be expected to be sentenced to three years’ imprisonment or more . 4.5 

<<<<<attained the age of twenty-one and has no previous convictions could reasonably be expected to be sentenced to imprisonment for a term of three years 

considers it necessary for the authorisation to continue to have effect for the purpose for which it was

>>>>>to have effect the Secretary of State considers it necessary for the authorisation to continue to have effect for the purpose for which it was given , the Secretary of State may 

<<<<<in whose absence it was given , considers it necessary for the authorisation to continue to have effect for the purpose for which it was issued , he may , in writing 

The first line in a block is the common phrase, the >>> elements are how the phrase appears in the first doc, the <<< elements are how it appears in the second doc. The width of the right and left margins of the contextual / concordance plot are parameterised and can be easily increased.

This seems such a basic problem – finding common phrases in different documents – I’d have expected there to be a standard solution to this? But in the quick search I tried, I couldn’t find one? It was quite a fun puzzle to play with though, and offers lots of scope for improvement (I suspect it’s a bit ropey when it comes to punctuation, for example). But it’s a start…:-)

There’s lots could be done on a UI front, too. For example, it’d be nice to be able to link documents, so you can click through from the first to show where the phrase came from in the second. But to do that requires annotating the original text, which in turn means being able to accurately identify where in a doc a token sequence appears. But building UIs is hard and time consuming… it’d be so much easier if folk could learn to use a code line UI!;-)

If you know of any “standard” solutions or packages for dealing with this sort of problem, please let me know via the comments:-)

PS The code could also do with some optimisation – eg if we know we’re repeatedly comparing against a base doc, it’s foolish to keep opening and tokenising the base doc…
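One cheap way of avoiding the repeated work might be to cache the tokens for each file. This is only a sketch: extract_text() is a stand-in for the textract call, tokens_for() is a hypothetical helper, and split() stands in for the NLTK tokeniser.

```python
from functools import lru_cache

load_count = {}

def extract_text(filename):
    # Stand-in for textract.process(filename).decode('utf8')
    load_count[filename] = load_count.get(filename, 0) + 1
    return 'some text extracted from {}'.format(filename)

@lru_cache(maxsize=None)
def tokens_for(filename):
    # Return a tuple so the cached value can't be modified by accident
    return tuple(extract_text(filename).split())

base = 'base.pdf'
for other in ['a.pdf', 'b.pdf', 'c.pdf']:
    tokens1, tokens2 = tokens_for(base), tokens_for(other)

print(load_count[base])
# 1
```

The base doc is opened and tokenised once, however many documents it is compared against.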

Slackbot Data Wire, Initial Sketch

Via a round-up post from Matt Jukes/@jukesie (Interesting elsewhere: Bring on the Bots), I was prompted to look again at Slack. OnTheWight’s Simon Perry originally tried to hook me in to Slack, but I didn’t need another place to go to check messages. Simon had also mentioned, in passing, how it would be nice to be able to get data alerts into Slack, but I’d not really followed it through, until the weekend, when I read again @jukesie’s comment that “what I love most about it [Slack] is the way it makes building simple, but useful (or at least funny), bots a breeze.”

After a couple of aborted attempts, I found a couple of python libraries to wrap the Slack API: pyslack and python-rtmbot (the latter also requires python-slackclient).

Using pyslack to send a message to Slack was pretty much a one-liner:

#Create API token at

#!pip install pyslack
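import slack
import slack.chat
#token should hold the API token string created for your Slack team
slack.api_token = token
slack.chat.post_message('#general', 'Hello world', username='testbot')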


I was quite keen to see how easy it would be to reuse one or more of my data2text sketches as the basis for an autoresponder that could accept a local data request from a Slack user and provide a localised data response using data from a national dataset.

I opted for a JSA (Jobseekers Allowance) textualiser (as used by OnTheWight and reported here: Isle of Wight innovates in a new area of Journalism and also in this piece: How On The Wight is experimenting with automation in news) that I seem to have bundled up into a small module, which would let me request JSA figures for a council based on a council identifier. My JSA textualiser module has a couple of demos hardwired into it (one for the Isle of Wight, one for the UK) so I could easily call on those.

To put together an autoresponder, I used the python-rtmbot, putting the botcode folder into a plugins folder in the python-rtmbot code directory.

The code for the bot is simple enough:

from nomis import *
import nomis_textualiser as nt
import pandas as pd


import time
crontable = []
outputs = []

def process_message(data):

	text = data["text"]
	if text.startswith("JSA report"):
		if 'IW' in text: outputs.append([data['channel'], nt.otw_rep1(nt.iwCode)])
		elif 'UK' in text: outputs.append([data['channel'], nt.otw_rep1(nt.ukCode)])
	if text.startswith("JSA rate"):
		if 'IW' in text: outputs.append([data['channel'], nt.rateGetter(nt.iwCode)])
		elif 'UK' in text: outputs.append([data['channel'], nt.rateGetter(nt.ukCode)])


Rooting around, I also found a demo I’d put together for automatically looking up a council code from a Johnston Press newspaper title using a lookup table I’d put together at some point (I don’t remember how!).

Which meant that by using just a tiny dab of glue I could extend the bot further to include a lookup of JSA figures for a particular council based on the local rag JP covering that council. And the glue is this, added to the process_message() function definition:

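	def getCodeForTitle(title):
		#Look up the council code for this newspaper title in the JP titles lookup table
		return code

	if text.startswith("JSA JP"):
		title=text.split('JSA JP')[1].strip()
		code=getCodeForTitle(title)

		outputs.append([data['channel'], nt.otw_rep1(code)])
		outputs.append([data['channel'], nt.rateGetter(code)])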
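The lookup step itself can be sketched along the following lines. Everything here is an assumption for illustration: the column names, the made-up titles and codes, and the helper functions are not taken from my actual lookup table.

```python
import csv
import io

# Made-up lookup table; real titles and codes would come from the actual file
LOOKUP_CSV = """title,council_code
Example Gazette,X0000001
Sample Herald,X0000002
"""

def load_title_codes(csv_file):
    """Map lower-cased newspaper title to council code."""
    return {row['title'].lower(): row['council_code']
            for row in csv.DictReader(csv_file)}

def code_for_command(text, title_codes, prefix='JSA JP'):
    """Return the council code for a 'JSA JP <title>' message, or None."""
    if not text.startswith(prefix):
        return None
    title = text[len(prefix):].strip()
    return title_codes.get(title.lower())

title_codes = load_title_codes(io.StringIO(LOOKUP_CSV))
print(code_for_command('JSA JP Example Gazette', title_codes))
# X0000001
```

Returning None for an unrecognised title gives the bot the chance to reply with a helpful message rather than falling over.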


This is quite an attractive route, I think, for national newsgroups: anyone in the group can create a bot to generate press release style copy at a local level from a national dataset, and then make it available to reporters from other titles in the group – who can simply key in by news title.

But it could work equally well for a community network of hyperlocals, or councils – organisations that are locally based and individually do the same work over and over again on national datasets.

The general flow is something a bit like this:


which has a couple of very obvious pain points:


Firstly, finding the local data from the national data, cleaning the data, etc etc. Secondly, making some sort of sense of the data, and then doing some proper journalistic work writing a story on the interesting bits, putting them into context and explaining them, rather than just relaying the figures.

What the automation route does is to remove some of the pain, and allow the journalist to work up the story from the facts, presented informatively.


This is a model I’m currently trying to work up with OnTheWight and one I’ll be talking about briefly at the What next for community journalism? event in Cardiff on Wednesday [slides].

PS Hmm.. this just in, The Future of the BBC 2015 [PDF] [announcement].

Local Accountability Reporting Service

Under this proposal, the BBC would allocate licence fee funding to invest in a service that reports on councils, courts and public services in towns and cities across the UK. The aim is to put in place a network of 100 public service reporters across the country.

Reporting would be available to the BBC but also, critically, to all reputable news organisations. In addition, while it would have to be impartial and would be run by the BBC, any news organisation — news agency, independent news provider, local paper as well as the BBC itself—could compete to win the contract to provide the reporting team for each area.

A shared data journalism centre

Recent years have seen an explosion in data journalism. New stories are being found daily in government data, corporate data, data obtained under the Freedom of Information Act and increasing volumes of aggregated personalised data. This data offers new means of sourcing stories and of holding public services, politicians and powerful organisations to account.

We propose to create a new hub for data journalism, which serves both the BBC and makes available data analysis for news organisations across the UK. It will look to partner a university in the UK, as the BBC seeks to build a world-class data journalism facility that informs local, national and global news coverage.

A News Bank to syndicate content

The BBC will make available its regional video and local audio pieces for immediate use on the internet services of local and regional news organisations across the UK.

Video can be time-consuming and resource-intensive to produce. The News Bank would make available all pieces of BBC video content produced by the BBC’s regional and local news teams to other media providers. Subject to rights and further discussion with the industry we would also look to share longer versions of content not broadcast, such as sports interviews and press conferences.

Content would be easily searchable by other news organisations, making relevant material available to be downloaded or delivered by the outlets themselves, or for them to simply embed within their own websites. Sharing of content would ensure licence fee payers get maximum value from their investment in local journalism, but it would also provide additional content to allow news organisations to strengthen their offer to audiences without additional costs. We would also continue to enhance linking out from BBC Online, building on the work of Local Live.

Hmm… Share content – or share “pre-content”. Use BBC expertise to open up the data to more palatable forms, forms that the BBC’s own journalists can work with, but also share those intermediate forms with the regionals, locals and hyperlocals?

Data Literacy – Do We Need Data Scientists, Or Data Technicians?

One of the many things I vaguely remember studying from my school maths days are the various geometric transformations – rotations, translations and reflections – as applied particularly to 2D shapes. To a certain extent, knowledge of these operations helps me use the limited Insert Shape options in Powerpoint, as I pick shapes and arrows from the limited palette available and then rotate and reflect them to get the orientation I require.

But of more pressing concern to me on a daily basis is the need to engage in data transformations, whether as summary statistic transformations (finding the median or mean values within several groups of the same dataset, for example, or calculating percentage differences away from within group means across group members for multiple groups), or shape transformations: reshaping a dataset from a wide to a long format, for example, melting a subset of columns, or recasting a molten dataset into a wider format. (If that means nothing to you, I’m not surprised. But if you’ve ever worked with a dataset and copied and pasted data from multiple columns in to multiple rows to get it to look right/into the shape you want, you’ve suffered by not knowing how to reshape your dataset!)
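To make those terms a little more concrete, here is a minimal pure-Python sketch of melting a wide table into a long one, recasting it back again, and calculating a within group mean. The numbers are made up and the melt() and cast() helpers are illustrative assumptions; in practice a data library would provide equivalents.

```python
from collections import defaultdict
from statistics import mean

# Wide format: one row per area, one column per year (made-up numbers)
wide = [
    {'area': 'A', '2013': 10, '2014': 14},
    {'area': 'B', '2013': 20, '2014': 26},
]

def melt(rows, id_col, value_cols, var_name='variable', value_name='value'):
    """Wide to long: one output row per (input row, value column) pair."""
    return [{id_col: row[id_col], var_name: col, value_name: row[col]}
            for row in rows for col in value_cols]

def cast(rows, id_col, var_name='variable', value_name='value'):
    """Long to wide: gather the rows for each id back into a single row."""
    out = defaultdict(dict)
    for row in rows:
        out[row[id_col]][id_col] = row[id_col]
        out[row[id_col]][row[var_name]] = row[value_name]
    return list(out.values())

long_rows = melt(wide, 'area', ['2013', '2014'], var_name='year')

# Summary transformation: the mean value within each year group
by_year = defaultdict(list)
for row in long_rows:
    by_year[row['year']].append(row['value'])
year_means = {year: mean(values) for year, values in by_year.items()}
```

The long format has four rows, one for each area/year pair, which is the shape that makes the grouped calculation straightforward.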

Even though I tinker with data most days, I tend to avoid all but the simplest statistics. I know enough to know I don’t understand most statistical arcana, but I suspect there are folk who do know how to do that stuff properly. But what I do know from my own tinkering is that before I can run even the simplest stats, I often have to do a lot of work getting original datasets into a state where I can actually start to work with them.

The same stumbling blocks presumably present themselves to the data scientists and statisticians who not only know how to drive arcane statistical tests but also understand how to interpret and caveat them. Which is where tools like Open Refine come in…

Further down the pipeline are the policy makers and decision makers who use data to inform their policies and decisions. I don’t see why these people should be able to write a regexp, clean a dirty dataset, denormalise a table, write a SQL query, run a weird form of multivariate analysis, or reshape a dataset and then create a novel data visualisation from it based on a good understanding of the principles of The Grammar of Graphics; but I do think they should be able to pick up on the stories contained within the data and critique the way it is presented, as well as how the data was sourced and the operations applied to it during analysis, in addition to knowing how to sensibly make use of the data as part of the decision making or policy making process.

A recent Nesta report (July 2015) on Analytic Britain: Securing the right skills for the data-driven economy [PDF] gave a shiny “analytics this, analytics that” hype view of something or other (I got distracted by the analytics everything overtone), and was thankfully complemented by a more interesting Universities UK report (July 2015) on Making the most of data: Data skills training in English universities [PDF].

In its opening summary, the UUK report found that “[t]he data skills shortage is not simply characterised by a lack of recruits with the right technical skills, but rather by a lack of recruits with the right combination of skills”, and also claimed that “[m]any undergraduate degree programmes teach the basic technical skills needed to understand and analyse data”. Undergrads may learn basic stats, but I wonder how many of them are comfortable with the hand tools of data wrangling that you need to be familiar with if you ever want to turn real data into something you can actually work with? That said, the report does give a useful review of data skills developed across a range of university subject areas.

(Both reports championed the OU-led urban data school, though I have to admit I can’t find any resources associated with that project? Perhaps the OU’s Smart Cities MOOC on FutureLearn is related to it? As far as I know, OUr Learn to Code for Data Analysis MOOC isn’t?)

From my perspective, I think it’d be a start if folk learned:

  • how to read simple charts;
  • how to identify meaningful stories in charts;
  • how to use data stories to inform decision making.

I also worry about the day-to-day practicalities of working with data in a hands on fashion and the roles associated with various data related tasks that fall along any portrayal of the data pipeline. For example, off the top of my head I think we can distinguish between things like:

  • data technician roles – for example, reshaping and cleaning datasets;
  • data engineering roles – managing storage, building and indexing databases, for example;
  • data analyst/science and data storyteller roles – that is, statisticians who can work with clean and well organised datasets to pull out structures, trends and patterns from within them;
  • data graphics/visualisation practitioners – who have the eye and the skills for developing visual ways of uncovering and relating the stories, trends, patterns and structures hidden in datasets, perhaps in support of the analyst, perhaps in support of the decision-making end-user;
  • and data policymakers and data driven decision makers, who can phrase questions in such a way that makes it possible to use data to inform the decision or policymaking process, even if they don’t have the skills to wrangle or analyse the data that they can then use.

I think there is also a role for data questionmasters who can phrase and implement useful and interesting queries that can be applied to datasets, which might also fall to the data technician. I also see a role for data technologists, who are perhaps strong as a data technician, but with an appreciation of the engineering, science, visualisation and decision/policy making elements, though not necessarily strong as a practitioner in any of those camps.

(Data carpentry as a term is also useful, describing a role that covers many of the practical skills requirements I’d associate with a data technician, but that additionally supports the notion of “data craftsmanship”? A lot of data wrangling does come down to being a craft, I think, not least because the person working at the raw data end of the lifecycle may often develop specialist, hand crafted tools for working with the data that an analyst would not be able to justify spending the development time on.)

Here’s another carving of the data practitioner roles space, this time from Liz Lyon & Aaron Brenner (Bridging the Data Talent Gap: Positioning the iSchool as an Agent for Change, International Journal of Digital Curation, 10:1 (2015)):


The Royal Statistical Society Data Manifesto [PDF] (September 2014) argues for giving “[p]oliticians, policymakers and other professionals working in public services (such as regulators, teachers, doctors, etc.) … basic training in data handling and statistics to ensure they avoid making poor decisions which adversely affect citizens” and suggests that we need to “prepare for the data economy” by “skill[ing] up the nation”:

We need to train teachers from primary school through to university lecturers to encourage data literacy in young people from an early age. Basic data handling and quantitative skills should be an integral part of the taught curriculum across most A level subjects. … In particular, we should ensure that all students learn to handle and interpret real data using technology.

I like the sentiment of the RSS manifesto, but fear the Nesta buzzword hype chasing and the conservatism of the universities (even if the UUK report is relatively open minded).

On the one hand, we often denigrate the role of the technician, but I think technical difficulties associated with working with real data are often a real blocker; which means we either skill up ourselves, or recognise the need for skilled data technicians. On the other, I think there is a danger of hyping “analytics this” and “data science that” – even if only as part of debunking it – because it leads us away from the more substantive point that analytics this, data science that is actually about getting numbers into a form that tell stories that we can use to inform decisions and policies. And that’s more about understanding patterns and structures, as well as critiquing data collection and analysis methods, than it is about being a data technician, engineer, analyst, geek, techie or quant.

Which is to say – if we need to develop data literacy, what does that really mean for the majority?

PS Heh heh – Kin Lane captures further life at the grotty end of the data lifecycle: Being a Data Janitor and Cleaning Up Data Portability Vomit.

Converting Spreadsheet Rows to Text Based Summary Reports Using OpenRefine

In Writing Each Row of a Spreadsheet as a Press Release? I demonstrated how we could generate a simple textual report template that could “textualise” separate rows of a spreadsheet. This template could be applied to each row from a subset of rows to produce a simple human readable view of the data contained in each of those rows. I picked up on the elements of this post in Robot Journalists or Robot Press Secretaries?, where I reinforced the idea that such an approach was of a similar kind to the approach used in mail merge strategies supported by many office suites.

It also struck me that we could use OpenRefine’s custom template export option to generate a similar sort of report. So in this post I’ll describe a simple recipe for recreating the NHS Complaints review reports from a couple of source spreadsheets using OpenRefine.

This is just a recasting of the approach demonstrated in the Writing Each Row… post, and more fully described in this IPython notebook, so even if you don’t understand Python, it’s probably worth reviewing those just to get a feeling of the steps involved.

To start with, let’s see how we might generate a basic template from the complaints CSV file, loaded in with the setting to parse numerical columns as such.


The default template looks something like this:

default template

We can see how the template provides a header slot for the start of the output, a template applied to each row, a separator to split the rows, and a footer.

The jsonize function makes sure the output is suitable for output as a JSON file. We just want to generate text so we can forget that.

Here’s the start of a simple report…

Report for {{cells["Practice_Code"].value}} ({{cells["Year"].value}}):

  Total number of written complaints received:
  - by area: {{cells["Total number of written complaints received"].value}} (of which, {{cells["Total number of written complaints upheld"].value}} upheld)
  - by subject: {{cells["Total number of written complaints received 2"].value}} (of which, {{cells["Total number of written complaints upheld 2"].value}} upheld)

custom_export _start

The double braces ({{ }}) allow you to access GREL statements. Outside the braces, the content is treated as text.

Note that the custom template doesn’t get saved… I tend to write the custom templates in a text editor, then copy and paste them into OpenRefine.

We can also customise the template with some additional logic using the if(CONDITION, TRUE_ACTION, FALSE_ACTION) construction. For example, we might flag a warning that a lot of complaints were upheld:

openrefine template warning
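The same sort of conditional logic can be sketched in Python rather than GREL, which may help if you are more used to reading code than templates. The 50% threshold, the warning text and the example row are all assumptions made for illustration; the column names are the ones used in the template above.

```python
def upheld_warning(received, upheld, threshold=0.5):
    """Mimic if(CONDITION, TRUE_ACTION, FALSE_ACTION): warning text or ''."""
    return ' WARNING: high proportion upheld' if received and upheld / received > threshold else ''

def report(row):
    received = row['Total number of written complaints received']
    upheld = row['Total number of written complaints upheld']
    return 'Report for {} ({}): {} complaints received, of which {} upheld.{}'.format(
        row['Practice_Code'], row['Year'], received, upheld,
        upheld_warning(received, upheld))

# Made-up row for illustration only
row = {'Practice_Code': 'X00000', 'Year': 2015,
       'Total number of written complaints received': 10,
       'Total number of written complaints upheld': 8}
print(report(row))
```

The check on received guards against dividing by zero for practices that received no complaints.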

The original demonstration pulled in additional administrative information (practice name and address, for example) from another source spreadsheet. Merging Datasets with Common Columns in Google Refine describes a recipe for merging in data from another dataset. In this case, if our source is the epraccur spreadsheet, we can create an OpenRefine project from the epraccur spreadsheet (use no lines as the header – it doesn’t have a header row) and then merge in data from the epraccur project into the complaints project using the practice code (Column 1 in the epraccur project) as the key column used to add an additional practice name column based on the Practice_Code column in the complaints project – cell.cross("epraccur xls", "Column 1").cells["Column 2"].value[0]

Note that columns can only be merged in one column at a time.

In order to filter the rows so we can generate reports for just the Isle of Wight, we also need to merge in the Parent Organisation Code (Column 15) from the epraccur project. To get Isle of Wight practices, we could then filter on code 10L. If we then used our custom exporter template, we could get just textual reports for the rows corresponding to Isle of Wight GP practices.

nhs openrefine filter

Teasing things apart a bit, we also start to get a feel for a more general process. Firstly, we can create a custom export template to generate a textual representation of each row in a dataset. Secondly, we can use OpenRefine’s filtering tools to select which rows we want to generate reports from, and order them appropriately. Thirdly, we could also generate new columns containing “red flags” or news signals associated with particular rows, and produce a weighted sum column on which to rank items in terms of newsworthiness. We might also want to merge in additional data columns from other sources, and add elements from those in to the template. Finally, we might start to refine the export template further to include additional logic and customisation of the news release output.
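The weighted sum idea might look something like the following sketch, in which the red flag column names, the weights and the rows are all hypothetical:

```python
# Hypothetical red-flag columns (1 = flag raised) and their weights
WEIGHTS = {'many_upheld': 3, 'big_rise': 2, 'above_average': 1}

def newsworthiness(row, weights=WEIGHTS):
    """Weighted sum of the red flags raised for a row."""
    return sum(weight * row.get(flag, 0) for flag, weight in weights.items())

def rank(rows, weights=WEIGHTS):
    """Order rows from most to least newsworthy."""
    return sorted(rows, key=lambda r: newsworthiness(r, weights), reverse=True)

rows = [
    {'id': 'row1', 'many_upheld': 0, 'big_rise': 1, 'above_average': 1},
    {'id': 'row2', 'many_upheld': 1, 'big_rise': 0, 'above_average': 1},
    {'id': 'row3', 'many_upheld': 0, 'big_rise': 0, 'above_average': 0},
]
ranked = [r['id'] for r in rank(rows)]
print(ranked)
# ['row2', 'row1', 'row3']
```

Choosing the weights is an editorial judgement, of course, rather than a technical one.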

See also Putting Points on Maps Using GeoJSON Created by Open Refine for a demo of how to generate a geojson file using the OpenRefine custom template exporter as part of a route to getting points onto a map.

Fragment – Data Journalism or Data Processing?

A triptych to read and reflect on in the same breath…

String of Rulings Bodes Ill for the Future of Journalism in Europe:

On July 21, 2015, the European Court of Human Rights ruled that making a database of public tax records accessible digitally was illegal because it violated the right to privacy [1]. The judges wrote that publishing an individual’s (already public) data on an online service could not be considered journalism, since no journalistic comment was written alongside it.

This ruling is part of a wider trend of judges limiting what we can do with data online. A few days later, a court of Cologne, Germany, addressed data dumps. In this case, the German state sued a local newspaper that published leaked documents from the ministry of Defense related to the war in Afghanistan. The documents had been published in full so that users could highlight the most interesting lines. The ministry sued on copyright grounds and the judges agreed, arguing that the journalists should have selected some excerpts from the documents to make their point and that publishing the data in its entirety was not necessary [2].

These two rulings assume that journalism must take the form of a person collecting information then writing an article from it. It was true in the previous century but fails to account for current journalistic practices.

ICO: Samaritans Radar failed to comply with Data Protection Act:

It is our view that if organisations collect information from the internet and use it in a way that’s unfair, they could still breach the data protection principles even though the information was obtained from a publicly available source. It is particularly important that organisations should consider the data protection implications if they are planning to use analytics to make automated decisions that could have a direct effect on individuals.

The Labour Party “purge” and social media privacy:

[A news article suggests] that the party has been scouring the internet to find social media profiles of people who have registered. Secondly, it seems to suggest that for people not to have clearly identifiable social media profiles is suspicious.

The first idea, that it’s ‘OK’ to scour the net for social media profiles, then analyse them in detail is one that is all too common. ‘It’s in the public, so it’s fair game’ is the essential argument – but it relies on a fundamental misunderstanding of privacy, and of the way that people behave.

Collecting “public” data and processing or analysing it may bring the actions of the processor into the scope of the Data Protection Act. Currently, the Act affords protections to journalists. But if these protections are eroded, it weakens the ability of journalists to use these powerful investigatory tools.

IPython Markdown Opportunities in IPython Notebooks and RStudio

One of the reasons I started working on the Wrangling F1 Data With R book was to see what the Rmd (RMarkdown) workflow was like. Rmd allows you to combine markdown and R code in the same document, as well as executing the code blocks and then displaying the results of that code execution inline in the output document.


As well as rendering to HTML, we can generate markdown (md is actually produced as the interim step to HTML creation), PDF output documents, etc etc.

One thing I’d love to be able to do in the RStudio/RMarkdown environment is include – and execute – Python code. Does a web search to see what Python support there is in R… Ah, it seems it does it already… (how did I miss that?!)


ADDED: Unfortunately, it seems as if Python state is not persisted between separate python chunks – instead, each chunk is run as a one-off inline python command. However, it seems as if there could be a way round this, which is to use a persistent IPython session; and the knitron package looks like just the thing for supporting that.

So that means in RStudio, I could use knitr and Rmd to write a version of Wrangling F1 Data With Python.

Of course, it would be nicer if I could write such a book in an everyday python environment – such as in an IPython notebook – that could also execute R code (just to be fair;-)

I know that we can already use cell magic to run R in an IPython notebook:


…so that’s that part of the equation.

And the notebooks do already allow us to mix markdown cells and code blocks/output. The default notebook presentation style is to show the code cells with the numbered In []: and Out []: block numbering, but it presumably only takes a small style extension or customisation to suppress that? And another small extension to add the ability to hide a code cell and just display the output?

So what is it that (to my mind at least) makes RStudio a nicer writing environment? One reason is the ability to write the Rmarkdown simply as Rmarkdown in a simple text editor environment. Another is the ability to inline R code and display its output in-place.

Taking that second point first, the ability to do better inlining in IPython notebooks – it looks like this is just what the python-markdown extension does:


But how about the ability to write some sort of pythonMarkdown and then open in a notebook? Something like ipymd, perhaps…?


What this seems to do is allow you to open an IPython-markdown document as an IPython notebook (in other words, it replaces the ipynb JSON document with an ipymd markdown document…). To support the document creation aspects better, we just need an exporter that removes the code block numbering and trivially allows code cells to be marked as hidden.

Now I wonder… what would it take to be able to open an Rmd document as an IPython notebook? Presumably just the ability to detect the code language, and then import the necessary magics to handle its execution? It’d be nice if it could cope with inline code, e.g. using the python-markdown magic too?
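To get a feel for how much work that might be, here is a rough sketch that splits an Rmd-style document into notebook cells, using the chunk language to decide whether a cell magic line is needed. It is emphatically not how ipymd works; the function names are my own, the MAGICS mapping is an assumption (%%R is the rpy2 cell magic, which needs %load_ext rpy2.ipython in the notebook first), inline code and YAML front matter are ignored, and the output only roughly follows the nbformat 4 layout without being validated against it.

```python
import json
import re

FENCE = "`" * 3  # three backticks, built this way to keep the listing readable

# Matches knitr-style chunks: a fence followed by {lang options}, the code, a closing fence.
CHUNK = re.compile(
    r"^" + FENCE + r"\{(\w+)[^}]*\}[ \t]*\n(.*?)^" + FENCE + r"[ \t]*$",
    re.MULTILINE | re.DOTALL,
)

# Assumed mapping from chunk language to a cell magic line.
# Languages without an entry (python, say) are left as plain code.
MAGICS = {"r": "%%R"}


def markdown_cell(text):
    return {"cell_type": "markdown", "metadata": {}, "source": text}


def code_cell(lang, code):
    magic = MAGICS.get(lang.lower())
    source = code if magic is None else magic + "\n" + code
    return {"cell_type": "code", "metadata": {}, "source": source,
            "outputs": [], "execution_count": None}


def rmd_to_notebook(text):
    """Split Rmd-style text into markdown and code cells, in document order."""
    cells, pos = [], 0
    for match in CHUNK.finditer(text):
        prose = text[pos:match.start()].strip()
        if prose:
            cells.append(markdown_cell(prose))
        cells.append(code_cell(match.group(1), match.group(2).rstrip("\n")))
        pos = match.end()
    tail = text[pos:].strip()
    if tail:
        cells.append(markdown_cell(tail))
    return {"cells": cells, "metadata": {}, "nbformat": 4, "nbformat_minor": 4}


# A tiny made-up document; `laps` is a hypothetical R data frame.
sample = "\n".join([
    "# Lap times",
    "",
    FENCE + "{r echo=FALSE}",
    "summary(laps)",
    FENCE,
    "",
    "Some closing prose.",
])

notebook = rmd_to_notebook(sample)
print(json.dumps(notebook, indent=1))
```

Chunk options such as echo=FALSE are thrown away here; a fuller version might copy them into the cell metadata so that an exporter could act on them.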

Exciting times could be ahead:-)

Authoring Dynamic Documents in IPython / Jupyter Notebooks?

One of the reasons I started writing the Wrangling F1 Data With R book was to see how it felt writing combined text, code and code output materials in the RStudio/RMarkdown context. For those of you that haven’t tried it, RMarkdown lets you insert executable code elements inside a markdown document, either as code blocks or inline. The knitr library can then execute the code and display the code output (which includes tables and charts) and pandoc transforms the output to a desired output document format (such as HTML, or PDF, for example). And all this at the click of a single button.

In IPython (now Jupyter) notebooks, I think we can start to achieve a similar effect using a combination of extensions. For example:

  • python-markdown allows you to embed (and execute) python code inline within a markdown cell by enclosing it in double braces (for example, I could say “{{ print('hello world') }}”);
  • hide_input_all is an extension that will hide code cells in a document and just display their executed output; it would be easy enough to tweak this extension to allow a user to select which cells to show and hide, capturing that cell information as cell metadata;
  • Readonly allows you to “lock” a cell so that it cannot be edited; using a notebook server that implements this extension means you can start to protect against accidental changes being made to a cell within a particular workflow; in a journalistic context, assigning a quote to a python variable, locking that code cell, and then referencing that quote/variable in a python-markdown cell might be one way of working, for example.
  • Printview-button will call nbconvert to generate an HTML version of the current notebook – however, I suspect this does not respect the extension based customisations that operate on cell metadata. To do that, I guess we need to generate our output via an nbconvert custom template? (The Download As... notebook option doesn’t seem to save the current HTML view of a notebook either?)
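To make the double-brace idea in the first item concrete, here is a toy sketch of that style of inline substitution, using the locked-quote scenario as the example. It is my own emulation rather than the python-markdown extension’s code: render_inline is a hypothetical name, the quote is invented, and it leans on eval(), which is only tolerable when you trust the text being rendered.

```python
import re

# Matches {{ expression }} and captures the expression text.
INLINE = re.compile(r"\{\{(.+?)\}\}")


def render_inline(markdown, namespace):
    """Replace each {{ expr }} with the string value of expr, evaluated in namespace."""
    def substitute(match):
        # eval() on untrusted text is unsafe; this is a sketch for your own notebooks only.
        return str(eval(match.group(1).strip(), {}, namespace))
    return INLINE.sub(substitute, markdown)


# An invented quote held in a variable, as it might be in a locked code cell.
quote = "We take all complaints seriously"

rendered = render_inline(
    'A spokesperson said: "{{ quote }}" ({{ len(quote.split()) }} words).',
    {"quote": quote},
)
print(rendered)
```

Because the text refers to the variable rather than repeating the quote, the quote only has to be checked (and locked) in one place.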

So – my reading is: tools are there to support the editing side (inline code, marking cells to be hidden etc) of dynamic document generation, but not necessarily the rendering to hard copy side, which needs to be done via nbconvert extensions?
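By way of illustration, this is the sort of logic a metadata-aware renderer needs: walk the cells, skip the input of any code cell flagged as hidden, but keep its output. The sketch below is not nbconvert and does not use its API; render_text is a made-up function, the hide_input metadata key is an assumption about how the flag might be stored, and cell sources and outputs are assumed to be plain strings.

```python
def render_text(notebook, flag="hide_input"):
    """Render a notebook dict to plain text, omitting the source of flagged code cells."""
    parts = []
    for cell in notebook["cells"]:
        if cell["cell_type"] == "markdown":
            parts.append(cell["source"])
        elif cell["cell_type"] == "code":
            # Show the code only if the cell has not been marked as hidden.
            if not cell.get("metadata", {}).get(flag, False):
                parts.append(cell["source"])
            # Always show any text written to stdout/stderr by the cell.
            for output in cell.get("outputs", []):
                if output.get("output_type") == "stream":
                    parts.append(output["text"].rstrip("\n"))
    return "\n\n".join(parts)


# A tiny hand-made notebook with one hidden and one visible code cell.
notebook = {
    "cells": [
        {"cell_type": "markdown", "metadata": {}, "source": "Fastest lap:"},
        {"cell_type": "code", "metadata": {"hide_input": True},
         "source": "print(min(times))",
         "outputs": [{"output_type": "stream", "name": "stdout", "text": "91.2\n"}]},
        {"cell_type": "code", "metadata": {},
         "source": "len(times)", "outputs": []},
    ]
}

document = render_text(notebook)
print(document)
```

The same test on cell metadata is what a custom nbconvert template or preprocessor would have to apply if the hidden cells are to stay hidden in the exported HTML or PDF.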

Related: Seven Ways of Running IPython Notebooks