n-gram / Multi-Word / Phrase Based Concordances in NLTK

A couple of days ago, my colleague Ray Corrigan shared with me a time consuming problem he was working on: looking for original uses, in previously published documents, drafts and bills, of sentences contained in a draft code of practice currently out for consultation. On the one hand you might think of this as an ‘original use’ detection problem; on the other, a “plagiarism detection” issue. Text analysis is a data problem I’ve not really engaged with, so this provided an interesting problem set – how can I detect common sentences across two documents? It has the advantage of being easily stated, and easily tested…

I’ll describe in another post how I tried to solve the problem, but in this post I’ll separate out one of the useful tricks I stumbled across along the way – how to display a multiple word concordance using the Python text analysis package, NLTK.

To set the scene, the NLTK concordance method will find the bit of a document that a search word appears in and display the word along with the bit of text just before and after it:


Here’s my hacked together code – you’ll see that the trick for detecting the phrase start is actually a piece of code I found on Stack Overflow relating to the intersection of multiple Python lists, wrapped up in another found recipe describing how to emulate nltk.text.concordance() and return the found segments:

import nltk

def n_concordance_tokenised(text,phrase,left_margin=5,right_margin=5):
    #concordance replication via https://simplypython.wordpress.com/2014/03/14/saving-output-of-nltk-text-concordance/
    phraseList=phrase.split(' ')

    c = nltk.ConcordanceIndex(text.tokens, key = lambda s: s.lower())

    #Find the offsets for each token in the phrase
    offsets=[c.offsets(x) for x in phraseList]

    #For each token in the phraselist, rebase its offsets to the start of the phrase
    offsets_norm=[]
    for i in range(len(phraseList)):
        offsets_norm.append([x-i for x in offsets[i]])

    #We have found the offset of a phrase if the rebased values intersect
    # http://stackoverflow.com/a/3852792/454773
    #the intersection method takes an arbitrary number of arguments
    intersects=set(offsets_norm[0]).intersection(*offsets_norm[1:])

    #Grab each matched phrase along with its left and right margins
    concordance_txt = [text.tokens[max(offset-left_margin,0):offset+len(phraseList)+right_margin]
                       for offset in intersects]
    outputs=[' '.join(con_sub) for con_sub in concordance_txt]
    return outputs

def n_concordance(txt,phrase,left_margin=5,right_margin=5):
    tokens = nltk.word_tokenize(txt)
    text = nltk.Text(tokens)
    return n_concordance_tokenised(text,phrase,left_margin=left_margin,right_margin=right_margin)
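The rebased-offset trick at the heart of this can be sketched standalone, without NLTK, using plain Python lists (the token stream and phrase below are made up for illustration):

```python
def find_phrase_offsets(tokens, phrase_tokens):
    # Offsets at which each phrase token appears in the token stream
    offsets = [[i for i, t in enumerate(tokens) if t.lower() == p.lower()]
               for p in phrase_tokens]
    # Rebase each token's offsets to the start of the phrase; the phrase
    # starts wherever every rebased offset list agrees
    rebased = [set(x - i for x in offs) for i, offs in enumerate(offsets)]
    return sorted(set.intersection(*rebased))

tokens = "the cat sat on the mat and the cat slept".split()
print(find_phrase_offsets(tokens, ["the", "cat"]))  # [0, 7]
```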

If there are better ways of doing this, please let me know via a comment:-)

PS thinking about it, I should possibly tokenise the phrase rather than split it? Then the phrase tokens would be generated in the same way as the tokens used in the matcher.

How to Run A Shiny App in the Cloud Using Tutum, Digital Ocean and Docker Containers

Via RBloggers, I spotted this post on Deploying Your Very Own Shiny Server. I’ve been toying with the idea of running some of my own Shiny apps, so that post provided a useful prompt, though way too involved for me;-)

So here’s what seems to me to be an easier, rather more pointy-clicky, wiring stuff together way using Docker containers (though it might not seem that much easier to you the first time through!). The recipe includes: github, Dockerhub, Tutum and Digital Ocean.

To begin with, I created a minimal shiny app to allow the user to select a CSV file, upload it to the app and display it. The ui.R and server.R files, along with whatever else you need, should be placed into an application directory – for example, shiny_demo – within a project directory, which I’m confusingly also calling shiny_demo. (I should have called it something else to make it a bit clearer – for example, shiny_demo_project.)

The shiny server comes from a prebuilt docker container on dockerhub – rocker/shiny.

This shiny server can run several Shiny applications, though I only want to run one: shiny_demo.

I’m going to put my application into its own container. This container will use the rocker/shiny container as a base, and simply copy my application folder into the shiny server folder from which applications are served. My Dockerfile is really simple and contains just two lines – it looks like this and goes into a file called Dockerfile in the project directory:

FROM rocker/shiny
ADD shiny_demo /srv/shiny-server/shiny_demo

The ADD command simply copies the contents of the child directory into a similarly named directory in the container’s /srv/shiny-server/ directory. You could add as many applications as you want to the server, as long as each is in its own directory. For example, if I have several applications:


I can add the second application to my container using:

ADD shiny_other_demo /srv/shiny-server/shiny_other_demo

The next thing I need to do is check my shiny_demo project into Github. (I don’t have a how to on this, unfortunately…) In fact, I’ve checked my project in as part of another repository (docker-containers).


The next step is to build a container image on DockerHub. If I create an account and log in to DockerHub, I can link my Github account to it.

I can then create an Automated Build that will build a container image from my Github repository. First, identify the repository on my linked Github account and name the image:


Then add the path to the project directory that contains the Dockerfile for the image you’re interested in:


Click on Trigger to build the image the first time. In the future, every time I update that folder in the repository, the container image will be rebuilt to include the updates.

So now I have a Docker container image on Dockerhub that contains the Shiny server from the rocker/shiny image and a copy of my shiny application files.

Now I need to go to Tutum (also part of the Docker empire), which is an application for launching containers on a range of cloud services. If you link your Digital Ocean account to Tutum, you can use Tutum to launch Docker containers from Dockerhub on a Digital Ocean droplet.

Within tutum, you’ll need to create a new node cluster on Digital Ocean:

(Notwithstanding the below, I generally go for a single 4GB node…)

Now we need to create a service from a container image:

I can find the container image I want to deploy on the cluster that I previously built on Dockerhub:


Select the image and then configure it – you may want to rename it, for example. One thing you definitely need to do though is tick to publish the port – this will make the shiny server port visible on the web.


Create and deploy the service. When the container is built, and has started running, you’ll be told where you can find it.


Note that if you click on the link to the running container, the default URL starts with tcp:// which you’ll need to change to http://. The port will be dynamically allocated unless you specified a particular port mapping on the service creation page.

To view your shiny app, simply add the name of the folder the application is in to the URL.

When you’ve finished running the app, you may want to shut the container down – and more importantly perhaps, switch the Digital Ocean droplet off so you don’t continue paying for it!


As I said at the start, the first time round seems quite complicated. After all, you need to:

  • create your Shiny application files in a project directory containing a simple Dockerfile;
  • check the project into Github;
  • set up an automated build on Dockerhub to build a container image from the Github repository;
  • use Tutum to launch a node cluster on Digital Ocean and deploy a service onto it from the Dockerhub image.

(Actually, you can miss out the dockerhub steps, and instead link your github account to your tutum account and do the automated build from the github files within tutum: Tutum automatic image builds from GitHub repositories. The service can then be launched by finding the container image in your tutum repository)

However, once you do have your project files in github, you can then easily update them and easily launch them on Digital Ocean. In fact, you can make it even easier by adding a deploy to tutum button to a project README.md file in Github.

See also: How to run RStudio on Digital Ocean via Tutum and How to run OpenRefine on Digital Ocean via Tutum.

PS to test the container locally, I launch a docker terminal from Kitematic, cd into the project folder, and run something like:

docker build -t psychemedia/shinydemo .
docker run --name shinydemo -i -t psychemedia/shinydemo

I can then set the port map and find a link to the server from within Kitematic.

Charting Terrorism Related Arrest Flows Through The Criminal Justice System

One of my daily read feeds is a list of the day’s government statistical releases. Today, I spotted a release on the Operation of police powers under the Terrorism Act 2000, quarterly update to September 2015, which included an annex on Arrests and outcomes, year ending September 2015:


I tweeted a link to the doc, and Michael/@fantasticlife replied with a comment that it might look interesting as a Sankey diagram…


So here’s a quick sketch generated using SankeyMATIC:


I took the liberty of adding an extra “InSystem” step into the chart to account for the feedback loop of the bailed arrests.

Here’s the data I used:

Arrested [192] InSystem
Arrested [115] Released without charge
Arrested [8] Alternative action
InSystem [124] Charged
InSystem [68] Released on bail
Charged [111] Terrorism Related
Charged [13] Non-terrorism related
Terrorism Related [36] Prosecuted.t
Terrorism Related [1] Not proceeded against
Terrorism Related [74] Awaiting prosecution
Non-terrorism related [6] Prosecuted.n
Non-terrorism related [2] Not proceeded against
Non-terrorism related [5] Awaiting prosecution
Prosecuted.t [33] Convicted (terrorism related)
Prosecuted.t [2] Convicted (non-terrorism related)
Prosecuted.t [1] Acquitted
Prosecuted.n [5]  Convicted (non-terrorism related)
Prosecuted.n [1] Acquitted

Looking at the diagram, I find the placement of the labels quite confusing and I’m not really sure what relates to what. (The numbers, for example…) It would also be neater if we could capture flows still “in-the-system”, for example by stopping the Released on bail element at the same depth as the Charged elements, and also keeping the Awaiting prosecution element short of the right hand side. (Perhaps the bail and awaiting elements could be added into a “limbo” field?)

So – nice idea; but as soon as you look at it, you see that even a trivial sketch immediately identifies all sorts of other issues that you need to take into account to make the diagram informatively glanceable…


Thinks.. SankeyMATIC is a d3.js app… it would be nice if I could drag the elements in the generator to make the diagram a bit clearer… maybe I can?
sankeymatic_1000x800 (1)

Only that’s wrong too… because the InSystem label applies to the boundary to the left, and the Bail label to the right… So we need to tweak it a bit more…

sankeymatic_1200x800 (1)

In fact, you may notice that the labels seem to be applied left and right justified according to different rules? Hmmm… Not so simple again…

How about if I take out the interstitial value I added?

sankeymatic_1200x800 (2)

That’s perhaps a bit clearer? And it all goes some way to showing how constructing a graphic is generally an iterative process, scaffolding the evolution of the diagram as you go, as you learn to see it/read it from different perspectives and tweak it to try to clarify particular communicative messages. (Which in this case, for me, was to try to tease out how far through the process various flows had got, as well as clearly identify final outcomes…)

Other things we could do to try to improve the graphic are experiment a bit more with the colour schemes. But that’s left as an exercise for the reader…;-)

Pondering New Ways of Programming Lego EV3 Mindstorms Bricks

We’re due to update our first level residential school day long robotics activity for next year, moving away from the trusty yellow RCX Lego Mindstorms bricks that have served us well for getting on a decade or so, I guess, and up to the EV3 Mindstorms bricks.

Students programmed the old kit via a brilliant user interface developed by my colleague Jon Rosewell, originally for the “Robotics and the Meaning of Life” short course, but quickly adopted for residential school, day school, and school-school (sic) outreach activities.


The left hand side contained a palette of textual commands that could be dragged onto the central canvas in order to create a tree-like programme. Commands could be dragged and relocated within the tree. Commands taking variable values had the values set by selecting the appropriate line of code and then using the dialogue in the bottom left corner to set the value.

The programme could be downloaded and executed on the RCX brick, or used to control the behaviour of the simple simulated robot in the right hand panel. A brilliant, brilliant piece of educational software. (A key aspect of this is how easy it is to support in a lab or classroom with a dozen student groups who need help debugging their programmes: the UI is clear, easily read over the shoulder, and buggy code can typically be fixed with a word or two of explanation.) The text-but-not-typed metaphor reduces typos (it’s really a drag and drop UI, but with text blocks rather than graphical blocks) as well as producing pretty readable code.

For the new residential school, we’ve been trying to identify what makes sense software wise. The default Lego software is based on LabVIEW, but I think it looks a bit toylike (which isn’t necessarily a problem) but IMHO could be hard to help debug in a residential school setting, which probably is an issue. “Real” LabVIEW can also be used to program the bricks (I think), but again the complexity of the software, and similar issues in quick-fire debugging, are potential blockers. Various third party alternatives to the Lego software are possible: LeJOS, a version of Java that’s been running on Mindstorms bricks for what feels like forever, is one possibility; another is ev3dev, a Linux distribution for the brick that lets you run things like Python, via the python-ev3 package. You can also run an IPython notebook from the brick – that is, the IPython notebook server runs on the brick and you can then access the notebook via a browser running on a machine with a network connection to the brick…

So as needs must (?!;-), I spent a few hours today setting up an EV3 with ev3dev, python-ev3 and an IPython notebook server. Following along the provided instructions, everything seemed to work okay with a USB connection to my Mac, including getting the notebooks to auto-run on boot, but I couldn’t seem to get an ssh or http connection with a bluetooth connection. I didn’t have a nano-wifi dongle either, so I couldn’t try a wifi connection.

The notebooks seem to be rather slow when running code cells, although responsiveness when I connected to the brick via an ssh terminal from my mac seemed pretty good, for running command line commands at least. Code popped into an executable, shebanged python file can be run from the brick itself simply by selecting the file from the on-board file browser, so a couple of possible workflows immediately suggest themselves:

  • programme the brick via an IPython notebook running on the brick, executing code a cell at a time to help debug it;
  • write the code somewhere, pop it into a text file, copy it onto the brick and then run it from the brick;

It should also be possible to export the code from a notebook into an executable file that could be run from the on-brick file browser.

Another option might be to run IPython on the brick, accessed from an ssh terminal, to support interactive development a line at a time:


This seems to be pretty quick/responsive, and offers features such as autocomplete prompts, though perhaps not as elegantly as the IPython notebooks manage.

However, the residential school activities require students to write complete programmes, so the REPL model of the interactive IPython interpreter is perhaps not the best environment?

Thinking more imaginatively about the setting, if we had wifi working, and with a notebook server running on the brick, I could imagine programming and interacting with the brick from an IPython notebook accessed via a browser on a tablet (assuming it’s easy enough to get network connections working over wifi?) This could be really attractive for opening up how we manage the room for the activity, because it would mean we could get away from the computer lab/computer workstation model for each group and have a far more relaxed lab setting. The current model has two elbow height demonstration tables about 6′ x 3’6 around which students gather for mildly competitive “show and tell” sessions, so having tablets rather than workstations for the programming could encourage working directly around the tables as well?

That the tablet model might be interesting to explore originally came to me when I stumbled across the Microsoft Touch Develop environment, which provides a simple programming environment with a keyboard reminiscent of that of a ZX Spectrum with single keyboard keys inserting complete text commands.


Sigh… those were the days…:


Unfortunately there doesn’t seem to be an EV3 language pack for Touch Develop:-(

However, there does appear to be some activity around developing a Python editor for use in Touch Develop, albeit just a straightforward text editor:

As you may have noticed, this seems to have been developed for use with the BBC Microbit, which will be running MicroPython, a version of Python3 purpose built for microcontrollers (/via The Story of MicroPython on the BBC micro:bit).

It’s maybe worth noting that TouchDevelop is accessed via a browser and can be run in the cloud or locally (touchdevelop local).

We’re currently also looking for a simple Python programming environment for a new level 1 course, and I wonder if something of this ilk might be appropriate for that…?

Finally, another robot related ecosystem that crossed my path this week, this time via @Downes – the Poppy Project, which proudly declares itself as “an open-source platform for the creation, use and sharing of interactive 3D printed robots”. Programming is via pypot, a python library that also works with the (also new to me) V-REP virtual robot experimentation platform, a commercial environment though it does seem to have a free educational license. (The poppy project folk also seem keen on IPython notebooks, auto-running them from the Raspberry Pi boards used to control the poppy project robots, not least as a way of sharing tutorials.)

I half-wondered if this might be relevant for yet another new course, this time at level 2, on electronics – though it will also include some robotics elements, including access (hopefully) to real robots via a remote lab. These will be offered as part of the OU’s OpenSTEM Lab which I think will be complementing the current, and already impressive, OpenScience Lab with remotely accessed engineering experiments and demonstrations.

Let’s just hope we can get a virtual computing lab opened too!

PS some notes to self about using the ev3dev:

  • for IP address xx.xx.xx.xx, connect with: ssh root@xx.xx.xx.xx and password r00tme
  • notebook startup script, with permissions 755, in /etc/init.d/ipev3:
    ipython notebook --no-browser --notebook-dir=/home --ip=* --port=8889 &

    register it with update-rc.d ipev3 defaults, then update-rc.d ipev3 disable (also: enable | start | stop | remove; these don’t work with the current init.d file – need a proper upstart script?)
  • look up the connection file: e.g. in a notebook, run %connect_info, then, from a local copy of the JSON file and the appropriate IP address xx.xx.xx.xx, connect with ipython qtconsole --existing ~/Downloads/ev3dev.json --ssh root@xx.xx.xx.xx and password r00tme
  • alternatively, on the brick, find the location of the connection file: first locate the profile with ipython locate profile, then look inside e.g. ls -al /root/.ipython/profile_default/security to find and view it.

See also: http://www.beein.cz/en/idea/pc/lego-ev3-python

Tinkering With MOOC Data – Rebasing Time

[I’ve been asked to take this post down because it somehow goes against, I dunno, something, but as a stop gap I’ll try to just remove the charts and leave the text, to see if I get another telling off…]

As a member of an organisation where academics tend to be course designers and course producers, and are kept as far away from students as possible (Associate Lecturers handle delivery as personal tutors and personal points of contact), I’ve never really got my head around what “learning analytics” is supposed to deliver: it always seemed far more useful to me to think about course analytics as a way of tracking how the course materials are working and whether they seem to be being used as intended. Rather than being interested in particular students, the emphasis would be more on how a set of online course materials work, in much the same way as tracking how any website works. Which is to say, are folk going to the pages you expect, spending the time on them you expect, reaching goal pages as and when you expect, and so on.

Having just helped out on a MOOC, I was allowed to have a copy of the course related data files the provider makes available to partners:

I'm not allowed to show you this, apparently...

The course was on learning to code for data analysis using the Python pandas library, so I thought I’d try to apply what was covered in the course (perhaps with a couple of extra tricks…) to the data that flowed from the course…

And here’s one of the tricks… rebasing (normalising) time.

For example, one of the things I was interested in was how long learners were spending on particular steps and particular weeks on the one hand, and how long their typical study sessions were on the other. This could then all be aggregated to provide some course stats about loading which could feed back into possible revisions of the course material, activity design (and redesign) etc.

Here’s an example of how a randomly picked learner progressed through the course:

I'm not allowed to show you this, apparently...

The horizontal x-axis is datetime, the vertical y-axis is an encoding of the week and step number, with clear separation between the weeks and steps within a week incrementally ordered. The points show the datetime at which the learner first visited the step. The points are coloured by “stint”, a trick I borrowed from my F1 data wrangling stuff: during the course of a race, cars complete several “stints”, where a stint corresponds to a set of laps completed on a particular set of tyres; analysing races based on stints can often turn up interesting stories…

To identify separate study sessions (“stints”) I used a simple heuristic – if the gap between the start times of consecutively studied steps exceeded a certain threshold (55 minutes, say), then I assumed that the steps were considered in separate study sessions. This needs a bit of tweaking, possibly, perhaps including timestamps from comments or question responses that can intrude on long gaps to flag them as not being breaks in study, or perhaps making the decision about whether the gap between two steps is actually a long one compared to a typically short median time for that step? (There are similar issues in the F1 data, for example when trying to work out whether a pit stop may actually be a drive-through penalty rather than an actual stop.)
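As a minimal sketch of that heuristic (the visit times below are made up for illustration, not taken from the course data):

```python
from datetime import datetime, timedelta

def assign_stints(first_visit_times, threshold=timedelta(minutes=55)):
    # Label each step visit with a stint number: a new stint starts
    # whenever the gap since the previous visit exceeds the threshold
    stints, stint = [], 0
    for i, t in enumerate(first_visit_times):
        if i and (t - first_visit_times[i - 1]) > threshold:
            stint += 1
        stints.append(stint)
    return stints

visits = [datetime(2015, 11, 1, 9, 0), datetime(2015, 11, 1, 9, 20),
          datetime(2015, 11, 1, 11, 5), datetime(2015, 11, 1, 11, 30)]
print(assign_stints(visits))  # [0, 0, 1, 1]
```

The 105 minute gap between the second and third visits starts a new stint; the other gaps fall under the threshold.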

In the next example, I rebased the time for two learners based on the time they first encountered the first step of the course. That is, the “learner time” (in hours) for a given step is the time between them first seeing that step and first seeing their first step. The colour field distinguishes between the two learners.
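The rebasing itself is just a subtraction against each learner’s own origin time – something like this (again, with made-up times):

```python
from datetime import datetime

def rebase_hours(first_visit_times):
    # Hours elapsed since this learner first saw their first step
    origin = min(first_visit_times)
    return [(t - origin).total_seconds() / 3600 for t in first_visit_times]

visits = [datetime(2015, 11, 1, 9, 0), datetime(2015, 11, 1, 10, 30),
          datetime(2015, 11, 2, 9, 0)]
print(rebase_hours(visits))  # [0.0, 1.5, 24.0]
```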

I'm not allowed to show you this, apparently...

We can draw on the idea of “stints”, or learner sessions, further, and use the earliest time within a stint to act as the origin. So for example, for another random learner, here we see an incremental encoding of the step number on the y-axis, with the weeks clearly separated, the “elapsed study session time” along the horizontal x-axis, and the colour mapping out the different study sessions.

I'm not allowed to show you this, apparently...

The spacing on the y-axis needs sorting out a bit more so that it shows clearer progression through steps, perhaps by using an ordered categorical axis with a faint horizontal rule separator to distinguish the separate weeks. (Having an interactive pop-up that contains some information about the particular step each mark refers to, as well as information about how much time was spent on it, whether there was commenting activity, what the mean and median study times for the step are, etc etc, could also be useful.) However, I have to admit that I find charting in pandas/matplotlib really tricky, and only seem to have slightly more success with seaborn; I think I may need to move this stuff over to R so that I can make use of ggplot, which I find far more intuitive…

Finally, whilst the above charts are at the individual learner level, my motivation for creating them was to better understand how the course materials were working, and to try to get my eye in to some numbers that I could start to track as aggregate numbers (means, medians, etc) over the course as a whole. (Trying to find ways of representing learner activity so that we could start to try to identify clusters or particular common patterns of activity / signatures of different ways of studying the course, is then another whole other problem – though visual insights may also prove helpful there.)

Some Jupyter Notebook / nbconvert Housekeeping Hints

A few snippets and bits of pieces regarding housekeeping around Jupyter notebooks.

Clearing Output Cells

Via Matthias Bussonnier, this handy command for rendering a version of a notebook with the output cells cleared:

jupyter nbconvert --to notebook --ClearOutputPreprocessor.enabled=True NOTEBOOK.ipynb

Adding --inplace will rewrite the notebook with cleared output cells.

Custom Templates

If you have a custom template or a custom config file in the current directory, you can invoke them using:

jupyter nbconvert --config=my_config.py NOTEBOOK.ipynb
jupyter nbconvert --template=my_template.tpl NOTEBOOK.ipynb

I found running the --log-level=DEBUG flag was also handy…

Via MinRk, additional paths can be set using c.TemplateExporter.template_path.append('/path/to/templates'), though I’m not really sure where that setting needs to be applied. (Whilst I love the Jupyter project, I really struggle to keep track of where things are supposed to be located and which bits are working/don’t work anymore:-()

He also notes that [a]bsolute template paths will also work if you specify: c.TemplateExporter.template_path.append('/'), adding the comment that [a]bsolute paths for templates should probably work without modifying template_path, but they don’t right now.

It would be really handy if the ability to specify an absolute path in the command line setting did work out of the can…

Split a Long Notebook into Multiple Sub-Notebooks

Jupyter notebooks allow you to split a long cell into two cells at the cursor, but how about splitting a long notebook into multiple notebooks?

The following test script will split a notebook into sub-notebooks at an explicit split point – a markdown cell containing just the string SPLIT NOTEBOOK.

import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4

#Partition a long notebook into subnotebooks at specified split points
#Enter SPLIT NOTEBOOK on its own in a markdown cell to specify a split point
mynb = nb.read('notebook.ipynb', 4)

count = 1
cells = []
for i in mynb['cells']:
    if (i['cell_type']=='markdown') and ('SPLIT NOTEBOOK' in i['source']):
        #Flush the cells collected so far into a sub-notebook
        nb.write(nb4.new_notebook(cells=cells), 'subNotebook{}.ipynb'.format(count))
        count += 1
        cells = []
    else:
        cells.append(i)

#Write out any cells remaining after the last split point
if cells:
    nb.write(nb4.new_notebook(cells=cells), 'subNotebook{}.ipynb'.format(count))
I should probably tidy this up so that it reuses the original notebook name rather than the subNotebook stub. It might also be handy to turn this into a notebook extension that splits the current notebook into two relative to the current cursor location (e.g. all cells above the selected cell go in one sub-notebook, everything from the selected cell to the end of the notebook going into a second sub-notebook).

Another refinement might allow for the declaration of a common set of cells to prefix each sub-notebook. By default, this could be the set of cells prior to the first split point. (Which is to say, for N split points there would be N, rather than N+1, sub-notebooks, with the cells above the first split point appearing in each sub-notebook. The first sub-notebook would thus contain the cells starting with the first cell after the first split point, prefixed by the cells appearing before the first split point; and the last sub-notebook would contain the cells starting with the first cell after the last split point, prefixed once again by the cells appearing before the first split point.)

A second utility to merge, or concatenate, two or more notebooks when provided with their filenames might also be handy…
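Since .ipynb files are just JSON, a first pass at such a merge utility can be sketched with the standard library alone (the function name, and the decision to keep the first notebook’s metadata, are my own choices):

```python
import json

def merge_notebooks(filenames, outfile='merged.ipynb'):
    # An .ipynb file is JSON: concatenate the cell lists of each
    # notebook in turn, reusing the first notebook's metadata
    merged = None
    for fn in filenames:
        with open(fn) as f:
            notebook = json.load(f)
        if merged is None:
            merged = notebook
        else:
            merged['cells'].extend(notebook['cells'])
    with open(outfile, 'w') as f:
        json.dump(merged, f, indent=1)
```

A more robust version would use the nbformat machinery to validate the notebooks and reconcile differing nbformat versions.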

Anything Else?

So what other handy Jupyter notebook / nbconvert housekeeping hints and tricks am I missing?

Some Random Upstart Debugging Notes…

…just so I don’t lose them…

dmesg spews out messages the Linux kernel has been issuing… which should also appear in /var/log/syslog (h/t Rod Norfor)

/var/log/upstart/SERVICE.log has log messages from trying to start a service SERVICE.

/etc/init.d should contain what looks like a generic sort of file with filename SERVICE; the actual config file containing the command that starts the service goes in a file SERVICE.conf in /etc/init.

To generate the files that will have a go at auto-running the service, run the command update-rc.d SERVICE defaults.

Start a service with service SERVICE start, stop it with service SERVICE stop, and restart (stop if started, then start) with service SERVICE restart. Find out what’s going on with it using service SERVICE status.