Simple Text Analysis Using Python – Identifying Named Entities, Tagging, Fuzzy String Matching and Topic Modelling

Text processing is not really my thing, but here’s a round-up of some basic recipes that allow you to get started with some quick’n’dirty tricks for identifying named entities in a document, and tagging entities in documents.

In this post, I’ll briefly review some getting started code for:

  • performing simple entity extraction from a text; for example, when presented with a text document, label it with named entities (people, places, organisations); entity extraction is typically based on statistical models that rely on document features such as correct capitalisation of names to work correctly;
  • tagging documents that contain exact matches of specified terms: in this case, we have a list of specific text strings we are interested in (for example, names of people or companies) and we want to know if there are exact matches in the text document and where those matches occur in the document;
  • partial and fuzzing string matching of specified entities in a text: in this case, we may want to know whether something resembling a specified text string occurs in the document (for example, mis0spellings of name);
  • topic modelling: the identification, using statistical models, of “topic terms” that appear across a set of documents.

You can find a gist containing a notebook that summarises the code here.

Simple named entity recognition

spaCy is a natural language processing library for Python library that includes a basic model capable of recognising (ish!) names of people, places and organisations, as well as dates and financial amounts.

According to the spaCy entity recognition documentation, the built in model recognises the following types of entity:

  • PERSON People, including fictional.
  • NORP Nationalities or religious or political groups.
  • FACILITY Buildings, airports, highways, bridges, etc.
  • ORG Companies, agencies, institutions, etc.
  • GPE Countries, cities, states. (That is, Geo-Political Entitites)
  • LOC Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT Objects, vehicles, foods, etc. (Not services.)
  • EVENT Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART Titles of books, songs, etc.
  • LANGUAGE Any named language.
  • LAW A legislation related entity(?)

Quantities are also recognised:

  • DATE Absolute or relative dates or periods.
  • TIME Times smaller than a day.
  • PERCENT Percentage, including “%”.
  • MONEY Monetary values, including unit.
  • QUANTITY Measurements, as of weight or distance.
  • ORDINAL “first”, “second”, etc.
  • CARDINAL Numerals that do not fall under another type.

Custom models can also be trained, but this requires annotated training documents.

#!pip3 install spacy
from spacy.en import English
parser = English()
That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future relationship with the single market and customs union; and calls on the Government to intervene and work with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent further job losses across Nestlé.
#Code "borrowed" from somewhere?!
def entities(example, show=False):
    if show: print(example)
    parsedEx = parser(example)

    print("-------------- entities only ---------------")
    # if you just want the entities and nothing else, you can do access the parsed examples "ents" property like this:
    ents = list(parsedEx.ents)
    for entity in ents:
        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
        term=' '.join(t.orth_ for t in entity)
        if ' '.join(term) not in tags:
            tags[term]=[(entity.label, entity.label_)]
            tags[term].append((entity.label, entity.label_))
-------------- entities only ---------------
{'House': [(380, 'ORG')], '300': [(393, 'CARDINAL')], 'Nestlé': [(380, 'ORG')], '\n York , Fawdon': [(381, 'GPE')], 'Halifax': [(381, 'GPE')], 'Girvan': [(381, 'GPE')], 'the Blue Riband': [(380, 'ORG')], 'Poland': [(381, 'GPE')], '\n': [(381, 'GPE'), (381, 'GPE')], 'the first three months of 2017': [(387, 'DATE')], '£ 21 billion': [(390, 'MONEY')], '0.4 per': [(390, 'MONEY')], 'the same period in 2016': [(387, 'DATE')], '156': [(393, 'CARDINAL')], 'York': [(381, 'GPE')], '\n the': [(381, 'GPE')], 'six': [(393, 'CARDINAL')], '2,000': [(393, 'CARDINAL')], 'the South East': [(382, 'LOC')], '110': [(393, 'CARDINAL')], 'Fawdon': [(381, 'GPE')], 'Newcastle': [(380, 'ORG')], 'a month of': [(387, 'DATE')], 'Article 50': [(21153, 'LAW')], 'EU': [(380, 'ORG')], 'UK': [(381, 'GPE')], 'GMB': [(380, 'ORG')], 'Unite': [(381, 'GPE')]}
q= "Bob Smith was in the Houses of Parliament the other day"
-------------- entities only ---------------
{'Bob Smith': [(377, 'PERSON')]}

Note that the way that models are trained typically realises on cues from the correct capitalisation of named entities.

-------------- entities only ---------------


A simplistic, and quite slow, tagger, supporting limited recognition of Locations (I-LOC), Organizations (I-ORG) and Persons (I-PER).

#!pip3 install polyglot

##Mac ??
#!brew install icu4c
#I found I needed: pip3 install pyicu, pycld2, morfessor
#apt-get install libicu-dev
!polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
from polyglot.text import Text

text = Text(example)
[I-PER(['Bob', 'Smith'])]

Partial Matching Specific Entities

Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs’ names, or a list of ogranisations of subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.

To do this, we can search a text for strings that exactly match any of the specified terms or where any of the specified terms match part of a longer string in the text.

Naive implementations can take a signifcant time to find multiple strings within a tact, but the Aho-Corasick algorithm will efficiently match a large set of key values within a particular text.

## The following recipe was hinted at via @pudo

#!pip3 install pyahocorasick

First, construct an automaton that identifies the terms you want to detect in the target text.

from ahocorasick import Automaton

A.add_word("European Union",('VOCAB','European Union'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson'))
A.add_word("Boris",('PERSON','Boris Johnson'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson (LC)'))

q2='Boris Johnson went off to Europe to complain about the European Union'
for item in A.iter(q2):
    print(item, q2[:item[0]+1])
(4, ('PERSON', 'Boris Johnson')) Boris
(12, ('PERSON', 'Boris Johnson')) Boris Johnson
(31, ('VOCAB', 'Europe')) Boris Johnson went off to Europe
(60, ('VOCAB', 'Europe')) Boris Johnson went off to Europe to complain about the Europe
(68, ('VOCAB', 'European Union')) Boris Johnson went off to Europe to complain about the European Union

Once again, case is important.

q2l = q2.lower()
for item in A.iter(q2l):
    print(item, q2l[:item[0]+1])
(12, ('PERSON', 'Boris Johnson (LC)')) boris johnson

We can tweak the automata patterns to capture the length of the string match term, so we can annotate the text with matches more exactly:

A.add_word("Europe",(('VOCAB', len("Europe")),'Europe'))
A.add_word("European Union",(('VOCAB', len("European Union")),'European Union'))
A.add_word("Boris Johnson",(('PERSON', len("Boris Johnson")),'Boris Johnson'))
A.add_word("Boris",(('PERSON', len("Boris")),'Boris Johnson'))

for item in A.iter(q2):
    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))
(4, (('PERSON', 5), 'Boris Johnson')) *Boris* Jo
(12, (('PERSON', 13), 'Boris Johnson')) *Boris Johnson* we
(31, (('VOCAB', 6), 'Europe')) to *Europe* to
(60, (('VOCAB', 6), 'Europe')) he *Europe*an 
(68, (('VOCAB', 14), 'European Union')) he *European Union*

Fuzzy String Matching

Whilst the Aho-Corasick approach will return hits for strings in the text that partially match the exact match key terms, sometimes we want to know whether there are terms in a text that almost match terms in specific set of terms.

Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a fuzzy way with entities in a specified list.


The python fuzzyset package will try to match a specified string to similar strings in a list of target strings, returning a single item from a specified target list that best matches the provided term.

For example, if we extract the name Boris Johnstone in a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.

A confidence value expresses the degree of match to terms in the fuzzy match set list.

import fuzzyset

fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:

#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
([(0.8666666666666667, 'Boris Johnson')],
 [(0.8333333333333334, 'Diane Abbott')],
 [(0.23076923076923073, 'Diane Abbott')])


If we want to try to find a fuzzy match for a term within a text, we can use the python fuzzywuzzy library. Once again, we spcify a list of target items we want to try to match against.

from fuzzywuzzy import process
from fuzzywuzzy import fuzz
terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']

q= "Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day"
[('Houses of Parliament', 90), ('Diane Abbott', 90), ('Boris Johnson', 86)]

By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:

process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)
[('Houses of Parliament', 90), ('Boris Johnson', 85)]

A range of fuzzy match scroing algorithms are supported:

  • WRatio – measure of the sequences’ similarity between 0 and 100, using different algorithms
  • QRatio – Quick ratio comparison between two strings
  • UWRatio – a measure of the sequences’ similarity between 0 and 100, using different algorithms. Same as WRatio but preserving unicode
  • UQRatio – Unicode quick ratio
  • ratio
  • `partial_ratio – ratio of the most similar substring as a number between 0 and 100
  • token_sort_ratio – a measure of the sequences’ similarity between 0 and 100 but sorting the token before comparing
  • partial_token_set_ratio
  • partial_token_sort_ratio – ratio of the most similar substring as a number between 0 and 100 but sorting the token before comparing

More usefully, perhaps, is to return items that match above a particular confidence level:

[('Houses of Parliament', 90), ('Diane Abbott', 90)]

However, one problem with the fuzzywuzzy matcher is that it doesn’t tell us where in the supplied text string the match occurred, or what string in the text was matched.

The fuzzywuzzy package can also be used to try to deduplicate a list of items, returning the longest item in the duplicate list. (It might be more useful if this is optionally the first item in the original list?)

names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']
process.dedupe(names, threshold=80)
['Joanna Lumley', 'Boris Johnstone', 'Diane Abbott']

It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:

import hashlib

for t in names:
    #Generate a key based on the sorted members of the set
    keyvals=sorted(set([x[0] for x in fuzzyset]),key=lambda x:names.index(x),reverse=False)
    if len(keyvals)>1 and key not in fuzzed:
        clusters[key]=sorted(set([x for x in fuzzyset]),key=lambda x:names.index(x[0]),reverse=False)
for cluster in clusters:
[('Diane Abbott', 100), ('Diana Abbot', 87)]
[('Boris Johnson', 100), ('Boris Johnstone', 93), ('Boris Johnston', 96)]

OpenRefine Clustering

As well as running as a browser accessed application, OpenRefine also runs as a service that can be accessed from Python using the client libary.

In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items.

#!pip install git+
#NOTE - this requires a python 2 kernel
#Initialise the connection to the server using default or environment variable defined server settings
#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', ''))
#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))
from google.refine import refine, facet
server = refine.RefineServer()
orefine = refine.Refine(server)
#Create an example CSV file to load into a test OpenRefine project
project_file = 'simpledemo.csv'
with open(project_file,'w') as f:
    for t in ['Name']+names+['Boris Johnstone']:
        f.write(t+ '\n')
!cat {project_file}
Diane Abbott
Boris Johnson
Boris Johnstone
Diana Abbot
Boris Johnston
Joanna Lumley
Boris Johnstone

OpenRefine supports a range of clustering functions:

- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic
- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}
- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}
for cluster in clusters:
[{'count': 1, 'value': u'Diana Abbot'}, {'count': 1, 'value': u'Diane Abbott'}]
[{'count': 2, 'value': u'Boris Johnstone'}, {'count': 1, 'value': u'Boris Johnston'}]

Topic Models

Topic models are statistical models that attempts to categorise different “topics” that occur across a set of docments.

Several python libraries provide a simple interface for the generation of topic models from text contained in multiple documents.


#!pip3 install gensim
from gensim import corpora, models

def get_lda_from_lists_of_words(lists_of_words, **kwargs):
    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers
    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document
    tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency
    corpus_tfidf = tfidf[corpus]
    kwargs["id2word"] = dictionary # set the dictionary
    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling

def print_top_terms(lda, num_terms=10):
    for i in range(0, num_terms):
        terms = [term for term,val in lda.show_topic(i,num_terms)]
        txt.append("\t - top {} terms for topic #{}: {}".format(num_terms,i,' '.join(terms)))
    return '\n'.join(txt)

To start with, let’s create a list of dummy documents and then generate word lists for each document.

docs=['The banks still have a lot to answer for the financial crisis.',
     'This MP and that Member of Parliament were both active in the debate.',
     'The companies that work in finance need to be responsible.',
     'There is a reponsibility incumber on all participants for high quality debate in Parliament.',
     'Corporate finance is a big responsibility.']

#Create lists of words from the text in each document
from nltk.tokenize import word_tokenize
docs = [ word_tokenize(doc.lower()) for doc in docs ]

#Remove stop words from the wordlists
from nltk.corpus import stopwords
docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]

Now we can generate the topic models from the list of word lists.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: parliament debate active
     - top 3 terms for topic #1: responsible work need
     - top 3 terms for topic #2: corporate big responsibility

The model is randomised – if we run it again we are likely to get a different result.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: finance corporate responsibility
     - top 3 terms for topic #1: participants quality high
     - top 3 terms for topic #2: member mp active

O’Reilly LaunchBot – Like a Personal Binder App for Running Jupyter Notebooks and Other Containerised Browser Accessed Apps

One of the issues I keep coming up against when trying to encourage folk to at least give Jupyter notebooks a try is that “it’s too X to install” (for negatively sentimented X…). One way round this is to use something like Binder, which allows you to create a Docker image from the contents of a public Github repository and run a container instance from that image on somebody else’s server, accessing the notebook running “in the cloud” via your own browser. A downside to this approach is that the public binder service can be a bit slow – and a bit flaky. And it probably doesn’t scale well if you’re running a class. (Institutions running their own elastic Binder service could work round this, of course…)

So here’s yet another possible way in for folk – O’Reilly Media’s LaunchBot (via O’Reilly Media CTO Andrew Odewahn’s draft article Computational Publishing with Jupyter) – though admittedly it also requires the possibly (probably?!)  “too hard” requirement to install Docker as a pre-requisite. Which means it’s possibly doomed from the start in my institution…

Anyway, for anyone not in a Computing HE department, and who’s willing to install Docker first (it isn’t really hard at all unless your IT department has got their hands on your computer first: just download the appropriate Community Edition for your operating system (Mac, Windows, Linux) from here), the LaunchBot app provides a handy way to run Jupyter environments in a custom Docker container, “preloaded” with notebooks from a public Github repo.

By the by, if you’re on a Mac, you may be warned off installing LaunchBot.

Simply select the app in a Finder window, right click on it, and then select “Open” from the pop-up menu; agree that you do want to launch it, and it should run happily thereafter.

In a sense, LaunchBot is a bit like running a personal Binder instance: the LaunchBot app, which runs from the desktop but is accessed via browser, allows you to clone projects from public Github repos that contain a Dockerfile, and perhaps a set of notebooks. (You can use this url as an example – it’s a set of notebooks from original OU/FutureLearn Learn to Code for Data Analysis course.) The project README is displayed as part of the project homepage, and an option given to view, edit and save the project Dockerfile.

The Dockerfile can then be used to launch an instance of a corresponding container using your own (locally running) Docker instance:

(A status bar in the top margin tracks progress as image layers are downloaded and Dockerfile commands run.)

The base image specified in the Dockerfile will be downloaded, the Dockerfile run, and I’m guessing links to all services running on exposed ports are displayed:

In the free plan, you are limited to cloning from public repos, with no support for pushing changes back to a repo. In the $7 a month fee based plan, pulls from private repos and pushes to all repos are supported.

From a quick play, LaunchBot is perhaps easier to use than Kitematic. Whilst neither of them support launching linked containers using docker-compose,  LaunchBot’s ability to clone a set of notebooks (and other files) from a repo, as well as the Dockerfile, makes it more attractive for delivering content as well as the custom container runtime environment.

I had a quick look around to see if the source code for LaunchBot was around anywhere, but couldn’t find it offhand. The UI is a likely to be little bit scary for many people who don’t really get the arcana of git (which includes me! I prefer to use it via the web UI ;-) and could be simplified for users who are unlikely to want to push commits back. For example, students might only need to to pull a repo once from a remote master. On the other hand, it might be useful to support some simplified mechanism for pulling down updates (or restoring trashed notebooks?), with conflicts on the client side managed in a very sympathetic and hand-holdy way (perhaps by backing up any files that would otherwise be overwritten from the remote master?!). (At the risk of feature creeping LaunchBot, more git-comfortable users might find some way of integrating the nbdime notebook diff-er useful?)

Being able to brand the LaunchPad UI would also be handy…

From a distributed authoring perspective, the git integration could be handy. As far as the TM351 module team experiments in coming up with our own distributed authoring processes go, the use of a private Github repo means we can’t use the LaunchBot approach, at least under the free plan. (From the odd scraps I’ve found around the OU’s new OpenCreate authoring system, it supposedly supports versioning in some way, so it’ll be interesting to see how that works…) The TM351 dockerised VM also makes use of multiple containers linked using docker-compose, so I guess to use it in a LaunchBot context I’d need to build a monolithic container that runs all the required services (Postgres and MongoDB, as well as Jupyter) in a single image (or may I could run docker-compose inside a Docker container?)

[See also Seven Ways of Running IPython / Jupyter Notebooks, which I guess I probably needs another update…]

PS this could also be a handy tool for supporting authoring and maintenance workflows around notebooks: nbval, a “py.test plugin to validate Jupyter notebooks”. I wonder if it’s happy to test the raising of particular errors/warnings, as well as cleanly executed code fragments? Certainly, one thing we’re lacking in our Jupyter notebook production route at the moment is a sensible way of testing notebooks. At the risk of complicating things further, I wonder how that might fit into something like LaunchBot?

PPS Binder also lacks support for docker-compose, though I have asked about this and there may be a way in if someone figures out a spawner that invokes docker-compose.

Python “Natural Time Periods” Package

Getting on for a year ago, I posted a recipe for generating “Natural Language” Time Periods in Python. At the time, a couple of people asked whether it was packaged – and it wasn’t…

It’s still not on pip, but I have since made a package repo for it: python-natural-time-periods.

Install using: pip3 install --force-reinstall --no-deps --upgrade git+

The following period functions are defined:

  • today(), yesterday(), tomorrow()
  • last_week(), this_week(), next_week(), later_this_week(), earlier_this_week()
  • last_month(), next_month(), this_month(), earlier_this_month(), later_this_month()
  • day_lastweek(), day_thisweek(), day_nextweek()

Here’s an example of how to use it:

import natural_time_periods as ntpd
>>>, 8, 9)

>>> (, 7, 31),, 8, 6))

>>> (, 8, 10),, 8, 31))

>>>, 7, 31)

ntpd.day_lastweek(ntpd.MON, iso=True)
>>> '2017-07-31'

NHS Digital Organisation Data Service (ODS) Python / Pandas Data Loader

One of the nice things about the Python pandas data analysis package is that there is a small – but growing – amount of support for downloading data from third party sources directly as a pandas dataframe using the pandas-datareader.

So I thought I’d have a go at producing a package inspired by the pandas wrapper for the World Bank indicators API for downloading administrative data from the NHS Digital Organisation Data Service. You can find the package here: python-pandas-datareader-NHSDigital.

There’s also examples of how to use the package here: python-pandas-datareader-NHSDigital demo notebook.

Replacing RobotLab…?

Somewhen around 2002-3, when we first ran the short course “Robotics and the Meaning of Life”, my colleague Jon Rosewell developed a drag and drop text based programming environment – RobotLab – that could be used to programme the behaviour of a Lego Mindstorms/RCX brick controlled robot, or a simulated robot in an inbuilt 2D simulator.


The environment was also used in various OU residential schools, although an update in 2016 saw us move from the old RCX bricks to Lego EV3 robots, and with it a move to the graphical Lego/Labview EV3 programming environment.

RobotLab is still used in the introductory robotics unit of a level 1 undergrad OU course, a unit that is itself an update of the original Robotics and the Meaning of Life short course. And although the software is largely frozen – the screenshot below shows it running under Wineskin on my Mac – it continues to do the job admirably:

  • the environment is drag and drop, to minimise errors, but uses a text based language (inspired originally by Lego scripting code, which it generated to control the actual RCX powered robots);
  • the simulated robot could be configured by the user, with noise being added to sensor inputs and motor outputs, if required, and custom background images could be loaded into the simulator:
  • a remote control panel could be used to control the behaviour of the real – or simulated – robot to provide simple tele-operation of it. A custom remote application for the residential school allowed a real robot to controlled via the remote app, with a delay in the transmission of the signal that could be used to model the signal flight time to a robot on the moon! The RobotLab remote app provided a display to show the current power level of each motor, as well as the values of any sensor readings.

– the RobotLab environment allowed instructions to be stepped through an instruction at a time, in order to support debugging;
– a data logging tool allowed real or simulated logged data to be “uploaded” and viewed as a time series line chart.

Time moves on, however, and we’re now starting to thing about revising the robotics material. We had started looking at an updated, HTML5 version of RobotLab last year, but that effort seems to have stalled. So I’ve started looking for an alternative.

Robot Operating System (ROS)

Following on from a capital infrastructure bid a couple of years ago, we managed to pick up a few Baxter robots that are intended to be used in the “real, remote experiment” OpenSTEM Lab. (Baxter is also demoed at the level 1 engineering residential school.) Baxter runs under ROS, the Robot Operating System, and can be programmed using Python. Both 2D and 3D simulators are available for ROS, which means we could go down the ROS root as a the programming environment for any revised level 1 course.

At this point, it’s also worth saying that the level 1 course is something of a Frankenstein course, including introductory units to Linux and networking, as well as robotics. If the course is rewritten, the current idea is to replace the Linux module with one on more general operating systems. This means that we could try to create a set of modules that all complement each other, yet standalone as separate modules. For example, ROS works on a client server model, which allows us to foreshadow, or counterpoint, ideas arising in the other units.

The relative complexity of the ROS environment means that we can also use it to motivate the idea of using virtual machines for running scientific software with complex dependencies and rather involved installation processes. However, from a quick look at the ROS tools, they do look rather involved for a first intro course and I think would require quite a bit of wrapping to hide some of the complexity.

If you’re interested, here‘s a quick run down of what needs adding to a base Linux 16.04 install:

The simulator and the keyboard remote need launching from separate terminal processes.

Open RobertaLab

A rather more friendly environment we could work with is the blockly based RobertaLab. Scratch is already being used in one of the new first level courses to introduce basic programming, of a sort, so the blocks style environment is one that students will see elsewhere, albeit in a rather simplistic fashion. (I’d argued for using BlockPy instead, but the decision had already been taken…)

RobertaLab runs as an online hosted environment, but the code is openly licensed which means we should be able to run it institutionally, or allow students to run it themselves.

In keeping with complementing the operating systems unit, we could use a Docker containerised version of RobertaLab to allow students to run RobertaLab on their own machines:

A couple of other points to note about RobertaLab. Firstly, it has a simulator, with several predefined background environments. (I’m not sure if new ones can be easily uploaded, but if not the repo can be forked an new ones added form src.)

As with BlockPy, there’s the ability to look at Python script generated from the blocks view. In the EV3Dev selection, Python code compatible with the Ev3dev library is generated. But other programming environments are available too, in which case different code is generated. For example, the EV3lejos selection will generate Java code.

This ability to see the code behind the blocks is a nice stepping stone towards text based programming, although unlike BlockPy I don’t think you can go from code to blocks? The ability to generate code for different languages from the same blocks programme could also be used to make a powerful point, I think?

Whether or not we go with the robotics unit rewrite, I think I’ll try to clone the current RobotLab activities in using the RobertaLab environment, and also add some extra value bits by commenting on the Python code that is generated.

By the by, here’s a clone of the Dockerfile used by exmatrikulator for building the RobertaLab container image:

In the original robotics unit, one of the activities looked at how a simple neural network could be trained using data collected by the robot. I’m not sure if Dale Lane’s Machine Learning for Kids application, which also looks to be written using block.lyScratch. This means it probably can’t be integrated with Open RobertaLab, but even if it isn’t, although it would perhaps make a nice complement to both the use of OpenRobertaLab to control a simple simulated robot, and the use of Scratch in the other level 1 module as an environment for building simple games as a way of motivating the teaching of basic programming concepts.


No-one’s interested in Jupyter notebooks, but folk seem to think Scratch is absolutely bloody lovely for teaching programming to novice ADULTS, so I might as well go with them in adopting that style of interface…

That said, I should probably keep tabs on what Doug Blank is up to with his Conx dabblings, in case there is a sudden interest in using notebooks…

Playgrounds, Playgrounds, Everywhere…

A quick round up of things I would have made time to try to play with in the past, but that I can’t get motivated to explore, partly because there’s nowhere I can imagine using it, and partly because, as blog traffic to this blog dwindles, there’s no real reason to share. So there here just as timeline markers in the space of tech evolution, if nothing else.

Play With Docker Classroom

Play With Docker Classroom provides an online interactive playground for getting started playing with classroom. A ranges of tutorials guide you through the commands you need to run to explore the Docker ecosystem and get your own demo services up and running.

You’re provided with a complete docker environment, running in the cloud (so no need to install anything…) and it also looks as if you can go off-piste. For example, I set one of my own OpenRefine containers running:

…had a look at the URL used to preview the hosted web services launched as part of one of the official tutorials, and made the guess that I could change the port number in the subdomain to access my own OpenRefine container…


Python Anywhere

Suspecting once again I’ve been as-if sent to Coventry by the rest of my department, one of the few people that still talks to me in the OU tipped me off to the online hosted Python Anywhere environment. The free plan offers you a small hosted Python environment to play with, accessed via a browser IDE, and that allows you to run a small web app, for example. There’s a linked MySQL database for preserving state, and the ability to schedule jobs – so handy for managing the odd scraper, perhaps (though I’m not sure what external URLs you can access?)

The relatively cheap (and up) paid for accounts also offer a Jupyter notebook environment – it’s interesting this isn’t part of the free plan, which makes me wonder if that environment works as an effective enticement to go for the upgrade?

The Python Anywhere environment also looks as if it’s geared up to educational use – student’s signing up to the service can nominate another account as their “teacher”, which allows the teacher to see the student files, as well as get a realtime shared view of their console.

(Methinks it’d be quite interesting to have a full feature by feature comparison of Python Anywhere and CoCalc (SageMathCloud, as was…)


Amazon Web Services (AWS) Service Application Model (SAM) describes a way of hosting serverless applications using Amazon Lambda functions. A recent blogpost describes a local Docker testing environment that allows you to develop and test SAM compliant applications on your local machine and then deploy them to AWS. I always find working with AWS a real pain to set up, so if being able to develop and test locally means I can try to get apps working first and then worry about negotiating the AWS permissions and UI minefield if I ever want to try to actually run the thing on AWS, I may be more likely to play a bit with AWS SAM apps…

Tinkering With Neural Networks

On my to do list is writing some updated teaching material on neural networks. As part of the “research” process, I keep meaning to give some of the more popular deep learning models – PyTorch, Tensorflow and ConvNet, perhaps – a spin, to see if we can simplify them to an introductory teaching level.

A first thing to do is evaluate the various playgrounds and demos, such as the Tensorflow NeuralNetwork Playground:

Another, possibly more useful, approach might be to start with some of the “train a neural network in your browser” Javascript libraries and see how easy it is to put together simple web app demos using just those libraries.

For example, ConvNetJS:

Or deeplearn.js:

(I should probably also look at Dale Lane’s Machine Learning For Kids site, that uses Scratch to build simple ML powered apps.)

LearnR R/Shiny Tutorial Builder

The LearnR package is a new-ish package for developing interactive tutorials in RStudio. (Hmmm… I wonder if the framework lets you write tutorial code using language kernels other than R?)

Just as Jupyter notebooks provided a new sort of interactive environment that I don’t think we have even begun to explore properly yet as an interactive teaching and learning medium, I think this could take a fair bit of playing with to get a proper feel for how to do things most effectively. Which isn’t going to happen, of course.

Plotly Dash

“Reactive Web Apps for Python”, apparently… This perhaps isn’t so much of interest as an ed tech, but it’s something I keep meaning to play with as a possible Python/Jupyter alternative to R/Shiny.


Should play with these, can’t be a****d…

Jupyter / IPython Notebook Textbook Companions

I still remember the first time I was introduced to Jupyter (then IPython) notebooks – a demonstration at the back of a large lecture room by Alfred Essa of his “Rwandan Tragedy” notebook:

(I think this was a OpenEd12 in Vancouver (“Beyond Content”), back in the days when I used to blag entry to conferences, somehow or other, so this would date it to October 2012…).

I don’t recall offhand what my immediate reaction was (I’d like to think it was unbridled enthusiasm, but I’m not sure I completely grokked it…). A scan of my laptop (commissioned over summer 2013?) shows I have notebook files dating back to at least December 2013 (Jan 2014 on Github gists), at which point I seemed to really start playing…

In that period of time, I suspect things had moved on quite a bit since seeing the Rwanda demo, such as the embedding of output charts in the notebooks. (Of course, now you can embed, and generate embeddings, of pretty much anything, as well as being able to easily add in interactive widgets into a notebook to control embedded interactives using code defined in the same notebook…)

So it seems it took me some time to starting to explore them. But when I did start to use them, there was no going back…

I also remember meeting Alistair Willis in the OU Berrill café, mooting plans for the course that was to become TM351 (the first module team meeting was October 2013, perhaps?) , and at some point the idea of using IPython notebooks for the course came up, though I’m not sure I had any experience of using the notebooks – just that they seemed like something worth exploring…

Alistair quickly became a fellow early believer in the notebooks, and since then, the FutureLearn course Learn to Code for Data Analysis, led by Michel Wermelinger, has also used them.

But that, to my knowledge, and to my shame, is as far as it’s gone in the OU.

The new first year equivalent introductory computing courses use Scratch (I tried to argue for BlockPy, but the decision had been made to go with what had been used before….) and IDLE, and whilst there was some talk of Python coding in the new level 2 and/or level 3 Engineering courses, I’m not sure how that’s progressed.

I’ve no idea what OUr Science courses are up to – or Maths courses. Or courses that use statistics. Or interactive maps.

Anyway, over the last few years I’ve come to live in Jupyter notebooks – they’re great for trying stuff out, keeping records of play and experiments in a way that the interactive command line isn’t (even if you do save all your history files), and can be used for sharing complete, worked, and working recipes.

As my own timeline suggests, being aware of the notebooks and actually buying into them, takes – using them. Which is maybe one reason why adoption in the OU has been slow: fear of the new.

Which is a shame – because there’s a great ecosystem developing around the use of notebooks.

For example, yesterday, whilst search for “tractive effort gearing ipynb”, trying to find notebook examples of tractive effort curves (a phrase I picked up from Racecar Engineering mag – race engineers cut their teeth on the maths of finding gear ratios by calculating such curves, apparently…) I came across this notebook file:

Not rendered, but that’s easy enough using nbviewer:

which gives this:

Hmmm… a set of worked examples from a textbook. What textbook?, I wondered, and went up the URL path:

Interesting… So how active a project is this?

Hmm… really interesting… The examples may be pared down, but that means they can also be worked up. (It look like there’s a Github repo, which I guess you can fork and then make pull requests back to with worked examples for new books, or improved notebooks for current ones?) And they show how to go about solving a wide range of problems by scripting them. (This is one reason why I think computing folk don’t like notebooks. They aren’t really interested in folk using simple scripting to get simple things done. Which is also the reason why computing folk are the worst people to try to teach computing to the masses, who don’t know code can be used, a line at a time, to get things done, and who don’t see the point in being taught the stuff that the computing folk want to teach. Which is old school computing principles, rather than TECHNOLOGY THAT’S USEFUL TO FOLK.)

Whatever…. As a “digital first” organisation, I keep wondering why we’re not buying into Jupyter as a piece of freakin’ awesome edtech?! (By the by, a history of the IPython notebook project can be found here.)

If nothing else, I’d be really interested to see research from OUr digital innovation leaders why there’s no interest in adopting Jupyter notebooks in the institution?

See also: I Just Don’t Understand Why…