Category: Tinkering

IPython Magic for Controlling the V-REP Robot Simulator from Jupyter notebooks

Whilst exploring how we might be able to use Jupyter notebooks hooked up to the Coppelia Robotics V-REP robot simulator, it struck me that we needed a fair amount of boilerplate to get the simulator loaded with an appropriate scene file and a connection made to the simulator from the notebook, so that we could script the robot actions from the notebook.

My first approach to trying to simplify the presentation was to create some “self-documenting” notebooks that could be used to set up the necessary environment variables and import default classes and functions:

The %run cell magic loads and runs the referenced notebooks, which can also be inspected (and modified) by students.

(To try to minimise the risk of students introducing breaking changes into the imported notebooks, we could also lock the cells as read-only in the notebooks. Whilst this requires an extension to be installed to implement the read-only behaviour, the intention is that we distribute a customised Jupyter notebook environment to students.)

The loadSceneRelativeToClient() function loads the specified scene into the simulator. Note that this scene should contain a robot model. Once the connection to the simulator is made, a robot object can be instantiated using the connection details. The robot class should contain the definitions required to control the robot model in the loaded scene.

Setting up the connection to the simulator is a bit of a faff, and when code cell execution is stopped we can get an annoying KeyboardInterrupt report:

We can defend against the KeyboardInterrupt by wrapping the code execution in a try/except block:

try:
    with VRep.connect("127.0.0.1", 19997) as api:
        robot = PioneerP3DXL(api)
        while True:
            #do stuff
            pass
except KeyboardInterrupt:
    pass

But it struck me that it would be much nicer to be able to use some magic along the lines of the following, in which we set up the simulator with a scene, identify the robot we want to control, automatically connect to the simulator and then just run the robot control program:
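
A sketch of the intended usage might look something like this (the scene path and robot class name here are purely illustrative):

%%vrepsim '../scenes/OU_Pioneer.ttt' PioneerP3DX
#The magic handles the scene loading and connection set-up,
#and passes in robot, an instance of the specified robot class
import time

while True:
    robot.move_forwards()
    time.sleep(1)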

So here’s a first attempt at some IPython cell magic to do that:

from __future__ import print_function

from pyrep import VRep
from IPython.core.magic import (Magics, magics_class, line_magic,
                                cell_magic, line_cell_magic)
import shlex

# The class MUST call this class decorator at creation time
@magics_class
class Vrep_Sim(Magics):

    @cell_magic
    def vrepsim(self, line, cell):
        "V-REP magic"

        #Use shlex.split to handle quoted strings containing a space character
        loadSceneRelativeToClient(shlex.split(line)[0])

        #Get the robot class from the string
        robotclass=eval(shlex.split(line)[1])

        #Handle default IP address and port settings; grab from globals if set
        ip = self.shell.user_ns['vrep_ip'] if 'vrep_ip' in self.shell.user_ns else '127.0.0.1'
        port = self.shell.user_ns['vrep_port'] if 'vrep_port' in self.shell.user_ns else 19997

        #The try/except block exits from a keyboard interrupt cleanly
        try:
            #Create a connection to the simulator
            with VRep.connect(ip, port) as api:
                #Set the robot variable to an instance of the desired robot class
                robot = robotclass(api)
                #Execute the cell code - define robot commands as calls on: robot
                exec(cell)
        except KeyboardInterrupt:
            pass

    #@line_cell_magic
    @line_magic
    def vrep_robot_methods(self, line):
        "Show methods"
        robotclass = eval(line)
        methods = [method for method in dir(robotclass) if not method.startswith('_')]
        print('Methods available in {}:\n\t{}'.format(robotclass.__name__ , '\n\t'.join(methods)))

#Could install as magic separately
ip = get_ipython()
ip.register_magics(Vrep_Sim)
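
As the final comment suggests, the magics could also be installed from a separate module rather than being registered inline. A minimal sketch (assuming the class above is saved into a module called vrep_magic.py) would expose IPython’s standard load_ipython_extension() hook:

#vrep_magic.py
def load_ipython_extension(ipython):
    #Called by %load_ext vrep_magic; register the magics with the shell
    ipython.register_magics(Vrep_Sim)

The extension could then be loaded in a notebook with %load_ext vrep_magic.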

I’ve also added a bit of line magic to display the methods defined on a robot model class:
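
For example, something like the following (assuming a PioneerP3DX robot class is in scope):

%vrep_robot_methods PioneerP3DX

which prints the public (non-underscore-prefixed) method names defined on the class.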

The tension now is a pedagogical one: for example, should I be providing students with the robot model, or should they be building up the various control functions (.move_forwards(), .turn_left(), etc.) themselves?

I’m also wondering whether I should push the while True: component into the magic? On balance, I think students need to see it in their code block because getting them to think about control loops rather than one shot execution of command statements is something they often don’t get the first, or even second, time round. But for reducing clutter, it’d make for far cleaner cell block code.

Sharing Folders into VMs on Different Machines Using Dropbox, Google Drive, Microsoft OneDrive etc

Ever since I joined the OU, I’ve believed in trying to deliver distance education courses in an agile and responsive way, which is to say: making stuff up for students whilst the course is in presentation.

This is generally not done (by course/module teams at least) because the aim of most course/module teams is to prepare the course so thoroughly that it can “just” be presented to students.

Whatever.

I personally think we should try to improve the student experience of the course as it presents if we can by being responsive and reactive to student questions and issues.

So… TM351, the data management course that uses a VM, has started again, and issues / questions are already starting to hit the forums.

One of the questions – which I’d half noted but never really thought through in previous presentations (my not iterating/improving the course experience in, or between, previous presentations)  – related to sharing Jupyter notebooks across different machines using Google Drive (equally, Dropbox or Microsoft OneDrive).

The VirtualBox VM we use is fired up using the vagrant provisioner. A Vagrantfile defines various configuration settings – which ports are exposed by the VM, for example. By default, the contents of the folder in which vagrant is started up are shared into the VM. At the same time, vagrant creates a hidden .vagrant folder that contains state relating to the instance of that VM.

The set up on a single machine is something like this:

If a student wants to work across several machines, they need to share their working course files (Jupyter notebooks, and so on) but not the VM machine state. Which is to say, they need a set up more like the following:

For students working across several machines, it thus makes sense to have all project files in one folder and a separate .vagrant settings folder on each separate machine.

Checking the vagrant docs, it seems as if this is quite manageable using the synced folder configuration settings.

The default shares the current project folder (the one containing the Vagrantfile, and from which vagrant is run), which I’m guessing corresponds to a setting something like:

config.vm.synced_folder "./", "/vagrant"

By explicitly setting this parameter, we can decide how we want the mapping to occur. For example:

config.vm.synced_folder "/PATH/ON/HOST", "/vagrant"

allows you to specify the folder you want to share into the VM. Note that the /PATH/ON/HOST folder needs to be created before trying to share it.

To put the new shared directory into effect, reload and reprovision the VM. For example:

vagrant reload --provision

Student notebooks located in the notebooks folder of that shared directory should now be available in the VM. Furthermore, if the shared folder is itself inside a synced folder (for example, a Dropbox, Google Drive or Microsoft OneDrive folder), it should be available wherever that folder is synced to.

For example, on a Mac (where ~ is an alias to my home directory), I can create a directory in my Dropbox folder, ~/Dropbox/TM351VMshare, and then map this into the VM by adding the following line to the Vagrantfile:

config.vm.synced_folder "~/Dropbox/TM351VMshare", "/vagrant"

Note the possibility of slight confusion – the shared folder will not now be the folder from which vagrant is run (unless the folder you are running from is /PATH/ON/HOST).

Furthermore, the only thing that needs to be in the folder from which vagrant is run is the Vagrantfile and the hidden .vagrant folder that vagrant creates.

Fingers crossed this recipe works…;-)

Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API

A couple of days ago, I noticed that Chris Newell had shared the code for his ergast Formula 1 motor racing results database API, which I use heavily for my F1DataJunkie doodles. The code is a PHP application that pulls data from the ergast database, which is regularly shared as a MySQL database dump. So I raised an issue asking if a docker containerised version of the application was available, and Chris replied that there wasn’t, but if I wanted to look at creating one…?

…which I did. After a few false starts, I came up with the solution Chris has since pulled into his ergast-f1-api repo.

The pattern is quite a handy one, and I think reusable – I’ll give it a go for spinning up my own API to look up ONS Geography codes when I get a chance – so what’s involved?

The dockerised application is built from two components launched using docker-compose, a MySQL container and an Apache server configured with PHP5:

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"

web:
  build: ./lamp
  #image: nimmis/apache-php5
  ports:
    - '8000:80'
  volumes:
    - ./webroot:/var/www/html
    - ./php/api:/php/api
    - ./logs:/var/log/apache2
  links:
    - ergastdb:ergastdb

The server runs the application, makes requests from the linked database container, and returns the result.

As part of my f1datajunkie tinkering, I’d put together a Dockerfile for populating a MySQL database with Chris’ database dump some time ago, so I could reuse that directly.

Which meant all I had to do was get the application up and running… Chris’ original instructions around the API server application were to place the application files into “the root directory” and also add in an Apache .htaccess URL rewrite, which he provided.

Simple… or maybe not..?! Not being an Apache user, it took me a bit of time to actually get things up and running, so here are some of the gotchas that caught me out:

  • where’s the “root directory”?
  • where should the .htaccess file go?

Running a couple of simple tests identified the root directory as the root for the files served by the webserver, and a quick search revealed the .htaccess file should go in the same location.

But the redirects didn’t work…

As a test, I wrote a simple rewrite rule that should have redirected any url to the same test file – but no joy…

A bit more testing suggested I needed to enable the mod_rewrite module, which I did in the appropriate Dockerfile. But still no joy: using per-directory tweaks like the .htaccess file also requires allowing “Overrides” to the base Apache settings, which I did (via a crib) by rewriting the Apache config file, again via the Dockerfile that builds the server container image.

RUN sed -i '/<Directory \/var\/www\/>/,/<\/Directory>/ s/AllowOverride None/AllowOverride All/' /etc/apache2/apache2.conf

RUN a2enmod rewrite

Then the database connection didn’t work… But that was easily fixed: I’d forgotten to change the permissions in the appropriate application file.

Running the docker-compose file with the command docker-compose up --build -d fires up the two linked containers. The API is then available via a browser on http://localhost:8000/api/f1 using calls of the form http://localhost:8000/api/f1/2015.json.
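
As a quick test from a notebook, we can also call the API from Python – here’s a minimal sketch using the requests package (and assuming the response follows the same structure as the public ergast API):

import requests

#Request the 2015 season race schedule from the local ergast API server
url = 'http://localhost:8000/api/f1/2015.json'
data = requests.get(url).json()

#The JSON response mirrors the public ergast API structure
races = data['MRData']['RaceTable']['Races']
print(len(races), races[0]['raceName'])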

So now I have a minimal working example of a PHP application powered by a MySQL database driving a simple web API. Which I can crib from for my own simple APIs…

As to how it could be improved, there are a couple of obvious things:

  • the containers are intended for personal, local use: the database is accessed as root and should be properly secured;
  • I haven’t got a pattern for updating the database when a new database image is released. The current workaround would be to destroy the containers, and the database volume, then rebuild the images and the containers.

PS to use the local ergast API server from R with my ergastR package, I needed to tweak the package to allow me to specify the path to the API server. I’ll post details of how I did that later…

Simple Text Analysis Using Python – Identifying Named Entities, Tagging, Fuzzy String Matching and Topic Modelling

Text processing is not really my thing, but here’s a round-up of some basic recipes that allow you to get started with some quick’n’dirty tricks for identifying named entities in a document, and tagging entities in documents.

In this post, I’ll briefly review some getting started code for:

  • performing simple entity extraction from a text; for example, when presented with a text document, label it with named entities (people, places, organisations); entity extraction is typically based on statistical models that rely on document features such as correct capitalisation of names to work correctly;
  • tagging documents that contain exact matches of specified terms: in this case, we have a list of specific text strings we are interested in (for example, names of people or companies) and we want to know if there are exact matches in the text document and where those matches occur in the document;
  • partial and fuzzy string matching of specified entities in a text: in this case, we may want to know whether something resembling a specified text string occurs in the document (for example, misspellings of a name);
  • topic modelling: the identification, using statistical models, of “topic terms” that appear across a set of documents.

You can find a gist containing a notebook that summarises the code here.

Simple named entity recognition

spaCy is a natural language processing library for Python that includes a basic model capable of recognising (ish!) names of people, places and organisations, as well as dates and financial amounts.

According to the spaCy entity recognition documentation, the built in model recognises the following types of entity:

  • PERSON People, including fictional.
  • NORP Nationalities or religious or political groups.
  • FACILITY Buildings, airports, highways, bridges, etc.
  • ORG Companies, agencies, institutions, etc.
  • GPE Countries, cities, states. (That is, Geo-Political Entities.)
  • LOC Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT Objects, vehicles, foods, etc. (Not services.)
  • EVENT Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART Titles of books, songs, etc.
  • LANGUAGE Any named language.
  • LAW A legislation related entity(?)

Quantities are also recognised:

  • DATE Absolute or relative dates or periods.
  • TIME Times smaller than a day.
  • PERCENT Percentage, including “%”.
  • MONEY Monetary values, including unit.
  • QUANTITY Measurements, as of weight or distance.
  • ORDINAL “first”, “second”, etc.
  • CARDINAL Numerals that do not fall under another type.

Custom models can also be trained, but this requires annotated training documents.

#!pip3 install spacy
from spacy.en import English
parser = English()
example='''
That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future relationship with the single market and customs union; and calls on the Government to intervene and work with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent further job losses across Nestlé.
'''
#Code "borrowed" from somewhere?!
def entities(example, show=False):
    if show: print(example)
    parsedEx = parser(example)

    print("-------------- entities only ---------------")
    # if you just want the entities and nothing else, you can access the parsed example's ents property like this:
    ents = list(parsedEx.ents)
    tags={}
    for entity in ents:
        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
        term=' '.join(t.orth_ for t in entity)
        if term not in tags:
            tags[term]=[(entity.label, entity.label_)]
        else:
            tags[term].append((entity.label, entity.label_))
    print(tags)
entities(example)
-------------- entities only ---------------
{'House': [(380, 'ORG')], '300': [(393, 'CARDINAL')], 'Nestlé': [(380, 'ORG')], '\n York , Fawdon': [(381, 'GPE')], 'Halifax': [(381, 'GPE')], 'Girvan': [(381, 'GPE')], 'the Blue Riband': [(380, 'ORG')], 'Poland': [(381, 'GPE')], '\n': [(381, 'GPE'), (381, 'GPE')], 'the first three months of 2017': [(387, 'DATE')], '£ 21 billion': [(390, 'MONEY')], '0.4 per': [(390, 'MONEY')], 'the same period in 2016': [(387, 'DATE')], '156': [(393, 'CARDINAL')], 'York': [(381, 'GPE')], '\n the': [(381, 'GPE')], 'six': [(393, 'CARDINAL')], '2,000': [(393, 'CARDINAL')], 'the South East': [(382, 'LOC')], '110': [(393, 'CARDINAL')], 'Fawdon': [(381, 'GPE')], 'Newcastle': [(380, 'ORG')], 'a month of': [(387, 'DATE')], 'Article 50': [(21153, 'LAW')], 'EU': [(380, 'ORG')], 'UK': [(381, 'GPE')], 'GMB': [(380, 'ORG')], 'Unite': [(381, 'GPE')]}
q= "Bob Smith was in the Houses of Parliament the other day"
entities(q)
-------------- entities only ---------------
{'Bob Smith': [(377, 'PERSON')]}

Note that the way that models are trained typically relies on cues from the correct capitalisation of named entities.

entities(q.lower())
-------------- entities only ---------------
{}

polyglot

The polyglot package provides a simplistic, and quite slow, tagger, supporting limited recognition of Locations (I-LOC), Organizations (I-ORG) and Persons (I-PER).

#!pip3 install polyglot

##Mac ??
#!brew install icu4c
#I found I needed: pip3 install pyicu, pycld2, morfessor
##Linux
#apt-get install libicu-dev
!polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
from polyglot.text import Text

text = Text(example)
text.entities
[I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Halifax']),
 I-LOC(['Girvan']),
 I-LOC(['Poland']),
 I-PER(['Nestlé']),
 I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Newcastle']),
 I-ORG(['EU']),
 I-ORG(['EU']),
 I-ORG(['Government']),
 I-ORG(['GMB']),
 I-LOC(['Nestlé'])]
Text(q).entities
[I-PER(['Bob', 'Smith'])]

Partial Matching Specific Entities

Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs’ names, or a list of organisations, or subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.

To do this, we can search a text for strings that exactly match any of the specified terms or where any of the specified terms match part of a longer string in the text.

Naive implementations can take a significant time to find multiple strings within a text, but the Aho-Corasick algorithm will efficiently match a large set of key values within a particular text.

## The following recipe was hinted at via @pudo

#!pip3 install pyahocorasick
#https://github.com/alephdata/aleph/blob/master/aleph/analyze/corasick_entity.py

First, construct an automaton that identifies the terms you want to detect in the target text.

from ahocorasick import Automaton

A=Automaton()
A.add_word("Europe",('VOCAB','Europe'))
A.add_word("European Union",('VOCAB','European Union'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson'))
A.add_word("Boris",('PERSON','Boris Johnson'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson (LC)'))

A.make_automaton()
q2='Boris Johnson went off to Europe to complain about the European Union'
for item in A.iter(q2):
    print(item, q2[:item[0]+1])
(4, ('PERSON', 'Boris Johnson')) Boris
(12, ('PERSON', 'Boris Johnson')) Boris Johnson
(31, ('VOCAB', 'Europe')) Boris Johnson went off to Europe
(60, ('VOCAB', 'Europe')) Boris Johnson went off to Europe to complain about the Europe
(68, ('VOCAB', 'European Union')) Boris Johnson went off to Europe to complain about the European Union

Once again, case is important.

q2l = q2.lower()
for item in A.iter(q2l):
    print(item, q2l[:item[0]+1])
(12, ('PERSON', 'Boris Johnson (LC)')) boris johnson

We can tweak the automata patterns to capture the length of the string match term, so we can annotate the text with matches more exactly:

A=Automaton()
A.add_word("Europe",(('VOCAB', len("Europe")),'Europe'))
A.add_word("European Union",(('VOCAB', len("European Union")),'European Union'))
A.add_word("Boris Johnson",(('PERSON', len("Boris Johnson")),'Boris Johnson'))
A.add_word("Boris",(('PERSON', len("Boris")),'Boris Johnson'))

A.make_automaton()
for item in A.iter(q2):
    start=item[0]-item[1][0][1]+1
    end=item[0]+1
    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))
(4, (('PERSON', 5), 'Boris Johnson')) *Boris* Jo
(12, (('PERSON', 13), 'Boris Johnson')) *Boris Johnson* we
(31, (('VOCAB', 6), 'Europe')) to *Europe* to
(60, (('VOCAB', 6), 'Europe')) he *Europe*an 
(68, (('VOCAB', 14), 'European Union')) he *European Union*

Fuzzy String Matching

Whilst the Aho-Corasick approach will return hits for strings in the text that partially match the exact match key terms, sometimes we want to know whether there are terms in a text that almost match terms in a specific set of terms.

Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a fuzzy way with entities in a specified list.

fuzzyset

The python fuzzyset package will try to match a specified string to similar strings in a list of target strings, returning a single item from a specified target list that best matches the provided term.

For example, if we extract the name Boris Johnstone in a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.

A confidence value expresses the degree of match to terms in the fuzzy match set list.

import fuzzyset

fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)

#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
([(0.8666666666666667, 'Boris Johnson')],
 [(0.8333333333333334, 'Diane Abbott')],
 [(0.23076923076923073, 'Diane Abbott')])

fuzzywuzzy

If we want to try to find a fuzzy match for a term within a text, we can use the python fuzzywuzzy library. Once again, we specify a list of target items we want to try to match against.

from fuzzywuzzy import process
from fuzzywuzzy import fuzz
terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']

q= "Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day"
process.extract(q,terms)
[('Houses of Parliament', 90), ('Diane Abbott', 90), ('Boris Johnson', 86)]

By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:

process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)
[('Houses of Parliament', 90), ('Boris Johnson', 85)]

A range of fuzzy match scoring algorithms are supported (a quick comparison of a couple of them follows the list below):

  • WRatio – measure of the sequences’ similarity between 0 and 100, using different algorithms
  • QRatio – Quick ratio comparison between two strings
  • UWRatio – a measure of the sequences’ similarity between 0 and 100, using different algorithms. Same as WRatio but preserving unicode
  • UQRatio – Unicode quick ratio
  • ratio
  • partial_ratio – ratio of the most similar substring as a number between 0 and 100
  • token_sort_ratio – a measure of the sequences’ similarity between 0 and 100, but sorting the tokens before comparing
  • partial_token_set_ratio
  • partial_token_sort_ratio – ratio of the most similar substring as a number between 0 and 100, but sorting the tokens before comparing
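
To get a feel for how a few of these differ, we can call the scorers directly on a pair of strings (no output shown – the exact scores depend on the fuzzywuzzy version):

from fuzzywuzzy import fuzz

pair = ('Boris Johnson', 'Johnstone, Boris')
#Simple ratio is sensitive to word order and extra characters
print(fuzz.ratio(*pair))
#partial_ratio scores the best matching substring
print(fuzz.partial_ratio(*pair))
#token_sort_ratio sorts the tokens before comparing, so word order doesn't matter
print(fuzz.token_sort_ratio(*pair))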

More useful, perhaps, is to return only items that match above a particular confidence level:

process.extractBests(q,terms,score_cutoff=90)
[('Houses of Parliament', 90), ('Diane Abbott', 90)]

However, one problem with the fuzzywuzzy matcher is that it doesn’t tell us where in the supplied text string the match occurred, or what string in the text was matched.

The fuzzywuzzy package can also be used to try to deduplicate a list of items, returning the longest item in the duplicate list. (It might be more useful if this is optionally the first item in the original list?)

names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']
process.dedupe(names, threshold=80)
['Joanna Lumley', 'Boris Johnstone', 'Diane Abbott']

It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:

import hashlib

clusters={}
fuzzed=[]
for t in names:
    #Use a different name to the fuzzyset package imported earlier
    matchset=process.extractBests(t,names,score_cutoff=85)
    #Generate a key based on the sorted members of the set
    keyvals=sorted(set([x[0] for x in matchset]),key=lambda x:names.index(x),reverse=False)
    keytxt=''.join(keyvals)
    #hashlib requires bytes, so encode the key string before hashing
    key=hashlib.md5(keytxt.encode('utf-8')).hexdigest()
    if len(keyvals)>1 and key not in fuzzed:
        clusters[key]=sorted(set([x for x in matchset]),key=lambda x:names.index(x[0]),reverse=False)
        fuzzed.append(key)
for cluster in clusters:
    print(clusters[cluster])
[('Diane Abbott', 100), ('Diana Abbot', 87)]
[('Boris Johnson', 100), ('Boris Johnstone', 93), ('Boris Johnston', 96)]

OpenRefine Clustering

As well as running as a browser accessed application, OpenRefine also runs as a service that can be accessed from Python using the refine-client.py client library.

In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items.

#!pip install git+https://github.com/PaulMakepeace/refine-client-py.git
#NOTE - this requires a python 2 kernel
#Initialise the connection to the server using default or environment variable defined server settings
#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', '127.0.0.1'))
#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))
from google.refine import refine, facet
server = refine.RefineServer()
orefine = refine.Refine(server)
#Create an example CSV file to load into a test OpenRefine project
project_file = 'simpledemo.csv'
with open(project_file,'w') as f:
    for t in ['Name']+names+['Boris Johnstone']:
        f.write(t+ '\n')
!cat {project_file}
Name
Diane Abbott
Boris Johnson
Boris Johnstone
Diana Abbot
Boris Johnston
Joanna Lumley
Boris Johnstone
p=orefine.new_project(project_file=project_file)
p.columns
[u'Name']

OpenRefine supports a range of clustering functions:

- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic
- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}
- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}
clusters=p.compute_clusters('Name',clusterer_type='binning',function='cologne-phonetic')
for cluster in clusters:
    print(cluster)
[{'count': 1, 'value': u'Diana Abbot'}, {'count': 1, 'value': u'Diane Abbott'}]
[{'count': 2, 'value': u'Boris Johnstone'}, {'count': 1, 'value': u'Boris Johnston'}]

Topic Models

Topic models are statistical models that attempt to identify the different “topics” that occur across a set of documents.

Several python libraries provide a simple interface for the generation of topic models from text contained in multiple documents.

gensim

#!pip3 install gensim
#https://github.com/sgsinclair/alta/blob/e5bc94f7898b3bcaf872069f164bc6534769925b/ipynb/TopicModelling.ipynb
from gensim import corpora, models

def get_lda_from_lists_of_words(lists_of_words, **kwargs):
    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers
    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document
    tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency
    corpus_tfidf = tfidf[corpus]
    kwargs["id2word"] = dictionary # set the dictionary
    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling

def print_top_terms(lda, num_terms=10):
    txt=[]
    num_terms=min([num_terms,lda.num_topics])
    for i in range(0, num_terms):
        terms = [term for term,val in lda.show_topic(i,num_terms)]
        txt.append("\t - top {} terms for topic #{}: {}".format(num_terms,i,' '.join(terms)))
    return '\n'.join(txt)

To start with, let’s create a list of dummy documents and then generate word lists for each document.

docs=['The banks still have a lot to answer for the financial crisis.',
     'This MP and that Member of Parliament were both active in the debate.',
     'The companies that work in finance need to be responsible.',
     'There is a responsibility incumbent on all participants for high quality debate in Parliament.',
     'Corporate finance is a big responsibility.']

#Create lists of words from the text in each document
from nltk.tokenize import word_tokenize
docs = [ word_tokenize(doc.lower()) for doc in docs ]

#Remove stop words from the wordlists
from nltk.corpus import stopwords
docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]

Now we can generate the topic models from the list of word lists.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: parliament debate active
     - top 3 terms for topic #1: responsible work need
     - top 3 terms for topic #2: corporate big responsibility

The model is randomised – if we run it again we are likely to get a different result.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: finance corporate responsibility
     - top 3 terms for topic #1: participants quality high
     - top 3 terms for topic #2: member mp active
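
If reproducible topics are needed, recent versions of gensim also accept a random_state seed, which the helper function above simply forwards to LdaModel via its **kwargs (a minimal sketch – check the parameter is supported in your installed gensim version):

#Fix the random seed so that repeated runs return the same topics
topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20, random_state=42)
print( print_top_terms(topicsLda))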

Python “Natural Time Periods” Package

Getting on for a year ago, I posted a recipe for generating “Natural Language” Time Periods in Python. At the time, a couple of people asked whether it was packaged – and it wasn’t…

It’s still not on pip, but I have since made a package repo for it: python-natural-time-periods.

Install using: pip3 install --force-reinstall --no-deps --upgrade git+https://github.com/psychemedia/python-natural-time-periods.git

The following period functions are defined:

  • today(), yesterday(), tomorrow()
  • last_week(), this_week(), next_week(), later_this_week(), earlier_this_week()
  • last_month(), next_month(), this_month(), earlier_this_month(), later_this_month()
  • day_lastweek(), day_thisweek(), day_nextweek()

Here’s an example of how to use it:

import natural_time_periods as ntpd

ntpd.today()
>>> datetime.date(2017, 8, 9)

ntpd.last_week()
>>> (datetime.date(2017, 7, 31), datetime.date(2017, 8, 6))

ntpd.later_this_month()
>>> (datetime.date(2017, 8, 10), datetime.date(2017, 8, 31))

ntpd.day_lastweek(ntpd.MON)
>>> datetime.date(2017, 7, 31)

ntpd.day_lastweek(ntpd.MON, iso=True)
>>> '2017-07-31'

NHS Digital Organisation Data Service (ODS) Python / Pandas Data Loader

One of the nice things about the Python pandas data analysis package is that there is a small – but growing – amount of support for downloading data from third party sources directly as a pandas dataframe using the pandas-datareader.

So I thought I’d have a go at producing a package inspired by the pandas wrapper for the World Bank indicators API for downloading administrative data from the NHS Digital Organisation Data Service. You can find the package here: python-pandas-datareader-NHSDigital.

There are also examples of how to use the package here: python-pandas-datareader-NHSDigital demo notebook.

When Identifiers Don’t Align – NHS Digital GP Practice Codes and CQC Location IDs

One of the nice things about NHS Digital datasets is that there is a consistent use of identifier codes across multiple datasets. For example, GP Practice Codes are used to index particular GP practices across multiple datasets listed on both the GP and GP practice related data and General Practice Data Hub directory pages.

Information about GPs is also recorded by the CQC, who publish quality ratings across a wide range of health and social care providers. One of the nice things about the CQC data is that it also contains information about corporate groupings (and Companies House company numbers) and “Brands” with which a particular location is associated, which means you can start to explore the make up of the larger commercial providers.

Unfortunately, the identifier scheme used by the CQC is not the same as the one used by NHS Digital. This wouldn’t present much of a hurdle if a lookup table were available that mapped the codes for GP practices rated by the CQC against the NHS Digital codes, but such a lookup table doesn’t appear to exist – or at least, is not easily discoverable.

So if we do want to join the CQC and NHS Digital datasets, what are we to do?

One approach is to look for common cribs across both datasets to bring them into partial alignment, and then try to do exact matching within the nearly aligned sets. For example, both datasets include postcode data, so if we match on postcode, we can then try to find a higher level of agreement by trying to exactly match location names sharing the same postcode.

This gets us so far, but exact string matching is likely to return a high degree of false negatives (i.e. unmatched items that should be matched). For example, it’s easy enough for us to assume that THE LINTHORPE SURGERY and LINTHORPE SURGERY are the same, but they aren’t exact matches. We could improve the likelihood of matching by removing common stopwords, and stopwords sensitive to this domain – THE, for example, or “CENTRE” – but partial or fuzzy matching techniques are likely to work better still, albeit with the risk of introducing false positive matches (that is, strings that are identified as matching at a particular confidence level but that we would probably rule out as a match, for example HIRSEL MEDICAL CENTRE and KINGS MEDICAL CENTRE).
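
As a sketch of the general approach (using toy stand-in dataframes and hypothetical column names – the real datasets need a fair bit of cleaning first), we might merge the two datasets on postcode and then score the name pairs within each candidate match using fuzzywuzzy:

import pandas as pd
from fuzzywuzzy import fuzz

#Toy stand-ins for the CQC and NHS Digital datasets
cqc = pd.DataFrame({'postcode': ['AB1 2CD'], 'name': ['THE LINTHORPE SURGERY']})
nhs = pd.DataFrame({'postcode': ['AB1 2CD'], 'name': ['LINTHORPE SURGERY']})

#Bring the datasets into partial alignment by matching on postcode
candidates = pd.merge(cqc, nhs, on='postcode', suffixes=('_cqc', '_nhs'))

#Score the name pairs within each postcode using a fuzzy matcher;
#token_set_ratio ignores extra tokens such as a leading THE
candidates['score'] = candidates.apply(
    lambda row: fuzz.token_set_ratio(row['name_cqc'], row['name_nhs']), axis=1)

#Treat pairs scoring above a threshold as probable matches
print(candidates[candidates['score'] >= 90])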

Anyway, here’s a quick sketch of how we might start to go about reconciling the datasets – comments appreciated about how to improve it further either here or in the repo issues: CQC and NHS Code Reconciliation.ipynb

Fragment – Diagramming the Structure of a Python dict Using BlockDiag, & Some Quick Reflections on Computing Education

As a throwaway diagram in a piece of teaching material I wanted to visualise the structure of a Python dict. One tool I use for generating simple block diagrams is BlockDiag, which uses a simple text notation for describing box and arrow diagrams (edges are declared in the form "a" -> "b";).

So how can we get the key structure from a nested Python dict in that form? There may well be a Python method somewhere for grabbing this information, but it was just a quick coffee break puzzle to write a thing to grab that data and represent it as required:

def getChildren(d,t=None,root=None):
    if t is None: t=[]
    if root is None:
        for k in d:
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    else:
        for k in d:
            t.append('"{root}" -> "{k}";'.format(root=root,k=k))
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    return t

r={'a':{'b':1, 'c':{'d':2,'e':3}}}
print('{{ {graph} }}'.format(graph='\n'.join(getChildren(r)))) 

#------
{ "a" -> "b";
"a" -> "c";
"c" -> "d";
"c" -> "e"; }

There are probably better ways of doing it*, but that’s not necessarily the point. Which is a point I realised chatting to a colleague earlier today: I’m not that interested in the teaching of formal computing approaches as a way of training enterprise developers. Nor am I interested in the teaching of computing through contrived toy examples. What I’m far more interested in is helping students do without us; and students, at that, who have end-user computing needs that they want to be able to satisfy in whatever domain they end up in.

So, for example, here’s a tweaked version that only adds each edge once, so it can be run over a list of dicts without duplicating edges:
def getChildren(d,t=None,root=None):
    if t is None: t=[]
    if root is None:
        for k in d:
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    else:
        for k in d:
            s='"{root}" -> "{k}";'.format(root=root,k=k)
            if s not in t: t.append(s)
            if isinstance(d[k],dict):
                t=getChildren(d[k],t,k)
    return t
 
r={'a':{'b':1, 'c':{'d':2,'e':3}}}
l=[r,{'a':{'b':1, 'c':{'e':3}}}]
o=[]
for z in l:
    o=getChildren(z,o)
o

#['"a" -> "b";', '"a" -> "c";', '"c" -> "d";', '"c" -> "e";']

Which is to say, not (in the first instance) enterprise level, production quality code. It’s code to get stuff done. End-user application development code. Personal, disposable/throwaway, ad hoc productivity tool development. Scruffy code that lets you use bits of string and gaffer tape and chunks of other people’s code to solve a particular problem.

But that’s not to say the code has to stay ropey… In testing the first attempt at the above code, it lacked the guards that checked whether a variable was a dict, at which point it failed whenever a literal value was encountered. There may well be other things that are broken about it but I can fix those as they crop up (because I know the sort of output I expect to see*, and if I don’t get it, I can try to fix it.) I also had to go and look up how to include literal curly brackets in a python formatted string (double them up) for the print statement. But that’s okay. That’s just syntax… Knowing that I should be able to print the literal brackets was the important thing… And that’s all part of what I think a lot of our curriculum lacks – enthusing folk, making them curious, getting them to have expectations about what is and isn’t and should be possible**, and then being able to act on that.

* informal test driven end user software application development…?!;-)
** with some personal ethics about what may be possible but shouldn’t be pursued and should be lobbied against…

Using Jupyter Notebooks For Assessment – Export as Word (.docx) Extension

One of the things we still haven’t properly worked out in our Data management and analysis (TM351) course is how best to handle Jupyter notebook based assignments. The assignments are set using a notebook that describes the tasks to be completed, with students completing the tasks in that same notebook. We then need some mechanism for:

  • students to submit the assessment electronically;
  • markers mark assessments for their students: if the document contains a lot of OU text, it can be hard for the marker to locate the student text;
  • markers may provide on-script feedback; this means the marker needs to be able to edit the document and make changes/annotations.
  • markers return scripts to students;
  • students read feedback – so they need to be able to locate and distinguish the marker feedback within the document.

One Frankenstein process we tried was for students to save a Jupyter notebook file as a Markdown or HTML document and then convert it to a Microsoft Word document using pandoc.

This document could then be submitted and marked in a traditional way, with markers using comments and track changes to annotate the student script. Unfortunately, our original 32 bit VM meant we had to use an old version of pandoc, with the result that tabular data was not handled at all well in the conversion-to-Word process.

Updating to a 64 bit virtual machine means we can update pandoc, and the Word document conversion is now much smoother. However, the conversion process still requires students to export the notebook as an HTML document and then use pandoc to convert the HTML to the Microsoft Word .docx format. (The Jupyter nbconvert utility does not currently export to Word.)

So to make things a little easier, here’s my first attempt at a Download Jupyter Notebook as Word (.docx) extension to do just that. It makes use of the Jupyter notebook custom bundler extensions API, which allows you to add additional options to the notebook File -> Download menu. The code I used was also cribbed from the dashboards_bundlers package, which converts a notebook to a dashboard and then downloads it.

[There’s now a Github repo: innovationOUtside/nb_extension_wordexport]

One thing it doesn’t handle at the moment is things like embedded interactive maps. I’ve previously come up with a workaround for generating static images of interactive maps created using the folium package, by using selenium to render the map and grab a screenshot of it; I’m not sure if that would work in our headless VM, though? (One to try, I guess?) There’s also a related thread in the folium repo issue tracker.
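
For reference, the heart of the extension is a small module exposing the two hooks the bundler API expects – one to register the menu entry, one to generate and return the file. The following is a minimal sketch along those lines (assuming pypandoc is available alongside pandoc), rather than the exact code in the repo:

#wordexport/wordexport.py
import os
import nbformat
import pypandoc
from nbconvert import HTMLExporter

def _jupyter_bundlerextension_paths():
    #Register the bundler so it appears under File -> Download as
    return [{
        'name': 'wordexport',
        'label': 'MS Word (.docx)',
        'module_name': 'wordexport.wordexport',
        'group': 'download'
    }]

def bundle(handler, model):
    #Render the notebook content as an HTML document
    notebook_node = nbformat.from_dict(model['content'])
    html, _ = HTMLExporter().from_notebook_node(notebook_node)

    #Convert the HTML to .docx using pandoc; binary formats need an output file
    basename = os.path.splitext(model['name'])[0]
    outfile = basename + '.docx'
    pypandoc.convert_text(html, 'docx', format='html', outputfile=outfile)

    #Return the generated file to the browser as a download
    with open(outfile, 'rb') as f:
        handler.set_header('Content-Disposition',
                           'attachment; filename="{}"'.format(outfile))
        handler.set_header('Content-Type',
                           'application/vnd.openxmlformats-officedocument.wordprocessingml.document')
        handler.write(f.read())
    handler.finish()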

The above script is placed in a wordexport folder inside a package folder containing a simple setup.py script:

from setuptools import setup

setup(name='wordexport',
      version='0.0.1',
      description='Export Jupyter notebook as .docx file',
      author='Tony Hirst',
      author_email='tony.hirst@open.ac.uk',
      license='MIT',
      packages=['wordexport'],
      zip_safe=False)

The package can be installed and the extension enabled using a riff along the lines of the following command-line commands:

echo "...wordexport install..."
#Install the wordexport (.docx exporter) extension package
pip3 install --upgrade --force-reinstall ${THISDIR}/jupyter_custom_files/nbextensions/wordexport

#Enable the wordexport extension
jupyter bundlerextension enable --py wordexport.wordexport  --sys-prefix
echo "...wordexport done"

Restart the Jupyter server after enabling the extension, and the result should be a new MS Word (.docx) option in the notebook File -> Download menu option.

Querying Large CSV Files With Apache Drill

Via a post on the rud.is blog – Drilling Into CSVs — Teaser Trailer – I came across a handy looking Apache tool: Apache Drill. A Java powered service, Apache Drill allows you to query large CSV and JSON files (as well as a range of other data backends) using SQL, without any particular manipulation of the target data files. (The notes also suggest you can query directly over a set of CSV files (with the same columns?) in a directory, though I haven’t tried that yet…)

To give it a go, I downloaded Evan Odell’s Hansard dataset, which comes in as a CSV file at just over 3GB.

Installing Apache Drill, and running it from the command line – ./apache-drill-1.10.0/bin/drill-embedded – it was easy enough to start running queries from the off (Querying Plain Text Files):

SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv` LIMIT 3;

By default, queries from a CSV file ignore headers and treat all rows equally. Queries over particular columns can be executed by referring to numbered columns in the form COLUMNS[0], COLUMNS[1], etc. (Querying Plain Text Files). However, Bob Rudis’ blog hinted there was a way to configure the server to use the first row of a CSV file as a header row. In particular, the Data Source Plugin Configuration Basics docs page describes how the CSV data source configuration can be amended with the clauses "skipFirstLine": true, "extractHeader": true to allow CSV files to be queried with the header row intact.

The configuration file for the live server can be amended via a web page published by the running Apache Drill service, by default on localhost port 8047:

Updating the configuration means we can start to run named column queries:

The config files are actually saved to a temporary location – /tmp/drill. If the (updated) files are copied to a persistent location – mv /tmp/drill /my/configpath – and the drill-override.conf file updated with the setting drill.exec: {sys.store.provider.local.path="/my/configpath"}, the (updated) configuration files will in future be loaded from that location, rather than temporary default config files being created for each new server session (docs: Storage Plugin Registration).

Bob Rudis’ post also suggested that more efficient queries could be run by converting the CSV data file to a parquet data format, and then querying over that:

CREATE TABLE dfs.tmp.`/senti_post_v2.parquet` AS SELECT * FROM dfs.`/Users/ajh59/Documents/parlidata/senti_post_v2.csv`;

This creates a new parquet data folder /tmp/senti_post_v2.parquet. This can then be queried as for the CSV file:

SELECT gender, count(*) FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender;

…but with a significant speed up, on some queries at least:

To quit, !exit.

And finally, to make using the Apache Drill service easier to use from code, wrapper libraries are available for R – sergeant R package – and Python – pydrill package.
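
For example, a minimal pydrill sketch (assuming the Drill service is running locally and exposing its web API on the default port, 8047) looks something like this:

from pydrill.client import PyDrill

#Connect to the local Apache Drill web API
drill = PyDrill(host='localhost', port=8047)

#Run a query over the parquet version of the Hansard data
result = drill.query("SELECT gender, count(*) AS n FROM dfs.tmp.`/senti_post_v2.parquet` GROUP BY gender")

#Each result row is returned as a dict
for row in result:
    print(row)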