Category: Tinkering

IPython Magic for Controlling the V-REP Robot Simulator from Jupyter notebooks

Whilst exploring how we might be able to use Jupyter notebooks hooked up to the Coppelia Robotics V-REP robot simulator, it struck me that we needed a fair amount of boilerplate to get the simulator loaded with an appropriate scene file, and a connection made to the simulator from the notebook, so that we could script the robot actions from the notebook.

My first approach to simplifying the presentation was to create some “self-documenting” notebooks that could be used to set up the necessary environment variables and import default classes and functions:

The %run line magic loads and runs the referenced notebooks, which can also be inspected (and modified) by students.

(To try to minimise the risk of students introducing breaking changes into the imported notebooks, we could also lock the cells as read-only in the notebooks. Whilst this requires an extension to be installed to implement the read-only behaviour, the intention is that we distribute a customised Jupyter notebook environment to students.)

The loadSceneRelativeToClient() function loads the specified scene into the simulator. Note that this scene should contain a robot model. Once the connection to the simulator is made, a robot object can be instantiated using the connection details. The robot class should contain the definitions required to control the robot model in the loaded scene.
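
Something along the lines of the following sketch of the boilerplate involved – the notebook and scene file names here are purely illustrative, not the actual course files:

#Load in class and function definitions from a set-up notebook (hypothetical filename)
%run 'vrep_robot_setup.ipynb'

#Load a scene containing a robot model into the simulator (hypothetical scene file)
loadSceneRelativeToClient('../scenes/Pioneer.ttt')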

Setting up the connection to the simulator is a bit of a faff, and when code cell execution is stopped we can get an annoying KeyboardInterrupt report:

We can defend against the KeyboardInterrupt by wrapping the code execution in a try/except block:

try:
    with VRep.connect("127.0.0.1", 19997) as api:
        robot = PioneerP3DXL(api)
        while True:
            #do stuff
            pass
except KeyboardInterrupt:
    pass

But it struck me that it would be much nicer to be able to use some magic along the lines of the following, in which we set up the simulator with a scene, identify the robot we want to control, automatically connect to the simulator and then just run the robot control program:
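
Something like the following sketch, for example – the scene path is illustrative, and the control loop assumes the robot class provides a .move_forwards() method:

%%vrepsim '../scenes/Pioneer.ttt' PioneerP3DXL
#The magic creates the robot object for us; we just write the control program against it
while True:
    robot.move_forwards()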

So here’s a first attempt at some IPython cell magic to do that:

from __future__ import print_function  #__future__ imports must come before any other imports

from pyrep import VRep
from IPython.core.magic import (Magics, magics_class, line_magic,
                                cell_magic, line_cell_magic)
import shlex

# The class MUST call this class decorator at creation time
@magics_class
class Vrep_Sim(Magics):

    @cell_magic
    def vrepsim(self, line, cell):
        "V-REP magic"

        #Use shlex.split to handle quoted strings containing a space character
        loadSceneRelativeToClient(shlex.split(line)[0])

        #Get the robot class from the string
        robotclass=eval(shlex.split(line)[1])

        #Handle default IP address and port settings; grab from globals if set
        ip = self.shell.user_ns['vrep_ip'] if 'vrep_ip' in self.shell.user_ns else '127.0.0.1'
        port = self.shell.user_ns['vrep_port'] if 'vrep_port' in self.shell.user_ns else 19997

        #The try/except block lets us exit cleanly from a keyboard interrupt
        try:
            #Create a connection to the simulator
            with VRep.connect(ip, port) as api:
                #Set the robot variable to an instance of the desired robot class
                robot = robotclass(api)
                #Execute the cell code - define robot commands as calls on: robot
                exec(cell)
        except KeyboardInterrupt:
            pass

    #@line_cell_magic
    @line_magic
    def vrep_robot_methods(self, line):
        "Show methods"
        robotclass = eval(line)
        methods = [method for method in dir(robotclass) if not method.startswith('_')]
        print('Methods available in {}:\n\t{}'.format(robotclass.__name__ , '\n\t'.join(methods)))

#Could install as magic separately
ip = get_ipython()
ip.register_magics(Vrep_Sim)

I’ve also added a bit of line magic to display the methods defined on a robot model class:
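
For example, assuming the PioneerP3DXL class is in scope:

%vrep_robot_methods PioneerP3DXL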

The tension now is a pedagogical one: for example, should I be providing students with the robot model, or should they be building up the various control functions (.move_forwards(), .turn_left(), etc.) themselves?

I’m also wondering whether I should push the while True: component into the magic? On balance, I think students need to see it in their code block because getting them to think about control loops rather than one shot execution of command statements is something they often don’t get the first, or even second, time round. But for reducing clutter, it’d make for far cleaner cell block code.

Sharing Folders into VMs on Different Machines Using Dropbox, Google Drive, Microsoft OneDrive etc

Ever since I joined the OU, I’ve believed in trying to deliver distance education courses in an agile and responsive way, which is to say: making stuff up for students whilst the course is in presentation.

This is generally not done (by course/module teams at least) because the aim of most course/module teams is to prepare the course so thoroughly that it can “just” be presented to students.

Whatever.

I personally think we should try to improve the student experience of the course as it presents if we can by being responsive and reactive to student questions and issues.

So… TM351, the data management course that uses a VM, has started again, and issues / questions are already starting to hit the forums.

One of the questions – one I’d half noted but never really thought through in previous presentations (my not having iterated on, or improved, the course experience in, or between, those presentations) – related to sharing Jupyter notebooks across different machines using Google Drive (equally, Dropbox or Microsoft OneDrive).

The VirtualBox VM we use is fired up using the vagrant provisioner. A Vagrantfile defines various configuration settings – which ports are exposed by the VM, for example. By default, the contents of the folder in which vagrant is started up are shared into the VM. At the same time, vagrant creates a hidden .vagrant folder that contains state relating to the instance of that VM.

The set up on a single machine is something like this:

If a student wants to work across several machines, they need to share their working course files (Jupyter notebooks, and so on) but not the VM machine state. Which is to say, they need a set up more like the following:

For students working across several machines, it thus makes sense to have all project files in one folder and a separate .vagrant settings folder on each separate machine.

Checking the vagrant docs, it seems as if this is quite manageable using the synced folder configuration settings.

The default shares the current project folder (the one containing the Vagrantfile, and from which vagrant is run), which I’m guessing corresponds to a setting something like:

config.vm.synced_folder "./", "/vagrant"

By explicitly setting this parameter, we can decide how we want the mapping to occur. For example:

config.vm.synced_folder "/PATH/ON/HOST", "/vagrant"

allows you to specify the folder you want to share into the VM. Note that the /PATH/ON/HOST folder needs to be created before trying to share it.

To put the new shared directory into effect, reload and reprovision the VM. For example:

vagrant reload --provision

Student notebooks located in the notebooks folder of that shared directory should now be available in the VM. Furthermore, if the shared folder is itself inside a cloud-synced folder (for example, a synced Dropbox, Google Drive or Microsoft OneDrive folder), it should be available wherever that folder is synced to.

For example, on a Mac (where ~ is an alias to my home directory), I can create a directory in my dropbox folder, ~/Dropbox/TM351VMshare, and then map this into the VM by adding the following line to the Vagrantfile:

config.vm.synced_folder "~/Dropbox/TM351VMshare", "/vagrant"

Note the possibility of slight confusion – the shared folder will no longer be the folder from which vagrant is run (unless the folder you are running from is /PATH/ON/HOST).

Furthermore, the only things that need to be in the folder from which vagrant is run are the Vagrantfile and the hidden .vagrant folder that vagrant creates.

Fingers crossed this recipe works…;-)

Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API

A couple of days ago, I noticed that Chris Newell had shared the code for his ergast Formula 1 motor racing results database that I use heavily for my F1DataJunkie doodles. The code is a PHP application that pulls data from the ergast database, which is regularly shared as a MySQL database dump. So I raised an issue asking if a docker containerised version of the application was available, and Chris replied that there wasn’t, but if I wanted to look at creating one…?

…which I did. After a few false starts, I came up with the solution Chris has since pulled into his ergast-f1-api repo.

The pattern is quite a handy one, and I think reusable – I’ll give it a go for spinning up my own API to look up ONS Geography codes when I get a chance – so what’s involved?

The dockerised application is built from two components launched using docker-compose – a MySQL container and an Apache server configured with PHP 5:

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"

web:
  build: ./lamp
  #image: nimmis/apache-php5
  ports:
    - '8000:80'
  volumes:
    - ./webroot:/var/www/html
    - ./php/api:/php/api
    - ./logs:/var/log/apache2
  links:
    - ergastdb:ergastdb

The web server runs the application, makes queries against the linked database container, and returns the result.

As part of my f1datajunkie tinkering, I’d put together a Dockerfile for populating a MySQL database with Chris’ database dump some time ago, so I could reuse that directly.

Which meant all I had to do was get the application up and running… Chris’ original instructions around the API server application were to place the application files into “the root directory” and also add in an Apache .htaccess URL rewrite, which he provided.

Simple… or maybe not..?! Not being an Apache user, it took me a bit of time to actually get things up and running, so here are some of the gotchas that caught me out:

  • where’s the “root directory”?
  • where should the .htaccess file go?

Running a couple of simple tests identified the root directory as the root for the files served by the webserver, and a quick search revealed the .htaccess file should go in the same location.

But the redirects didn’t work…

As a test, I wrote a simple rewrite rule that should have redirected any url to the same test file – but no joy…

A bit more testing suggested I needed to enable the mod_rewrite plugin, which I did in the appropriate Dockerfile. But still no joy – because using per-directory settings files like .htaccess also requires allowing “Overrides” to the base Apache settings, which I did (via a crib) by rewriting the Apache config file, again via the Dockerfile that builds the server container image.

RUN sed -i '/<Directory \/var\/www\/>/,/<\/Directory>/ s/AllowOverride None/AllowOverride All/' /etc/apache2/apache2.conf

RUN a2enmod rewrite

Then the database connection didn’t work… But that was easily fixed: I’d forgotten to change the permissions in the appropriate application file.

Running the docker-compose file with the command docker-compose up --build -d now fires up the two linked containers. The API is then available via a browser on http://localhost:8000/api/f1 using calls of the form http://localhost:8000/api/f1/2015.json.
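
As a quick check from Python, something like the following sketch should pull back a season’s results (assuming the containers are up and the requests package is installed):

import requests

#Request the 2015 season results from the local containerised ergast API
results = requests.get('http://localhost:8000/api/f1/2015.json')
print(results.json())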

So now I have a minimal working example of a PHP application powered by a MySQL database driving a simple web API. Which I can crib from for my own simple APIs…

As to how it could be improved, there are a couple of obvious things:

  • the containers are intended for personal, local use: the database is accessed as root and should be properly secured;
  • I haven’t got a pattern for updating the database when a new database image is released. The current workaround would be to destroy the containers, and the database volume, then rebuild the images and the containers.

PS to use the local ergast API server from R with my ergastR package, I needed to tweak the package to allow me to specify the path to the API server. I’ll post details of how I did that later…

Simple Text Analysis Using Python – Identifying Named Entities, Tagging, Fuzzy String Matching and Topic Modelling

Text processing is not really my thing, but here’s a round-up of some basic recipes that allow you to get started with some quick’n’dirty tricks for identifying named entities in a document, and tagging entities in documents.

In this post, I’ll briefly review some getting started code for:

  • performing simple entity extraction from a text; for example, when presented with a text document, label it with named entities (people, places, organisations); entity extraction is typically based on statistical models that rely on document features such as correct capitalisation of names to work correctly;
  • tagging documents that contain exact matches of specified terms: in this case, we have a list of specific text strings we are interested in (for example, names of people or companies) and we want to know if there are exact matches in the text document and where those matches occur in the document;
  • partial and fuzzy string matching of specified entities in a text: in this case, we may want to know whether something resembling a specified text string occurs in the document (for example, misspellings of a name);
  • topic modelling: the identification, using statistical models, of “topic terms” that appear across a set of documents.

You can find a gist containing a notebook that summarises the code here.

Simple named entity recognition

spaCy is a natural language processing library for Python that includes a basic model capable of recognising (ish!) names of people, places and organisations, as well as dates and financial amounts.

According to the spaCy entity recognition documentation, the built in model recognises the following types of entity:

  • PERSON People, including fictional.
  • NORP Nationalities or religious or political groups.
  • FACILITY Buildings, airports, highways, bridges, etc.
  • ORG Companies, agencies, institutions, etc.
  • GPE Countries, cities, states. (That is, Geo-Political Entities)
  • LOC Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT Objects, vehicles, foods, etc. (Not services.)
  • EVENT Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART Titles of books, songs, etc.
  • LANGUAGE Any named language.
  • LAW A legislation related entity(?)

Quantities are also recognised:

  • DATE Absolute or relative dates or periods.
  • TIME Times smaller than a day.
  • PERCENT Percentage, including “%”.
  • MONEY Monetary values, including unit.
  • QUANTITY Measurements, as of weight or distance.
  • ORDINAL “first”, “second”, etc.
  • CARDINAL Numerals that do not fall under another type.

Custom models can also be trained, but this requires annotated training documents.

#!pip3 install spacy
from spacy.en import English
parser = English()
example='''
That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future relationship with the single market and customs union; and calls on the Government to intervene and work with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent further job losses across Nestlé.
'''
#Code "borrowed" from somewhere?!
def entities(example, show=False):
    if show: print(example)
    parsedEx = parser(example)

    print("-------------- entities only ---------------")
    # if you just want the entities and nothing else, you can access the parsed example's "ents" property like this:
    ents = list(parsedEx.ents)
    tags={}
    for entity in ents:
        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
        term=' '.join(t.orth_ for t in entity)
        if term not in tags:
            tags[term]=[(entity.label, entity.label_)]
        else:
            tags[term].append((entity.label, entity.label_))
    print(tags)
entities(example)
-------------- entities only ---------------
{'House': [(380, 'ORG')], '300': [(393, 'CARDINAL')], 'Nestlé': [(380, 'ORG')], '\n York , Fawdon': [(381, 'GPE')], 'Halifax': [(381, 'GPE')], 'Girvan': [(381, 'GPE')], 'the Blue Riband': [(380, 'ORG')], 'Poland': [(381, 'GPE')], '\n': [(381, 'GPE'), (381, 'GPE')], 'the first three months of 2017': [(387, 'DATE')], '£ 21 billion': [(390, 'MONEY')], '0.4 per': [(390, 'MONEY')], 'the same period in 2016': [(387, 'DATE')], '156': [(393, 'CARDINAL')], 'York': [(381, 'GPE')], '\n the': [(381, 'GPE')], 'six': [(393, 'CARDINAL')], '2,000': [(393, 'CARDINAL')], 'the South East': [(382, 'LOC')], '110': [(393, 'CARDINAL')], 'Fawdon': [(381, 'GPE')], 'Newcastle': [(380, 'ORG')], 'a month of': [(387, 'DATE')], 'Article 50': [(21153, 'LAW')], 'EU': [(380, 'ORG')], 'UK': [(381, 'GPE')], 'GMB': [(380, 'ORG')], 'Unite': [(381, 'GPE')]}
q= "Bob Smith was in the Houses of Parliament the other day"
entities(q)
-------------- entities only ---------------
{'Bob Smith': [(377, 'PERSON')]}

Note that the way the models are trained typically relies on cues from the correct capitalisation of named entities.

entities(q.lower())
-------------- entities only ---------------
{}

polyglot

A simplistic, and quite slow, tagger, supporting limited recognition of Locations (I-LOC), Organizations (I-ORG) and Persons (I-PER).

#!pip3 install polyglot

##Mac ??
#!brew install icu4c
#I found I needed: pip3 install pyicu, pycld2, morfessor
##Linux
#apt-get install libicu-dev
!polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
from polyglot.text import Text

text = Text(example)
text.entities
[I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Halifax']),
 I-LOC(['Girvan']),
 I-LOC(['Poland']),
 I-PER(['Nestlé']),
 I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Newcastle']),
 I-ORG(['EU']),
 I-ORG(['EU']),
 I-ORG(['Government']),
 I-ORG(['GMB']),
 I-LOC(['Nestlé'])]
Text(q).entities
[I-PER(['Bob', 'Smith'])]

Partial Matching Specific Entities

Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs’ names, or a list of organisations or subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.

To do this, we can search a text for strings that exactly match any of the specified terms or where any of the specified terms match part of a longer string in the text.

Naive implementations can take a significant time to find multiple strings within a text, but the Aho-Corasick algorithm will efficiently match a large set of key terms against a particular text.

## The following recipe was hinted at via @pudo

#!pip3 install pyahocorasick
#https://github.com/alephdata/aleph/blob/master/aleph/analyze/corasick_entity.py

First, construct an automaton that identifies the terms you want to detect in the target text.

from ahocorasick import Automaton

A=Automaton()
A.add_word("Europe",('VOCAB','Europe'))
A.add_word("European Union",('VOCAB','European Union'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson'))
A.add_word("Boris",('PERSON','Boris Johnson'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson (LC)'))

A.make_automaton()
q2='Boris Johnson went off to Europe to complain about the European Union'
for item in A.iter(q2):
    print(item, q2[:item[0]+1])
(4, ('PERSON', 'Boris Johnson')) Boris
(12, ('PERSON', 'Boris Johnson')) Boris Johnson
(31, ('VOCAB', 'Europe')) Boris Johnson went off to Europe
(60, ('VOCAB', 'Europe')) Boris Johnson went off to Europe to complain about the Europe
(68, ('VOCAB', 'European Union')) Boris Johnson went off to Europe to complain about the European Union

Once again, case is important.

q2l = q2.lower()
for item in A.iter(q2l):
    print(item, q2l[:item[0]+1])
(12, ('PERSON', 'Boris Johnson (LC)')) boris johnson

We can tweak the automata patterns to capture the length of the string match term, so we can annotate the text with matches more exactly:

A=Automaton()
A.add_word("Europe",(('VOCAB', len("Europe")),'Europe'))
A.add_word("European Union",(('VOCAB', len("European Union")),'European Union'))
A.add_word("Boris Johnson",(('PERSON', len("Boris Johnson")),'Boris Johnson'))
A.add_word("Boris",(('PERSON', len("Boris")),'Boris Johnson'))

A.make_automaton()
for item in A.iter(q2):
    start=item[0]-item[1][0][1]+1
    end=item[0]+1
    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))
(4, (('PERSON', 5), 'Boris Johnson')) *Boris* Jo
(12, (('PERSON', 13), 'Boris Johnson')) *Boris Johnson* we
(31, (('VOCAB', 6), 'Europe')) to *Europe* to
(60, (('VOCAB', 6), 'Europe')) he *Europe*an 
(68, (('VOCAB', 14), 'European Union')) he *European Union*

Fuzzy String Matching

Whilst the Aho-Corasick approach will return hits for strings in the text that partially match the exact match key terms, sometimes we want to know whether there are terms in a text that almost match terms in a specific set of terms.

Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a fuzzy way with entities in a specified list.

fuzzyset

The python fuzzyset package will try to match a specified string to similar strings in a list of target strings, returning a single item from a specified target list that best matches the provided term.

For example, if we extract the name Boris Johnstone from a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.

A confidence value expresses the degree of match to terms in the fuzzy match set list.

import fuzzyset

fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)

#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
([(0.8666666666666667, 'Boris Johnson')],
 [(0.8333333333333334, 'Diane Abbott')],
 [(0.23076923076923073, 'Diane Abbott')])

fuzzywuzzy

If we want to try to find a fuzzy match for a term within a text, we can use the python fuzzywuzzy library. Once again, we specify a list of target items we want to try to match against.

from fuzzywuzzy import process
from fuzzywuzzy import fuzz
terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']

q= "Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day"
process.extract(q,terms)
[('Houses of Parliament', 90), ('Diane Abbott', 90), ('Boris Johnson', 86)]

By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:

process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)
[('Houses of Parliament', 90), ('Boris Johnson', 85)]

A range of fuzzy match scoring algorithms is supported, as illustrated in the short sketch after this list:

  • WRatio – measure of the sequences’ similarity between 0 and 100, using different algorithms
  • QRatio – Quick ratio comparison between two strings
  • UWRatio – a measure of the sequences’ similarity between 0 and 100, using different algorithms. Same as WRatio but preserving unicode
  • UQRatio – Unicode quick ratio
  • ratio
  • partial_ratio – ratio of the most similar substring as a number between 0 and 100
  • token_sort_ratio – a measure of the sequences’ similarity between 0 and 100 but sorting the token before comparing
  • partial_token_set_ratio
  • partial_token_sort_ratio – ratio of the most similar substring as a number between 0 and 100 but sorting the token before comparing
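
The scorers can also be called directly, which is a handy way of getting a feel for how they differ (the resulting scores are indicative):

from fuzzywuzzy import fuzz

print(fuzz.ratio('Boris Johnson', 'Boris Johnstone'))          #straightforward ratio over the whole strings
print(fuzz.partial_ratio('Boris Johnson', 'Boris Johnstone'))  #ratio of the most similar substring
print(fuzz.token_sort_ratio('Johnson Boris', 'Boris Johnson')) #tokens are sorted first, so this scores 100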

More useful, perhaps, is to return only those items that match above a particular confidence level:

process.extractBests(q,terms,score_cutoff=90)
[('Houses of Parliament', 90), ('Diane Abbott', 90)]

However, one problem with the fuzzywuzzy matcher is that it doesn’t tell us where in the supplied text string the match occurred, or what string in the text was matched.
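
One workaround is to roll our own simple locator, scoring each word n-gram in the text against the target term – a quick sketch, not part of the fuzzywuzzy API:

from fuzzywuzzy import fuzz

def locate_best_match(text, term):
    #Score the term against every word n-gram of the same length in the text
    words = text.split()
    n = len(term.split())
    best = (0, None)
    for i in range(len(words) - n + 1):
        candidate = ' '.join(words[i:i+n])
        score = fuzz.ratio(candidate, term)
        if score > best[0]:
            best = (score, candidate)
    return best

#Returns a (score, matched text span) pair - e.g. picking out the 'Boris Johnstone' span in q
print(locate_best_match(q, 'Boris Johnson'))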

The fuzzywuzzy package can also be used to try to deduplicate a list of items, returning the longest item from each set of duplicates. (It might be more useful if this were, optionally, the first item in the original list?)

names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']
process.dedupe(names, threshold=80)
['Joanna Lumley', 'Boris Johnstone', 'Diane Abbott']

It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:

import hashlib

clusters={}
fuzzed=[]
for t in names:
    #Use a different name here so we don't shadow the fuzzyset module imported earlier
    matches=process.extractBests(t,names,score_cutoff=85)
    #Generate a key based on the sorted members of the set
    keyvals=sorted(set([x[0] for x in matches]),key=lambda x:names.index(x),reverse=False)
    keytxt=''.join(keyvals)
    #hashlib.md5() expects bytes, so encode the key string first
    key=hashlib.md5(keytxt.encode('utf-8')).hexdigest()
    if len(keyvals)>1 and key not in fuzzed:
        clusters[key]=sorted(set([x for x in matches]),key=lambda x:names.index(x[0]),reverse=False)
        fuzzed.append(key)
for cluster in clusters:
    print(clusters[cluster])
[('Diane Abbott', 100), ('Diana Abbot', 87)]
[('Boris Johnson', 100), ('Boris Johnstone', 93), ('Boris Johnston', 96)]

OpenRefine Clustering

As well as running as a browser accessed application, OpenRefine also runs as a service that can be accessed from Python using the refine-client.py client library.

In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items.

#!pip install git+https://github.com/PaulMakepeace/refine-client-py.git
#NOTE - this requires a python 2 kernel
#Initialise the connection to the server using default or environment variable defined server settings
#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', '127.0.0.1'))
#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))
from google.refine import refine, facet
server = refine.RefineServer()
orefine = refine.Refine(server)
#Create an example CSV file to load into a test OpenRefine project
project_file = 'simpledemo.csv'
with open(project_file,'w') as f:
    for t in ['Name']+names+['Boris Johnstone']:
        f.write(t+ '\n')
!cat {project_file}
Name
Diane Abbott
Boris Johnson
Boris Johnstone
Diana Abbot
Boris Johnston
Joanna Lumley
Boris Johnstone
p=orefine.new_project(project_file=project_file)
p.columns
[u'Name']

OpenRefine supports a range of clustering functions:

- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic
- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}
- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}
clusters=p.compute_clusters('Name',clusterer_type='binning',function='cologne-phonetic')
for cluster in clusters:
    print(cluster)
[{'count': 1, 'value': u'Diana Abbot'}, {'count': 1, 'value': u'Diane Abbott'}]
[{'count': 2, 'value': u'Boris Johnstone'}, {'count': 1, 'value': u'Boris Johnston'}]

Topic Models

Topic models are statistical models that attempt to identify the different “topics” that occur across a set of documents.

Several python libraries provide a simple interface for the generation of topic models from text contained in multiple documents.

gensim

#!pip3 install gensim
#https://github.com/sgsinclair/alta/blob/e5bc94f7898b3bcaf872069f164bc6534769925b/ipynb/TopicModelling.ipynb
from gensim import corpora, models

def get_lda_from_lists_of_words(lists_of_words, **kwargs):
    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers
    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document
    tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency
    corpus_tfidf = tfidf[corpus]
    kwargs["id2word"] = dictionary # set the dictionary
    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling

def print_top_terms(lda, num_terms=10):
    txt=[]
    num_terms=min([num_terms,lda.num_topics])
    for i in range(0, num_terms):
        terms = [term for term,val in lda.show_topic(i,num_terms)]
        txt.append("\t - top {} terms for topic #{}: {}".format(num_terms,i,' '.join(terms)))
    return '\n'.join(txt)

To start with, let’s create a list of dummy documents and then generate word lists for each document.

docs=['The banks still have a lot to answer for the financial crisis.',
     'This MP and that Member of Parliament were both active in the debate.',
     'The companies that work in finance need to be responsible.',
     'There is a reponsibility incumber on all participants for high quality debate in Parliament.',
     'Corporate finance is a big responsibility.']

#Create lists of words from the text in each document
from nltk.tokenize import word_tokenize
docs = [ word_tokenize(doc.lower()) for doc in docs ]

#Remove stop words from the wordlists
from nltk.corpus import stopwords
docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]

Now we can generate the topic models from the list of word lists.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: parliament debate active
     - top 3 terms for topic #1: responsible work need
     - top 3 terms for topic #2: corporate big responsibility

The model is randomised – if we run it again we are likely to get a different result.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: finance corporate responsibility
     - top 3 terms for topic #1: participants quality high
     - top 3 terms for topic #2: member mp active
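
If repeatable results are needed, a fixed seed can be passed through to the model – recent gensim versions accept a random_state argument, which the **kwargs pass-through in the helper function above should forward to LdaModel:

#Pass a seed through to the underlying LDA model for repeatable runs (assumes gensim supports random_state)
topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20, random_state=42)
print( print_top_terms(topicsLda))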

Python “Natural Time Periods” Package

Getting on for a year ago, I posted a recipe for generating “Natural Language” Time Periods in Python. At the time, a couple of people asked whether it was packaged – and it wasn’t…

It’s still not on pip, but I have since made a package repo for it: python-natural-time-periods.

Install using: pip3 install --force-reinstall --no-deps --upgrade git+https://github.com/psychemedia/python-natural-time-periods.git

The following period functions are defined:

  • today(), yesterday(), tomorrow()
  • last_week(), this_week(), next_week(), later_this_week(), earlier_this_week()
  • last_month(), next_month(), this_month(), earlier_this_month(), later_this_month()
  • day_lastweek(), day_thisweek(), day_nextweek()

Here’s an example of how to use it:

import natural_time_periods as ntpd

ntpd.today()
>>> datetime.date(2017, 8, 9)

ntpd.last_week()
>>> (datetime.date(2017, 7, 31), datetime.date(2017, 8, 6))

ntpd.later_this_month()
>>> (datetime.date(2017, 8, 10), datetime.date(2017, 8, 31))

ntpd.day_lastweek(ntpd.MON)
>>> datetime.date(2017, 7, 31)

ntpd.day_lastweek(ntpd.MON, iso=True)
>>> '2017-07-31'

NHS Digital Organisation Data Service (ODS) Python / Pandas Data Loader

One of the nice things about the Python pandas data analysis package is that there is a small – but growing – amount of support for downloading data from third party sources directly as a pandas dataframe using the pandas-datareader.

So I thought I’d have a go at producing a package inspired by the pandas wrapper for the World Bank indicators API for downloading administrative data from the NHS Digital Organisation Data Service. You can find the package here: python-pandas-datareader-NHSDigital.

There are also examples of how to use the package here: python-pandas-datareader-NHSDigital demo notebook.

When Identifiers Don’t Align – NHS Digital GP Practice Codes and CQC Location IDs

One of the nice things about NHS Digital datasets is that there is a consistent use of identifier codes across multiple datasets. For example, GP Practice Codes are used to index particular GP practices across multiple datasets listed on both the GP and GP practice related data and General Practice Data Hub directory pages.

Information about GPs is also recorded by the CQC, who publish quality ratings across a wide range of health and social care providers. One of the nice things about the CQC data is that it also contains information about corporate groupings (and Companies House company numbers) and “Brands” with which a particular location is associated, which means you can start to explore the make up of the larger commercial providers.

Unfortunately, the identifier scheme used by the CQC is not the same as the one used by NHS Digital. This wouldn’t present much of a hurdle if a lookup table were available that mapped the codes for GP practices rated by the CQC onto the NHS Digital codes, but such a lookup table doesn’t appear to exist – or at least, is not easily discoverable.

So if we do want to join the CQC and NHS Digital datasets, what are we to do?

One approach is to look for common cribs across both datasets to bring them into partial alignment, and then try to do some exact matching within the nearly aligned sets. For example, both datasets include postcode data, so if we match on postcode, we can then try to find a higher level of agreement by exactly matching location names that share the same postcode, as sketched below.
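
Here’s a minimal sketch of that postcode-then-name alignment step using pandas – the column names and toy records are hypothetical, not the actual CQC or NHS Digital schemas:

import pandas as pd

#Hypothetical extracts from the two datasets
cqc_df = pd.DataFrame({'cqc_location_id': ['1-111111111', '1-222222222'],
                       'name': ['THE LINTHORPE SURGERY', 'KINGS MEDICAL CENTRE'],
                       'postcode': ['AA1 1AA', 'BB2 2BB']})
nhs_df = pd.DataFrame({'practice_code': ['X00001', 'X00002'],
                       'name': ['LINTHORPE SURGERY', 'KINGS MEDICAL CENTRE'],
                       'postcode': ['AA1 1AA', 'BB2 2BB']})

#Bring the datasets into partial alignment by matching on postcode...
candidates = pd.merge(cqc_df, nhs_df, on='postcode', suffixes=('_cqc', '_nhs'))

#...then look for exact name matches within each postcode-aligned set
exact_matches = candidates[candidates['name_cqc'] == candidates['name_nhs']]
unmatched = candidates[candidates['name_cqc'] != candidates['name_nhs']]
print(exact_matches[['cqc_location_id', 'practice_code', 'name_nhs']])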

This gets us so far, but exact string matching is likely to return a high degree of false negatives (i.e. unmatched items that should be matched). For example, it’s easy enough for us to assume that THE LINTHORPE SURGERY and LINTHORPE SURGERY are the same, but they aren’t exact matches. We could improve the likelihood of matching by removing common stopwords, as well as stopwords specific to this domain – THE, for example, or “CENTRE” – but using partial or fuzzy matching techniques is likely to work better still, albeit with the risk of introducing false positive matches (that is, strings that are identified as matching at a particular confidence level but that we would probably rule out as a match, for example HIRSEL MEDICAL CENTRE and KINGS MEDICAL CENTRE).

Anyway, here’s a quick sketch of how we might start to go about reconciling the datasets – comments appreciated about how to improve it further either here or in the repo issues: CQC and NHS Code Reconciliation.ipynb