Not Quite Serverless Reproducible Environments With Digital Ocean

I’ve been on something of a code rollercoaster over the last few days, fumbling with Jupyter server proxy settings in MyBinder, and fighting with OpenRefine, but I think I’ve stumbled on a new pattern that could be quite handy. In fact, I think it’s frickin’ ace, even though it’s just another riff, another iteration, on what we’ve been able to do before (e.g. along the lines of deploy to tutum, Heroku or zeit.now).

(It’s probably nothing that wasn’t out there already, but I have seen the light that issues forth from it.)

One of the great things about MyBinder is that it helps make your work reproducible. There’s a good, practical, review of why in this presentation: Reproducible, Reliable, Reusable Analyses with BinderHub on Cloud. One of the presentation claims is that it costs about $5000 a month to run MyBinder, covered at the moment by the project, but following the reference link I couldn’t see that number anywhere. What I did see, though, was something worth bearing in mind: “[MyBinder] users are guaranteed at least 1GB of RAM, with a maximum of 2GB”.

For OpenRefine running in MyBinder, along with other services, this shows why it may struggle at times…

So, how can we help?

And how can we get around not knowing what other stuff repo2docker, the build agent for MyBinder, might be putting into the server, or not being able to use Docker compose to link several services together in the Binderised environment, or having to run MyBinder containers in public (although it looks as though evil Binder auth in general may now be available)?

One way would be for institutions to chip in the readies to help keep the public MyBinder service free. Another way could be a sprint on a federated Binderhub in which institutions could chip in server resource. Another would be for institutions to host their own Binderhub instances, either publicly available or just available to registered users. (In the latter case, it would be good if the institutions also contributed developer effort, code, docs or community development back to the Jupyter project as a whole.)

Alternatively, we can roll our own server. But setting up a Binderhub instance is not necessarily the easiest of tasks (I’ve yet to try it…) and isn’t really the sort of thing your average postgrad or data journalist who wants to run a Binderised environment should be expected to have to do.

So what’s the alternative?

To my mind, Binderhub offers a serverless sort of experience, though that’s not to say no servers are involved. The model is that I can pick a repo, click a button, a server is started, my image built and a container launched, and I get to use the resulting environment. Find repo. Click button. Use environment. The servery stuff in the middle — the provisioning of a physical web server and the building of the environment — that’s nothing I have to worry about. It’s seamless and serverless.

Another thing to note is that MyBinder use cases are temporary / ephemeral. Launch Binderised app. Use it. Kill it.

This stands in contrast to setting services running for extended periods of time, and managing them over that period, which is probably what you’d want to do if you ran your own Binderhub instance. I’m not really interested in that. I want to: launch my environment; use it; kill it. (I keep trying to find a way of explaining this “personal application server” position clearly, but not with much success so far…)

So here’s where I’m at now: a nearly, or not quite, serverless solution; a bring your own server approach, in fact, using Digital Ocean, which is the easiest cloud hosting solution I’ve found for the sorts of things I want to do.

It’s based around the User data text box in a Digital Ocean droplet creation page:

If you pop a shell script in there, it will run the code that appears in that box once the server is started.

But what code?

That’s the pattern I’ve started exploring.

Something like this:

#!/bin/bash

#Optionally:
export JUPYTER_TOKEN=myOwnPA5%w0rD

#Optionally:
#export REFINEVERSION=2.8

GIST=d67e7de29a2d012183681778662ef4b6
git clone https://gist.github.com/$GIST.git
cd $GIST
docker-compose up -d

which will grab a set of files saved as a public gist and use them to download and install an OpenRefine server inside a token-protected Jupyter notebook server. (The OpenRefine server runs via a Jupyter server proxy; see also OpenRefine Running in MyBinder, Several Ways… for various ways of running OpenRefine behind a Jupyter server proxy in MyBinder.)

Or this (original gist):

#!/bin/bash

#Optionally:
#export JUPYTER_TOKEN=myOwnPA5%w0rD

GIST=8fa117e34c62b7f80b6c595b8ba4f488

git clone https://gist.github.com/$GIST.git
cd $GIST

docker-compose up -d

that will download and launch a docker-compose defined set of services:

  • a Jupyter notebook server, seeded with a demo notebook and various SQL magics;
  • a Postgres server (empty; I really need to add a fragment showing how to seed it with data, though there’s a rough sketch just after this list, and you should also be able to figure it out from here: Running a PostgreSQL Server in a MyBinder Container);
  • an AgensGraph server; AgensGraph is a graph database built on Postgres. The demo notebook currently uses the first part of the AgensGraph tutorial to show how to load data into it.
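
Here’s the sort of thing I have in mind for seeding the Postgres database, as a rough, untested sketch rather than a recipe taken from the gist; the service name (postgres), user, database name and seed files are all assumptions that would need matching to the actual docker-compose.yml:

#Run a SQL file against the Postgres container (service name assumed to be "postgres")
docker-compose exec -T postgres psql -U postgres -d postgres < seed.sql

#Or copy a CSV into the container and load it with COPY
docker cp mydata.csv $(docker-compose ps -q postgres):/tmp/mydata.csv
docker-compose exec postgres psql -U postgres -d postgres \
  -c "CREATE TABLE IF NOT EXISTS mytable (id INT, name TEXT); COPY mytable FROM '/tmp/mydata.csv' WITH CSV HEADER;"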

(The latter example includes a zip file that you can’t upload via the Gist web interface; so here’s a recipe for adding binary files (zip files, image files) to a Github gist.)

So what do you need to do to get the above environments up and running?

  • go to Digital Ocean (https://www.digitalocean.com/; this link will get you $100 credit if you need to create an account);
  • create a droplet;
  • select ‘one-click’ type, and the docker flavour;
  • select a server size (get the cheap 3GB server and the demos will be fine);
  • select a region (or don’t); I generally go for London cos I figure it’s locallest;
  • check the User data check box and paste in one of the above recipes (make sure the hashbang (#!/bin/bash) is on the very first line, with no blank lines above it);
  • optionally name the image (for your convenience and lack of admin panel eyesoreness);
  • click create;
  • copy the IP address of the server that’s created;
  • after 3 or 4 minutes (it may take some time to download the app containers into the server), paste the IP address into a browser location bar;
  • when presented with the Jupyter notebook login page, enter the default token (letmein; or the token you added in the User data script, if you did), or use it to set a different password at the bottom of the login page;
  • use your apps…
  • destroy the droplet (so you don’t continue to pay for it).

If that sounds too hard / too many steps, there are some pictures to show you what to do in the Creating a Digital Ocean Docker Droplet section of this post.

It’s really not that hard…

Though it could be easier. For example, if we had a “deploy to Digital Ocean” button that took something like the form: http://deploygist.digitalocean.com/GIST and that looked for user_data and maybe other metadata files (region, server size, etc) to set a server running on your account and then redirected you to the appropriate webpage.
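
In the meantime, you can get part way there by scripting the droplet creation with Digital Ocean’s doctl command line client. Here’s a rough sketch (the image and size slugs are assumptions; check doctl compute image list-application and doctl compute size list for current values, and save one of the above recipes as user_data.sh first):

#Create a Docker one-click droplet, passing in the user data script
doctl compute droplet create mydemo \
  --region lon1 \
  --image docker-18-04 \
  --size s-1vcpu-3gb \
  --user-data-file ./user_data.sh \
  --wait

#Look up the public IP address of the new droplet
doctl compute droplet list --format Name,PublicIPv4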

We don’t need to rely on just web clients either. For example, here’s a recipe for Connecting to a Remote Jupyter Notebook Server Running on Digital Ocean from Microsoft VS Code.

The next thing I need to think about is preserving state. This looks like it may be promising in that regard? Filestash [docs]. This might also be worth looking at: Pydio. (Or this: Graphite h/t @simonperry.)

For anyone still reading and still interested, here are some more reasons why I think this post is a useful one…

The linked gists are both based around Docker deployments (it makes sense to use Docker, I think, because a lot of hard-to-install software is already packaged in Docker containers), although they demonstrate different techniques:

  • the first (OpenRefine) demo extends a Jupyter container so that it includes the OpenRefine server; OpenRefine then hides behind the Jupyter notebook auth and is proxied using Jupyter server proxy;
  • the second (AgensGraph) demo uses Docker compose. The database services are headless and not exposed.

What I have tried to do in the docker-compose.yml and Dockerfile files is show a variety of techniques for getting stuff done. I’ll comment them more liberally, or write a post about them, when I get a chance. One thing I still need to do is a demo using nginx as a reverse proxy, with and without simple http auth. One thing I’m not sure how to do, if indeed it’s doable, is proxy services from a separate container using Jupyter server proxy; nginx would provide a way around that.
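
For what it’s worth, here’s the sort of pattern I have in mind for the nginx piece, as a rough sketch rather than a tested recipe; the backend container name (myservice) and port (3333) are placeholders, and htpasswd comes from the apache2-utils package:

#Create a username / password file on the host
htpasswd -b -c ./htpasswd myuser mypassword

#Write a minimal nginx config that proxies the backend behind basic auth
cat > ./default.conf << 'EOF'
server {
  listen 80;
  auth_basic "Protected...";
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://myservice:3333;
  }
}
EOF

#Run the stock nginx container, linked to the backend container
docker run -d -p 80:80 --link myservice:myservice \
  -v $PWD/default.conf:/etc/nginx/conf.d/default.conf \
  -v $PWD/htpasswd:/etc/nginx/.htpasswd \
  nginx

One appeal of this approach is that the nginx container comes straight off the shelf; all the customisation lives in the two mounted files.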

Adding Zip Files to Github Gists

Over the years, I’ve regularly used Github gists as a convenient place to post code fragments. Using the web UI, it’s easy enough to create new text files. But how do you add images, or zip files to a gist..?

…because there’s no way I can see of doing it using the web UI?

But a gist is just a git repo, so we should be able to commit binary files to it.

Via this gist — How to add an image to a gist — here’s a simple recipe. It requires that you have git installed on your computer…

#Your gist ID (GISTID) appears in the gist URL: https://gist.github.com/USERNAME/GISTID
GIST=GISTID

git clone https://gist.github.com/$GIST.git

#MYBINARYFILENAME is something like mydata.zip or myimage.png
cp MYBINARYPATH/MYBINARYFILENAME $GIST/

cd $GIST
git add MYBINARYFILENAME

git commit -m "MY COMMIT MESSAGE"

git push origin master
#If prompted, provide your Github credentials associated with the gist

Handy…

Running OpenRefine in the Clear on Digital Ocean

In a couple of earlier posts, I’ve described how to get OpenRefine up and running remotely over the web by installing the OpenRefine server onto a Digital Ocean Linux server and running it there behind a simple authenticating proxy (Running OpenRefine On Digital Ocean Using Simple Auth and the more automated Authenticated OpenRefine Server on Digital Ocean, Redux).

In this post I’ll show how to set up a simple OpenRefine server, without authentication, using Docker (I’ll show how to add in the authenticating nginx proxy in a follow on post).

Docker is a virtualisation technology that heavily draws on the idea of “containers”, isolated computational environments that provide just enough operating system to run a particular application within them.

As well as hosting raw Linux servers, Digital Ocean also provides Linux-servers-with-docker as a one-click application.

Here’s how to start a docker machine on Digital Ocean.

Creating a Digital Ocean Docker Droplet

First up, create a new droplet as a one-click app, selecting docker as the one-click application type:

To give ourselves some space to work with, I’m going to choose the 3GB server (it may work with default settings in a 2GB server, or it may ruin your day…). It’s metered by the hour, but it’ll still only cost a few pennies for a quick demo. (You can also get $100 free credit as a new user if you sign up here.)


Select a data center region (I typically go for a local one):

If you want to, add your SSH key (recipe here, but it’s not really necessary: the ssh key just makes it easier for you to login to the server from your own computer if you need to. If you haven’t heard of ssh keys before, ignore this step!)

Hit the big green button to create your droplet (if you want to, give the server a nicer hostname first…).

Accessing the Digital Ocean Droplet Server Terminal

Your one-click docker server will now start up. Once it’s there (it should take less than a minute) click through to its admin page. Assuming you haven’t added ssh keys, you’ll need to log in through the console. The login details for your server should have been emailed to the email address associated with your Digital Ocean account. Use them to login.

On first login, you’ll be prompted to change the password (it was emailed to you in plain text after all!)

If you choose a really simple replacement password, you may need to choose another one. Also note that the (current) UNIX password was the one you were emailed, so you’ll essentially be providing this password twice in quick succession (once for the first login, then again to authorise the enforced password change). Copy and pasting the password into the console from your email should work…

Once you’ve changed your password, you’ll be logged out and you’ll have to log back in again with your new password. (Isn’t security a faff?! That’s why ssh keys…!;-)

Now you get to install and launch OpenRefine. I’ve got an example image here, and the recipe for creating it here, but you don’t really need to look at that if you trust me…

What you do need to do is run:

docker run -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo

What this command does is download and run the container psychemedia/openrefinedemo, naming it (purely for our convenience) as openrefine.

You can learn how to create an OpenRefine docker image here: How to Create a Simple Dockerfile for Building an OpenRefine Docker Image.

The -d flag runs the container in “detached”, standalone mode (in the background, essentially). The -p 3333:3333 is read as -p PUBLICPORT:INTERNALPORT. The OpenRefine server is started on INTERNALPORT=3333 and we’re mapping it onto public port 3333, which is the port we’ll use in the URL.

The container will take a few seconds to download if this is the first time you’ve called for it:

and then it’ll print out a long id number once it’s launched and running in the background.

(You can check it’s running by running the command docker ps.)

In both the terminal and the droplet admin pages, as well as the droplet status line in the current droplet listing pages, you should see the public IP address associated with the droplet. Copy that address into your browser and add the port mapping (:3333). You should now be able to see a running version of OpenRefine. (And so should anyone else who wanders by that URL:PORT combination…)

Let’s now move the application to another port. We could do this by launching another container, with a new unique name (container names, when we assign them, need to be unique) and assigned to another port. (The OpenRefine internal service port will remain the same). For example:

docker run -d -p 3334:3333 --name openrefine2 psychemedia/openrefinedemo

This creates a new container running a fresh instance of the OpenRefine server. You should see it on IPADDRESS:3334.

(Alternatively we can omit the name and a random one will be assigned, for example, docker run -d -p 3335:3333 psychemedia/openrefinedemo)

Note that the docker image does not need to be downloaded again. We simply reuse the one we downloaded previously, and spawn a new instance of it as a new container.

Each container does take up memory though, so kill the original container:

docker kill openrefine

and remove it:

docker rm openrefine
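
(If you want to check what’s still running, and how much memory each container is grabbing, a couple of quick commands from the console will tell you:)

#List running containers
docker ps

#One-off snapshot of memory / CPU usage per container
docker stats --no-stream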

For a last quick demo, let’s create a new instance of the container, once again called openrefine (assuming we’ve removed the one previously called that) and run it on port 80, the default http port, which means we should be able to see it directly by going to just the IPADDRESS (with no port specified) in our browser:

docker run -d -p 80:3333 --name openrefine psychemedia/openrefinedemo

When you’re done, you can halt the droplet (in which case, you’ll keep on paying rent for it) or destroy it (which means you won’t be billed for any additional hours, or parts thereof, on top of the time you’ve already been running the droplet):

You don’t need to tidy up around the docker containers, they’ll die with the droplet.

So, not all that hard, is it? Probably a darn sight easier than trying to get anything out of your local IT unit?!

In the next post, I’ll show how to combine the container with another one containing nginx to provide some simple authentication. (There are lots of prebuilt containers out there we can just take “off-the-shelf”, and nginx is one of them.) I’ll maybe also have a look at how you might persist projects in hibernating container / droplet, perhaps look at how we might be able to upload files that OpenRefine can work on, and maybe even try to figure out a way to simply synch your project files from Digital Ocean to your own file storage location somewhere. Maybe…

PS third party nginx proxy example: https://github.com/beevelop/docker-nginx-basic-auth

Converting Pandas Generated HTML Data Tables to PNG Images

Over the weekend, I noticed the Dakar 2019 rally was on, which resulted in my spending Sunday evening putting a scraper together to grab timing data down from the official website (notebook code here).

The data on its own is all a bit “so what?”; it only comes alive when you start playing with it. One of the displays I was still tinkering with at the end of last year’s WRC season was a tabular stage report that tries to capture a chunk of information from stage timing, such as split times, so it made sense to start riffing on that.

The Rally Dakar timing screen presents split times like this:

You can get a view of either the running time in stage at each split / waypoint, or the gap, which is to say, the time difference to the person who was fastest at each split. (I think sometimes Gap times may report the time difference to the person who ranked first on the stage overall, rather than the gap to the person ranked first at a particular split.)

One of the things that interests me (from a data storytelling point of view) is how much time a driver gains, or loses, within a split compared to other drivers. We can use this to spot parts of the stage where a driver has hit a problem, or pushed hard.

The sort of display I’ve been working up looks, at least with the Dakar data, like this so far (there are a few columns missing compared to my WRC tables, but there’s also an extra one: the in-line bimodal sparkline chart).

This particular view displays split times rebased relative to Peterhansel (it’s easy enough to generate views rebased relative to any other specified driver). That is, the table shows how much time Peterhansel gained/lost relative to each other driver at each waypoint. The table is ordered by stage rank. The columns on the left show how much time was gained/lost going from one waypoint to the next. The columns on the right show how the gap relative to each driver evolved over the stage. The inline chart tracks the gap evolution.

The table is a styled pandas table, rendered as HTML. After applying styling, you can get a preview in a notebook using something of the form:

from IPython.display import display, HTML
display( HTML( df.style.render() ) )

I’ve previously posted a recipe for Grabbing Screenshots of folium Produced Choropleth Leaflet Maps from Python Code Using Selenium so here’s the latest iteration of my code fragment (which built on the previous example) for taking a chunk of HTML and using selenium to open it in a browser and grab a screenshot of it.

The code is h/t to several Stack Overflow posts.

import os
import time
from selenium import webdriver

#Via https://stackoverflow.com/a/52572919/454773
def setup_screenshot(driver,path):
    ''' Grab screenshot of browser rendered HTML.
        Ensure the browser is sized to display all the HTML content. '''
    # Ref: https://stackoverflow.com/a/52572919/
    original_size = driver.get_window_size()
    required_width = driver.execute_script('return document.body.parentNode.scrollWidth')
    required_height = driver.execute_script('return document.body.parentNode.scrollHeight')
    driver.set_window_size(required_width, required_height)
    # driver.save_screenshot(path)  # has scrollbar
    driver.find_element_by_tag_name('body').screenshot(path)  # avoids scrollbar
    driver.set_window_size(original_size['width'], original_size['height'])

def getTableImage(url, fn='dummy_table', basepath='.', path='.', delay=5, height=420, width=800):
    ''' Render HTML file in browser and grab a screenshot. '''
    browser = webdriver.Chrome()

    browser.get(url)
    #Give the html some time to load
    time.sleep(delay)
    imgpath='{}/{}.png'.format(path,fn)
    imgfn = '{}/{}'.format(basepath, imgpath)
    imgfile = '{}/{}'.format(os.getcwd(),imgfn)

    setup_screenshot(browser,imgfile)
    browser.quit()
    os.remove(imgfile.replace('.png','.html'))
    #print(imgfn)
    return imgpath

def getTablePNG(tablehtml, basepath='.', path='testpng', fnstub='testhtml'):
    ''' Save HTML table as: {basepath}/{path}/{fnstub}.png '''
    if not os.path.exists(path):
        os.makedirs('{}/{}'.format(basepath, path))
    fn='{cwd}/{basepath}/{path}/{fn}.html'.format(cwd=os.getcwd(), basepath=basepath, path=path,fn=fnstub)
    tmpurl='file://{fn}'.format(fn=fn)
    with open(fn, 'w') as out:
        out.write(tablehtml)
    return getTableImage(tmpurl, fnstub, basepath, path)

#call as: getTablePNG(s)
#where s is a string containing html, eg s = df.style.render()

The png image is saved as an image file that can be embedded in other HTML pages, shared via soshul meeja, etc…

PS here’s what looks to be another route (though I haven’t tried it yet..) using imgkit with xvfb in a headless server environment: https://github.com/fomightez/dataframe2img

Authenticated OpenRefine Server on Digital Ocean, Redux

Following on from yesterday’s recipe, Running OpenRefine On Digital Ocean Using Simple Auth, here’s an even easier way that doesn’t require ssh and doesn’t require console access: just copy and paste some set-up info into a form on the Digital Ocean droplet creation page.

Every little everything helps… this link will set new users up with $100 of free credit on Digital Ocean; somewhere down the line I may get a small amount of affiliate link powered Digital Ocean credit to help keep my own server costs covered…

The Digital Ocean droplet creator has an option to add a User data start-up script when the droplet is created (docs).

This means we can provide a script that will download and install everything we need, including a file to define a service that will run OpenRefine.

Copy and paste the script in the gist that can be found here into the user data area before you create the droplet. The code will be executed once the droplet is up and running and should install the nginx proxy and the OpenRefine application. A default user (test) and password (letmein) are defined in the script.

If you are trusting of the gist, you can install it more succinctly from the gist repository, and you can also define your own user and password. To take this installation route, paste the following code into the user data area:

#!/bin/bash

#We can over-ride the default credentials
USER_NAME=testuser
USER_PWD=testpwd

#Get the latest version of installer script and run it

source <( curl -s https://gist.githubusercontent.com/psychemedia/f256960c112347dd410c2beec8ce05e3/raw/ )

(There should be no spaces between the less than and open bracket characters in the source command.)

For an explicit link to a particular version of the script, go to the gist, and copy the link for the raw version of the latest file or the version you want.

Note that this route requires you to place considerable trust in the publisher of the remote installation script. Do you really trust a file with such a dodgy filename?
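
If you’d rather check what you’re signing up for before you paste anything into the user data box, you can always pull the script down and read it first; something like:

#Download the installer script and have a read through it before deciding to use it
curl -sL https://gist.githubusercontent.com/psychemedia/f256960c112347dd410c2beec8ce05e3/raw/ > installer.sh
less installer.sh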

It will probably take two or three minutes to download and install everything, so don’t be too worried if you don’t see anything at the provided IP address immediately. Just keep refreshing the page…

The first sign of life you should see is the nginx default page…

Wait a few moments and reload the page… Next up is a holding page as the OpenRefine application is installed and the service started…

This page should automatically refresh itself every 5 seconds or so. When the service is up and running, you should get a page like this:

Click the link and you should be prompted with the user/password authenticator challenge. Provide the username/password and you should be taken to your OpenRefine server.

Running OpenRefine On Digital Ocean Using Simple Auth

As a cloud host, Digital Ocean provides a really easy way in to getting services up and running on the web.

Here’s a quick recipe for getting OpenRefine up and running behind a simple authentication scheme.

Creating a Server With Digital Ocean

First up, create a Digital Ocean account if you don’t already have one (this link will get you started with $100 free credit).

Creating and launching a server is easy… Select Create then Droplet, choose the server type you want — let’s use a simple Ubuntu box — and choose the server size. For lots of quick tests, I use the smallest box, but from experience I think OpenRefine prefers more than 2GB of memory.

Really… If you pick a 2GB server, you may find that OpenRefine hangs on start and ruins your day when you end up trying to debug what you think are other problems…. Be warned, kids… stay unfrustrated out there…

The servers are charged at a metered rate, and you can stop them any time, so for a quick test, it’ll cost you pennies… (A $100 credit can go a long way…!)

Next, choose a data center region; I generally pick a local one…

You also have the option of adding an ssh key. This makes life much easier when trying to log in to the server from your own machine using ssh (you can just run ssh root@IP_ADDRESS and it’ll log you straight in; there’s a recipe for setting up an SSH key here).

If you don’t want to set up an SSH key, a root password will be emailed to you. You can use this password to log in to your server via a web terminal, which means you can do everything via a web UI if you need to…

Create your server by clicking the big green button…

It should only take a few seconds to start up… And when it has, you’ll be presented with its public IP address.

If you need a web terminal, click through on the server name, and you should see a link to launch a web console.

Installing OpenRefine

From the console, we can install all we need to run OpenRefine. This is a minimum viable example — we should probably find a better place to install OpenRefine, and may want to run it as a particular user with limited permissions. Working as root with everything wide open makes life easier, though not necessarily safer…!

OpenRefine requires a Java environment, so we need to install that:

apt-get update && apt-get install -y openjdk-8-jre

We can download the OpenRefine application via the command-line using a command of the form wget -q -O DOWNLOADED_FILE_NAME URL; the download link for each release can be found on the OpenRefine releases page:

wget -q -O openrefine-2.8.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/2.8/openrefine-linux-2.8.tar.gz

The downloaded file is provided as a tar archive file, which we need to unpack:

tar xzf openrefine-2.8.tar.gz

This will unbundle the files into the directory ./openrefine-2.8.

Let’s set an environment variable specifying a working directory in which to place the OpenRefine project files:

OPENREFINE_DIR="$HOME/openrefine"

And create that directory:

mkdir -p $OPENREFINE_DIR

You should now be able to run OpenRefine in the background using the command:

nohup openrefine-2.8/refine -p 3333 -d $OPENREFINE_DIR > /dev/null 2>&1 &

This will run OpenRefine on port 3333. If you copy the IP address of your Digital Ocean server, and go to http://IP.ADDRESS:3333, you should see OpenRefine running there. (Note that if you’re in Chrome, Google may well tell you that the address is dangerous…)
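
As a quick sanity check from the droplet console, you can also poke the server locally to make sure something is answering on port 3333 (you may need to apt-get install -y curl first):

#Check the OpenRefine server responds on the internal port
curl -I http://127.0.0.1:3333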

Adding Simple Authentication

The OpenRefine server is running as a public service on the public web. If you want to add a simple layer of authentication, you can add a web server proxy to the server that will prompt for a password when a new visit is made to the server.

One of the easiest proxies to get up and running is nginx. Let’s install it, along with a simple Apache toolkit that will help us create a simple password:

apt-get install -y nginx apache2-utils

Now create a simple user/password combination. Ever secure, I’ll go with user test and password letmein:

htpasswd -b -c /etc/nginx/.htpasswd test letmein

Now we need to define the proxy. A Digital Ocean tutorial (How To Install Nginx on Ubuntu 18.04) describes how to set up a firewall – I’m selecting the 'Nginx Full' option because I’m working via SSH, but if you’re working in the web terminal, the more restrictive 'Nginx HTTP' may be more appropriate:

sudo ufw allow 'Nginx Full'

If you try loading the OpenRefine server on port 3333, you should now find that it’s blocked: the firewall is only letting web traffic through on port 80.

We now need to open access back up to the OpenRefine server, albeit via a password challenge. The following will create a default nginx configuration file that will expose the OpenRefine service running on port 3333 via default http port 80, mediated by a simple authorisation challenge:

config='''
server {
  listen 80;
  auth_basic Protected...;
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://127.0.0.1:3333;
  }
}
'''

echo "$config" > /etc/nginx/sites-available/default

Restart the nginx proxy to put the new settings into effect:

nginx -s reload

If you now go to http://IPADDRESS you should be presented with a challenge. Enter the credentials you defined, and you should see your OpenRefine server:
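
If you want to check the proxy is doing its job from the command line rather than the browser, a couple of curl calls should show the challenge in action (swap in your own droplet IP address and credentials):

#Without credentials, the proxy should refuse entry (HTTP 401)
curl -I http://IPADDRESS/

#With credentials, you should get through to the OpenRefine server
curl -I -u test:letmein http://IPADDRESS/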

Finishing Up

When you’ve finished your session, you can destroy the droplet. This will tear the server down and you won’t be billed for it anymore.

Alternatively, you can switch the droplet off, but keep it in a shutdown state that you can restart in the future.

However, as the above prompt suggests, you will continue to be billed, even if the service is not running, because it is still consuming Digital Ocean resources.

IPython Magic for Controlling the V-REP Robot Simulator from Jupyter notebooks

Whilst exploring how we might be able to use Jupyter notebooks hooked up to the Coppelia Robotics V-REP robot simulator, it struck me that we needed a fair amount of boilerplate stuff to get the simulator loaded with an appropriate scene file and a connection made to the simulator from the notebook so we could script the robot actions from the notebooks.

My first approach to trying to simplify presentation was to create some “self-documenting” notebooks that could be used to set-up necessary environmental variables and import default classes and functions:

The %run cell magic loads and runs the referenced notebooks, which can also be inspected (and modified) by students.

(To try to minimise the risk of students introducing breaking changes into the imported notebooks, we could also lock the cells as read-only in the notebooks. Whilst this requires an extension to be installed to implement the read-only behaviour, the intention is that we distribute a customised Jupyter notebook environment to students.)

The loadSceneRelativeToClient() function loads the specified scene into the simulator. Note that this scene should contain a robot model. Once the connection to the simulator is made, a robot object can be instantiated using the connection details. The robot class should contain the definitions required to control the robot model in the loaded scene.

Setting up the connection to the simulator is a bit of a faff, and when code cell execution is stopped we can get an annoying KeyboardInterrupt report:

We can defend against the KeyboardInterrupt by wrapping the code execution in a try/except block:

try:
    with VRep.connect("127.0.0.1", 19997) as api:
        robot = PioneerP3DXL(api)
        while True:
            #do stuff
            pass
except KeyboardInterrupt:
    pass

But it struck me that it would be much nicer to be able to use some magic along the lines of the following, in which we set up the simulator with a scene, identify the robot we want to control, automatically connect to the simulator and then just run the robot control program:

So here’s a first attempt at some IPython cell magic to do that:

from __future__ import print_function
from pyrep import VRep
from IPython.core.magic import (Magics, magics_class, line_magic,
                                cell_magic, line_cell_magic)
import shlex

# The class MUST call this class decorator at creation time
@magics_class
class Vrep_Sim(Magics):

    @cell_magic
    def vrepsim(self, line, cell):
        "V-REP magic"

        #Use shlex.split to handle quoted strings containing a space character
        loadSceneRelativeToClient(shlex.split(line)[0])

        #Get the robot class from the string
        robotclass=eval(shlex.split(line)[1])

        #Handle default IP address and port settings; grab from globals if set
        ip = self.shell.user_ns['vrep_ip'] if 'vrep_ip' in self.shell.user_ns else '127.0.0.1'
        port = self.shell.user_ns['vrep_port'] if 'vrep_port' in self.shell.user_ns else 19997

        #The try/except block exits from a keyboard interrupt cleanly
        try:
            #Create a connection to the simulator
            with VRep.connect(ip, port) as api:
                #Set the robot variable to an instance of the desired robot class
                robot = robotclass(api)
                #Execute the cell code - define robot commands as calls on: robot
                exec(cell)
        except KeyboardInterrupt:
            pass

    #@line_cell_magic
    @line_magic
    def vrep_robot_methods(self, line):
        "Show methods"
        robotclass = eval(line)
        methods = [method for method in dir(robotclass) if not method.startswith('_')]
        print('Methods available in {}:\n\t{}'.format(robotclass.__name__ , '\n\t'.join(methods)))

#Could install as magic separately
ip = get_ipython()
ip.register_magics(Vrep_Sim)

I’ve also added a bit of line magic to display the methods defined on a robot model class:

The tension now is a pedagogical one: for example, should I be providing students with the robot model, or should they be building up the various control functions (.move_forwards(), .turn_left(), etc.) themselves?

I’m also wondering whether I should push the while True: component into the magic? On balance, I think students need to see it in their code block because getting them to think about control loops rather than one shot execution of command statements is something they often don’t get the first, or even second, time round. But for reducing clutter, it’d make for far cleaner cell block code.

Sharing Folders into VMs on Different Machines Using Dropbox, Google Drive, Microsoft OneDrive etc

Ever since I joined the OU, I’ve believed in trying to deliver distance education courses in an agile and responsive way, which is to say: making stuff up for students whilst the course is in presentation.

This is generally not done (by course/module teams at least) because the aim of most course/module teams is to prepare the course so thoroughly that it can “just” be presented to students.

Whatever.

I personally think we should try to improve the student experience of the course as it presents if we can by being responsive and reactive to student questions and issues.

So… TM351, the data management course that uses a VM, has started again, and issues / questions are already starting to hit the forums.

One of the questions – which I’d half noted but never really thought through in previous presentations (my not iterating/improving the course experience in, or between, previous presentations)  – related to sharing Jupyter notebooks across different machines using Google Drive (equally, Dropbox or Microsoft OneDrive).

The VirtualBox VM we use is fired up using the vagrant provisioner. A Vagrantfile defines various configuration settings – which ports are exposed by the VM, for example. By default, the contents of the folder in which vagrant is started up are shared into the VM. At the same time, vagrant creates a hidden .vagrant folder that contains state relating to the instance of that VM.

The set up on a single machine is something like this:

If a student wants to work across several machines, they need to share their working course files (Jupyter notebooks, and so on) but not the VM machine state. Which is to say, they need a set up more like the following:

For students working across several machines, it thus makes sense to have all project files in one folder and a separate .vagrant settings folder on each separate machine.

Checking the vagrant docs, it seems as if this is quite manageable using the synced folder configuration settings.

The default shares the current project folder (containing the Vagrantfile, and from which vagrant is run), which I’m guessing is a setting something like:

config.vm.synced_folder "./", "/vagrant"

By explicitly setting this parameter, we can decide how we want the mapping to occur. For example:

config.vm.synced_folder "/PATH/ON/HOST", "/vagrant"

allows you to specify the folder you want to share into the VM. Note that the /PATH/ON/HOST folder needs to be created before trying to share it.
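
For example, on the host (a trivial step, but an easy one to forget):

#Create the host folder before sharing it into the VM
mkdir -p /PATH/ON/HOST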

To put the new shared directory into effect, reload and reprovision the VM. For example:

vagrant reload --provision

Student notebooks located in the notebooks folder of that shared directory should now be available in the VM. Furthermore, if the shared folder is itself in a webshared folder (for example, a synced Dropbox, Google Drive or Microsoft OneDrive folder) it should be available wherever that folder is synched to.

For example, on a Mac (where ~ is an alias to my home directory), I can create a directory in my dropbox folder ~/Dropbox/TM351VMshare and then map this into the VM by adding the following line to the Vagrantfile:

config.vm.synced_folder "~/Dropbox/TM351VMshare", "/vagrant"

Note the possibility of slight confusion – the shared folder will not now be the folder from which vagrant is run (unless the folder you are running from is /PATH/ON/HOST).

Furthermore, the only thing that needs to be in the folder from which vagrant is run is the Vagrantfile and the hidden .vagrant folder that vagrant creates.

Fingers crossed this recipe works…;-)

Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API

A couple of days ago, I noticed that Chris Newell had shared the code for his ergast Formula 1 motor racing results database that I use heavily for my F1DataJunkie doodles. The code is a PHP application that pulls data from the ergast database, which is regularly shared as a MySQL database dump. So I raised an issue asking if a docker containerised version of the application was available, and Chris replied that there wasn’t, but if I wanted to look at creating one…?

…which I did. After a few false starts, I came up with the solution Chris has since pulled into his ergast-f1-api repo.

The pattern is quite a handy one, and I think reusable – I’ll give it a go for spinning up my own API to look up ONS Geography codes when I get a chance – so what’s involved?

The dockerised application is built from two components launched using docker-compose, a MySQL container and an Apache server configured with PHP5:

ergastdb:
  container_name: ergastdb
  build: ergastdb/
  environment:
    MYSQL_ROOT_PASSWORD: f1
    MYSQL_DATABASE: ergastdb
  expose:
    - "3306"

web:
  build: ./lamp
  #image: nimmis/apache-php5
  ports:
    - '8000:80'
  volumes:
    - ./webroot:/var/www/html
    - ./php/api:/php/api
    - ./logs:/var/log/apache2
  links:
    - ergastdb:ergastdb

The server runs the application, makes requests from the linked database container, and returns the result.

As part of my f1datajunkie tinkering, I’d put together a Dockerfile for populating a MySQL database with Chris’ database dump some time ago, so I could reuse that directly.

Which meant all I had to do was get the application up and running… Chris’ original instructions around the API server application were to place the application files into “the root directory” and also add in an Apache .htaccess URL rewrite, which he provided.

Simple… or maybe not..?! Not being an Apache user, it took me a bit of time to actually get things up and running, so here are some of the gotchas that caught me out:

  • where’s the “root directory”?
  • where should the .htaccess file go?

Running a couple of simple tests identified the root directory as the root for the files served by the webserver, and a quick search revealed the .htaccess file should go in the same location.

But the redirects didn’t work…

As a test, I wrote a simple rewrite rule that should have redirected any url to the same test file – but no joy…
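
The test rule was along these lines (a rough reconstruction rather than the exact rule, and the test filename is a placeholder):

#Write a simple .htaccess that rewrites every request to a single test file
cat > .htaccess << 'EOF'
RewriteEngine On
RewriteRule .* test.html [L]
EOF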

A bit more testing suggested I needed to enable the mod_rewrite plugin, which I did in the appropriate Dockerfile. But still no joy, because using patches like the .htaccess file in a server directory also requires allowing “Overrides” to the base Apache settings – which I did (via a crib) by rewriting the Apache config file, again via the Dockerfile that builds the server container image.

RUN sed -i '/<Directory \/var\/www\/>/,/<\/Directory>/ s/AllowOverride None/AllowOverride All/' /etc/apache2/apache2.conf

RUN a2enmod rewrite

Then the database connection didn’t work… But that was easily fixed: I’d forgotten to change the permissions in the appropriate application file.

Running the docker-compose file with the command docker-compose up --build -d fires up the two linked containers. The API is then available via a browser on http://localhost:8000/api/f1 using calls of the form http://localhost:8000/api/f1/2015.json.
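
A quick way to check the API is behaving without leaving the command line (python3 -m json.tool just pretty prints the response, if you have Python 3 to hand):

#Request the 2015 season data from the local API and pretty print the JSON response
curl -s http://localhost:8000/api/f1/2015.json | python3 -m json.tool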

So now I have a minimal working example of a PHP application powered by a MySQL database driving a simple web API. Which I can crib from for my own simple APIs…

As to how it could be improved, there are a couple of obvious things:

  • the containers are intended for personal, local use: the database is accessed as root and should be properly secured;
  • I haven’t got a pattern for updating the database when a new database image is released. The current workaround would be to destroy the containers, and the database volume, then rebuild the images and the containers.

PS to use the local ergast API server from R with my ergastR package, I needed to tweak the package to allow me to specify the path to the API server. I’ll post details of how I did that later…

Simple Text Analysis Using Python – Identifying Named Entities, Tagging, Fuzzy String Matching and Topic Modelling

Text processing is not really my thing, but here’s a round-up of some basic recipes that allow you to get started with some quick’n’dirty tricks for identifying named entities in a document, and tagging entities in documents.

In this post, I’ll briefly review some getting started code for:

  • performing simple entity extraction from a text; for example, when presented with a text document, label it with named entities (people, places, organisations); entity extraction is typically based on statistical models that rely on document features such as correct capitalisation of names to work correctly;
  • tagging documents that contain exact matches of specified terms: in this case, we have a list of specific text strings we are interested in (for example, names of people or companies) and we want to know if there are exact matches in the text document and where those matches occur in the document;
  • partial and fuzzy string matching of specified entities in a text: in this case, we may want to know whether something resembling a specified text string occurs in the document (for example, misspellings of a name);
  • topic modelling: the identification, using statistical models, of “topic terms” that appear across a set of documents.

You can find a gist containing a notebook that summarises the code here.

Simple named entity recognition

spaCy is a natural language processing library for Python that includes a basic model capable of recognising (ish!) names of people, places and organisations, as well as dates and financial amounts.

According to the spaCy entity recognition documentation, the built in model recognises the following types of entity:

  • PERSON People, including fictional.
  • NORP Nationalities or religious or political groups.
  • FACILITY Buildings, airports, highways, bridges, etc.
  • ORG Companies, agencies, institutions, etc.
  • GPE Countries, cities, states. (That is, Geo-Political Entities)
  • LOC Non-GPE locations, mountain ranges, bodies of water.
  • PRODUCT Objects, vehicles, foods, etc. (Not services.)
  • EVENT Named hurricanes, battles, wars, sports events, etc.
  • WORK_OF_ART Titles of books, songs, etc.
  • LANGUAGE Any named language.
  • LAW A legislation related entity(?)

Quantities are also recognised:

  • DATE Absolute or relative dates or periods.
  • TIME Times smaller than a day.
  • PERCENT Percentage, including “%”.
  • MONEY Monetary values, including unit.
  • QUANTITY Measurements, as of weight or distance.
  • ORDINAL “first”, “second”, etc.
  • CARDINAL Numerals that do not fall under another type.

Custom models can also be trained, but this requires annotated training documents.

#!pip3 install spacy
from spacy.en import English
parser = English()
example='''
That this House notes the announcement of 300 redundancies at the Nestlé manufacturing factories in York, Fawdon, Halifax and Girvan and that production of the Blue Riband bar will be transferred to Poland; acknowledges in the first three months of 2017 Nestlé achieved £21 billion in sales, a 0.4 per cent increase over the same period in 2016; further notes 156 of these job losses will be in York, a city that in the last six months has seen 2,000 job losses announced and has become the most inequitable city outside of the South East, and a further 110 jobs from Fawdon, Newcastle; recognises the losses come within a month of triggering Article 50, and as negotiations with the EU on the UK leaving the EU and the UK's future with the EU are commencing; further recognises the cost of importing products, including sugar, cocoa and production machinery, has risen due to the weakness of the pound and the uncertainty over the UK's future relationship with the single market and customs union; and calls on the Government to intervene and work with hon. Members, trades unions GMB and Unite and the company to avert these job losses now and prevent further job losses across Nestlé.
'''
#Code "borrowed" from somewhere?!
def entities(example, show=False):
    if show: print(example)
    parsedEx = parser(example)

    print("-------------- entities only ---------------")
    # if you just want the entities and nothing else, you can do access the parsed examples "ents" property like this:
    ents = list(parsedEx.ents)
    tags={}
    for entity in ents:
        #print(entity.label, entity.label_, ' '.join(t.orth_ for t in entity))
        term=' '.join(t.orth_ for t in entity)
        if term not in tags:
            tags[term]=[(entity.label, entity.label_)]
        else:
            tags[term].append((entity.label, entity.label_))
    print(tags)
entities(example)
-------------- entities only ---------------
{'House': [(380, 'ORG')], '300': [(393, 'CARDINAL')], 'Nestlé': [(380, 'ORG')], '\n York , Fawdon': [(381, 'GPE')], 'Halifax': [(381, 'GPE')], 'Girvan': [(381, 'GPE')], 'the Blue Riband': [(380, 'ORG')], 'Poland': [(381, 'GPE')], '\n': [(381, 'GPE'), (381, 'GPE')], 'the first three months of 2017': [(387, 'DATE')], '£ 21 billion': [(390, 'MONEY')], '0.4 per': [(390, 'MONEY')], 'the same period in 2016': [(387, 'DATE')], '156': [(393, 'CARDINAL')], 'York': [(381, 'GPE')], '\n the': [(381, 'GPE')], 'six': [(393, 'CARDINAL')], '2,000': [(393, 'CARDINAL')], 'the South East': [(382, 'LOC')], '110': [(393, 'CARDINAL')], 'Fawdon': [(381, 'GPE')], 'Newcastle': [(380, 'ORG')], 'a month of': [(387, 'DATE')], 'Article 50': [(21153, 'LAW')], 'EU': [(380, 'ORG')], 'UK': [(381, 'GPE')], 'GMB': [(380, 'ORG')], 'Unite': [(381, 'GPE')]}
q= "Bob Smith was in the Houses of Parliament the other day"
entities(q)
-------------- entities only ---------------
{'Bob Smith': [(377, 'PERSON')]}

Note that the way that models are trained typically relies on cues from the correct capitalisation of named entities.

entities(q.lower())
-------------- entities only ---------------
{}

polyglot

A simplistic, and quite slow, tagger, supporting limited recognition of Locations (I-LOC), Organizations (I-ORG) and Persons (I-PER).

#!pip3 install polyglot

##Mac ??
#!brew install icu4c
#I found I needed: pip3 install pyicu, pycld2, morfessor
##Linux
#apt-get install libicu-dev
!polyglot download embeddings2.en ner2.en
[polyglot_data] Downloading package embeddings2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
[polyglot_data] Downloading package ner2.en to
[polyglot_data]     /Users/ajh59/polyglot_data...
from polyglot.text import Text

text = Text(example)
text.entities
[I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Halifax']),
 I-LOC(['Girvan']),
 I-LOC(['Poland']),
 I-PER(['Nestlé']),
 I-LOC(['York']),
 I-LOC(['Fawdon']),
 I-LOC(['Newcastle']),
 I-ORG(['EU']),
 I-ORG(['EU']),
 I-ORG(['Government']),
 I-ORG(['GMB']),
 I-LOC(['Nestlé'])]
Text(q).entities
[I-PER(['Bob', 'Smith'])]

Partial Matching Specific Entities

Sometimes we may have a list of entities that we wish to match in a text. For example, suppose we have a list of MPs’ names, or a list of organisations or subject terms identified in a thesaurus, and we want to tag a set of documents with those entities if the entity exists in the document.

To do this, we can search a text for strings that exactly match any of the specified terms or where any of the specified terms match part of a longer string in the text.

Naive implementations can take a significant time to find multiple strings within a text, but the Aho-Corasick algorithm will efficiently match a large set of key values within a particular text.

## The following recipe was hinted at via @pudo

#!pip3 install pyahocorasick
#https://github.com/alephdata/aleph/blob/master/aleph/analyze/corasick_entity.py

First, construct an automaton that identifies the terms you want to detect in the target text.

from ahocorasick import Automaton

A=Automaton()
A.add_word("Europe",('VOCAB','Europe'))
A.add_word("European Union",('VOCAB','European Union'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson'))
A.add_word("Boris",('PERSON','Boris Johnson'))
A.add_word("Boris Johnson",('PERSON','Boris Johnson (LC)'))

A.make_automaton()
q2='Boris Johnson went off to Europe to complain about the European Union'
for item in A.iter(q2):
    print(item, q2[:item[0]+1])
(4, ('PERSON', 'Boris Johnson')) Boris
(12, ('PERSON', 'Boris Johnson')) Boris Johnson
(31, ('VOCAB', 'Europe')) Boris Johnson went off to Europe
(60, ('VOCAB', 'Europe')) Boris Johnson went off to Europe to complain about the Europe
(68, ('VOCAB', 'European Union')) Boris Johnson went off to Europe to complain about the European Union

Once again, case is important.

q2l = q2.lower()
for item in A.iter(q2l):
    print(item, q2l[:item[0]+1])
(12, ('PERSON', 'Boris Johnson (LC)')) boris johnson

We can tweak the automata patterns to capture the length of the string match term, so we can annotate the text with matches more exactly:

A=Automaton()
A.add_word("Europe",(('VOCAB', len("Europe")),'Europe'))
A.add_word("European Union",(('VOCAB', len("European Union")),'European Union'))
A.add_word("Boris Johnson",(('PERSON', len("Boris Johnson")),'Boris Johnson'))
A.add_word("Boris",(('PERSON', len("Boris")),'Boris Johnson'))

A.make_automaton()
for item in A.iter(q2):
    start=item[0]-item[1][0][1]+1
    end=item[0]+1
    print(item, '{}*{}*{}'.format(q2[start-3:start],q2[start:end],q2[end:end+3]))
(4, (('PERSON', 5), 'Boris Johnson')) *Boris* Jo
(12, (('PERSON', 13), 'Boris Johnson')) *Boris Johnson* we
(31, (('VOCAB', 6), 'Europe')) to *Europe* to
(60, (('VOCAB', 6), 'Europe')) he *Europe*an 
(68, (('VOCAB', 14), 'European Union')) he *European Union*

Fuzzy String Matching

Whilst the Aho-Corasick approach will return hits for strings in the text that partially match the exact match key terms, sometimes we want to know whether there are terms in a text that almost match terms in specific set of terms.

Imagine a situation where we have managed to extract arbitrary named entities from a text, but they do not match strings in a specified list in an exact or partially exact way. Our next step might be to attempt to further match those entities in a fuzzy way with entities in a specified list.

fuzzyset

The python fuzzyset package will try to match a specified string to similar strings in a list of target strings, returning a single item from a specified target list that best matches the provided term.

For example, if we extract the name Boris Johnstone in a text, we might then try to further match that string, in a fuzzy way, with a list of correctly spelled MP names.

A confidence value expresses the degree of match to terms in the fuzzy match set list.

import fuzzyset

fz = fuzzyset.FuzzySet()
#Create a list of terms we would like to match against in a fuzzy way
for l in ["Diane Abbott", "Boris Johnson"]:
    fz.add(l)

#Now see if our sample term fuzzy matches any of those specified terms
sample_term='Boris Johnstone'
fz.get(sample_term), fz.get('Diana Abbot'), fz.get('Joanna Lumley')
([(0.8666666666666667, 'Boris Johnson')],
 [(0.8333333333333334, 'Diane Abbott')],
 [(0.23076923076923073, 'Diane Abbott')])

fuzzywuzzy

If we want to try to find a fuzzy match for a term within a text, we can use the python fuzzywuzzy library. Once again, we specify a list of target items we want to try to match against.

from fuzzywuzzy import process
from fuzzywuzzy import fuzz
terms=['Houses of Parliament', 'Diane Abbott', 'Boris Johnson']

q= "Diane Abbott, Theresa May and Boris Johnstone were in the Houses of Parliament the other day"
process.extract(q,terms)
[('Houses of Parliament', 90), ('Diane Abbott', 90), ('Boris Johnson', 86)]

By default, we get match confidence levels for each term in the target match set, although we can limit the response to a maximum number of matches:

process.extract(q,terms,scorer=fuzz.partial_ratio, limit=2)
[('Houses of Parliament', 90), ('Boris Johnson', 85)]

A range of fuzzy match scoring algorithms are supported:

  • WRatio – measure of the sequences’ similarity between 0 and 100, using different algorithms
  • QRatio – Quick ratio comparison between two strings
  • UWRatio – a measure of the sequences’ similarity between 0 and 100, using different algorithms. Same as WRatio but preserving unicode
  • UQRatio – Unicode quick ratio
  • ratio
  • partial_ratio – ratio of the most similar substring as a number between 0 and 100
  • token_sort_ratio – a measure of the sequences’ similarity between 0 and 100 but sorting the token before comparing
  • partial_token_set_ratio
  • partial_token_sort_ratio – ratio of the most similar substring as a number between 0 and 100 but sorting the token before comparing

More usefully, perhaps, is to return items that match above a particular confidence level:

process.extractBests(q,terms,score_cutoff=90)
[('Houses of Parliament', 90), ('Diane Abbott', 90)]

However, one problem with the fuzzywuzzy matcher is that it doesn’t tell us where in the supplied text string the match occurred, or what string in the text was matched.

The fuzzywuzzy package can also be used to try to deduplicate a list of items, returning the longest item in the duplicate list. (It might be more useful if this is optionally the first item in the original list?)

names=['Diane Abbott', 'Boris Johnson','Boris Johnstone','Diana Abbot', 'Boris Johnston','Joanna Lumley']
process.dedupe(names, threshold=80)
['Joanna Lumley', 'Boris Johnstone', 'Diane Abbott']

It might also be useful to see the candidate strings associated with each deduped item, treating the first item in the list as the canonical one:

import hashlib

clusters={}
fuzzed=[]
for t in names:
    fuzzyset=process.extractBests(t,names,score_cutoff=85)
    #Generate a key based on the sorted members of the set
    keyvals=sorted(set([x[0] for x in fuzzyset]),key=lambda x:names.index(x),reverse=False)
    keytxt=''.join(keyvals)
    key=hashlib.md5(keytxt.encode('utf-8')).hexdigest() #md5 needs bytes, not a str, in Python 3
    if len(keyvals)>1 and key not in fuzzed:
        clusters[key]=sorted(set([x for x in fuzzyset]),key=lambda x:names.index(x[0]),reverse=False)
        fuzzed.append(key)
for cluster in clusters:
    print(clusters[cluster])
[('Diane Abbott', 100), ('Diana Abbot', 87)]
[('Boris Johnson', 100), ('Boris Johnstone', 93), ('Boris Johnston', 96)]

OpenRefine Clustering

As well as running as a browser accessed application, OpenRefine also runs as a service that can be accessed from Python using the refine-client.py client library.

In particular, we can use the OpenRefine service to cluster fuzzily matched items within a list of items.

#!pip install git+https://github.com/PaulMakepeace/refine-client-py.git
#NOTE - this requires a python 2 kernel
#Initialise the connection to the server using default or environment variable defined server settings
#REFINE_HOST = os.environ.get('OPENREFINE_HOST', os.environ.get('GOOGLE_REFINE_HOST', '127.0.0.1'))
#REFINE_PORT = os.environ.get('OPENREFINE_PORT', os.environ.get('GOOGLE_REFINE_PORT', '3333'))
from google.refine import refine, facet
server = refine.RefineServer()
orefine = refine.Refine(server)
#Create an example CSV file to load into a test OpenRefine project
project_file = 'simpledemo.csv'
with open(project_file,'w') as f:
    for t in ['Name']+names+['Boris Johnstone']:
        f.write(t+ '\n')
!cat {project_file}
Name
Diane Abbott
Boris Johnson
Boris Johnstone
Diana Abbot
Boris Johnston
Joanna Lumley
Boris Johnstone
p=orefine.new_project(project_file=project_file)
p.columns
[u'Name']

OpenRefine supports a range of clustering functions:

- clusterer_type: binning; function: fingerprint|metaphone3|cologne-phonetic
- clusterer_type: binning; function: ngram-fingerprint; params: {'ngram-size': INT}
- clusterer_type: knn; function: levenshtein|ppm; params: {'radius': FLOAT,'blocking-ngram-size': INT}
clusters=p.compute_clusters('Name',clusterer_type='binning',function='cologne-phonetic')
for cluster in clusters:
    print(cluster)
[{'count': 1, 'value': u'Diana Abbot'}, {'count': 1, 'value': u'Diane Abbott'}]
[{'count': 2, 'value': u'Boris Johnstone'}, {'count': 1, 'value': u'Boris Johnston'}]

Topic Models

Topic models are statistical models that attempt to categorise different “topics” that occur across a set of documents.

Several python libraries provide a simple interface for the generation of topic models from text contained in multiple documents.

gensim

#!pip3 install gensim
#https://github.com/sgsinclair/alta/blob/e5bc94f7898b3bcaf872069f164bc6534769925b/ipynb/TopicModelling.ipynb
from gensim import corpora, models

def get_lda_from_lists_of_words(lists_of_words, **kwargs):
    dictionary = corpora.Dictionary(lists_of_words) # this dictionary maps terms to integers
    corpus = [dictionary.doc2bow(text) for text in lists_of_words] # create a bag of words from each document
    tfidf = models.TfidfModel(corpus) # this models the significance of words using term frequency inverse document frequency
    corpus_tfidf = tfidf[corpus]
    kwargs["id2word"] = dictionary # set the dictionary
    return models.LdaModel(corpus_tfidf, **kwargs) # do the LDA topic modelling

def print_top_terms(lda, num_terms=10):
    txt=[]
    num_terms=min([num_terms,lda.num_topics])
    for i in range(0, num_terms):
        terms = [term for term,val in lda.show_topic(i,num_terms)]
        txt.append("\t - top {} terms for topic #{}: {}".format(num_terms,i,' '.join(terms)))
    return '\n'.join(txt)

To start with, let’s create a list of dummy documents and then generate word lists for each document.

docs=['The banks still have a lot to answer for the financial crisis.',
     'This MP and that Member of Parliament were both active in the debate.',
     'The companies that work in finance need to be responsible.',
     'There is a reponsibility incumber on all participants for high quality debate in Parliament.',
     'Corporate finance is a big responsibility.']

#Create lists of words from the text in each document
from nltk.tokenize import word_tokenize
docs = [ word_tokenize(doc.lower()) for doc in docs ]

#Remove stop words from the wordlists
from nltk.corpus import stopwords
docs = [ [word for word in doc if word not in stopwords.words('english') ] for doc in docs ]

Now we can generate the topic models from the list of word lists.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: parliament debate active
     - top 3 terms for topic #1: responsible work need
     - top 3 terms for topic #2: corporate big responsibility

The model is randomised – if we run it again we are likely to get a different result.

topicsLda = get_lda_from_lists_of_words([s for s in docs if isinstance(s,list)], num_topics=3, passes=20)
print( print_top_terms(topicsLda))
     - top 3 terms for topic #0: finance corporate responsibility
     - top 3 terms for topic #1: participants quality high
     - top 3 terms for topic #2: member mp active