OpenRefine Running in MyBinder, Several Ways…

Python packages such as the Jupyter Server Proxy allow you to use a Jupyter notebook server as a proxy for other services running in the same environment, such as a MyBinder container.

The jupyter-server-proxy package represents a generalisation of earlier demonstrations that showed how to proxy RStudio and OpenRefine in a MyBinder container.

In this post, I’ll serve up several ways of getting OpenRefine running in a MyBinder container. You can find all the examples in branches off psychemedia/jupyterserverproxy-openrefine.

Early original work on getting OpenRefine running in MyBinder was done by @betatim (betatim/openrefineder) using an earlier package, nbserverproxy; @yuvipanda helped me get my head round various bits of jupyterhub/jupyter-server-proxy/ which is key to proxying web services via Jupyter (jupyter-server-proxy docs). @manics provided the jupyter-server-proxy PR for handling predefined, rather than allocated, port mappings, which also made life much easier…

Common Installation Requirements

The following steps are pretty much common to all the recipes, and are responsible for installing OpenRefine and its dependencies.

First, in a binder/apt.txt file, the Java dependency:

openjdk-8-jre

A binder/postBuild step to install a specific version of OpenRefine:

#!/bin/bash
set -e

#Download and unpack a specific OpenRefine release into $HOME/.openrefine
VERSION=2.8
wget -q -O openrefine-$VERSION.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/$VERSION/openrefine-linux-$VERSION.tar.gz
mkdir -p $HOME/.openrefine
tar xzf openrefine-$VERSION.tar.gz -C $HOME/.openrefine
rm openrefine-$VERSION.tar.gz

#Create a working directory for OpenRefine project files
mkdir -p $HOME/openrefine

A binder/requirements.txt file to install the Python OpenRefine API client:

git+https://github.com/dbutlerdb/refine-client-py

Note that this is a fork of the original client that supports Python 3. It works with OpenRefine 2.8 but I’m not sure if it works properly with OpenRefine 3. There are multiple forks of the client and from what I can tell they are differently broken. It would be great if the OpenRefine repo took on one fork as the official client that everyone could contribute to.
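
As a quick sanity check that the client has installed into the MyBinder environment, something like the following should work from a notebook terminal (a minimal sketch, assuming the fork retains the upstream google.refine package namespace):

#Check the OpenRefine Python client imports correctly
#(assumes the fork keeps the original google.refine namespace)
python -c "from google.refine import refine; print(refine.__file__)"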

start definition – Autostarting headless OpenRefine Server


This start branch (repo) demonstrates:

  • using a binder/start file to auto start OpenRefine;
  • a notebook/client demo; this essentially runs in a headless mode.

The binder/start file extends the MyBinder start CMD to run the commands included in the file in addition to the default command that starts the Jupyter notebook server. (The binder/start file can also be used to run things like the setting of environment variables. I’m not sure how to make an environment variable defined in binder/postBuild available inside binder/start, though.)

#!/bin/bash

#Start OpenRefine in the background, using $HOME/openrefine as the project directory
OPENREFINE_DIR="$HOME/openrefine"
mkdir -p $OPENREFINE_DIR
nohup $HOME/.openrefine/openrefine-2.8/refine -p 3333 -d $OPENREFINE_DIR > /dev/null 2>&1 &

exec "$@"

In this demo you won’t be able to see the OpenRefine GUI; instead, you can access the server via its API using an OpenRefine Python client. An included notebook gives a worked example (note that at the moment you can’t run the first few parts of the demo because they assume the presence of a pre-existing OpenRefine project. Instructions appear further down the notebook for creating a project and working with it using the API client; I’ll do a separate post on the OpenRefine Python client at some point…)
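
The client is just a thin wrapper around OpenRefine’s HTTP command API, so you can also check that the headless server is up from a notebook terminal with a couple of curl calls (a minimal sketch, assuming the default port 3333 set in binder/start):

#Ask the headless OpenRefine server for its version...
curl -s http://localhost:3333/command/core/get-version
#...and list any projects it currently knows about
curl -s http://localhost:3333/command/core/get-all-project-metadata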

simpleproxy definition

The simpleproxy branch (repo) extends the start branch with a proxy that can be used to render the OpenRefine GUI.

The binder/requirements.txt needs an additional package — the jupyter-server-proxy package. (I’m using the repo version because at the time of writing the PyPI released version doesn’t include all the features we need…)

git+https://github.com/dbutlerdb/refine-client-py
git+https://github.com/jupyterhub/jupyter-server-proxy

If you launch the Binder, it uses the serverproxy to proxy the OpenRefine port to proxy/3333/; note that the trailing slash is important. Without it, the static files (CSS etc) required to render the page are not resolved correctly.

traitlet-nolab definition

The traitlet-nolab branch (repo) uses the traitlet method (docs) to add a menu option to the Jupyter notebook homepage that allows OpenRefine to be started and launched from the notebook home New menu.

OpenRefine will also be started automatically if you start the MyBinder container with ?urlpath=openrefine or navigate directly to http://MYBINDERURL/openrefine.
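
For reference, the MyBinder launch URL takes a form along the following lines, where USER, REPO and BRANCH are placeholders for the GitHub repository and branch being launched:

https://mybinder.org/v2/gh/USER/REPO/BRANCH?urlpath=openrefine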

Start on Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

In this case, the binder/start invocation is not required.

Once started, OpenRefine will appear on a named proxy path, openrefine (the slash may be omitted in this case).

The traitlet is defined in a jupyter_notebook_config.py file:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'title': 'OpenRefine'
        },
    },
}

This is copied into the correct location by an additional binder/postBuild step:

mkdir -p $HOME/.jupyter/

#Although located in binder/,
# this bash file runs in $HOME rather than $HOME/binder
mv jupyter_notebook_config.py $HOME/.jupyter/

The traitlet definition file is loaded in as a notebook server configuration file prior to starting the notebook server.

Note that the definition file uses the port: 3333 attribute to explicitly set the port that the server will be served against. If this is omitted, then a port will be dynamically allocated by the proxy server. In the case of OpenRefine, I am defining a port explicitly so that the Python API client can connect to it directly on the assumed default port 3333.

Note that if we try to use the Python client without first launching the OpenRefine server, the connection will fail: there is no running OpenRefine server for the client to connect to.

Python package setup definition

The setup branch (repo) demonstrates:

  • using serverproxy (setup definition (docs)) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment once again so that we can work with the client package using default port settings.

Start in Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

For this build, we go back to the base setup (no binder/start, no traitlet definition file) and add a setup.py file:

import setuptools

setuptools.setup(
  name="jupyter-openrefine-server",
  # py_modules rather than packages, since we only have 1 file
  py_modules=['openrefine'],
  entry_points={
      'jupyter_serverproxy_servers': [
          # name = packagename:function_name
          'openrefine = openrefine:setup_openrefine',
      ]
  },
)

This calls on an openrefine.py file to define the configuration:

import os

def setup_openrefine():
  path = os.path.join(os.environ['HOME'], 'openrefine')
  #Build the path to the refine launcher explicitly:
  #the command list is not run through a shell, so a literal $HOME would not be expanded
  refine_path = os.path.join(os.environ['HOME'], '.openrefine', 'openrefine-2.8', 'refine')
  return {
    'command': [refine_path, '-p', '{port}', '-d', path],
    'port': 3333,
    'launcher_entry': {
        'title': 'OpenRefine',
    },
  }

As before, an OpenRefine option is added to the start menu and can be used to start the OpenRefine server and launch the UI client on the path openrefine. (As we started the server on a known port, we can also find it explicitly at proxy/3333.)

Calling the aliased URL directly will also start the server. This means we can tell MyBinder to open on the openrefine path (or add ?urlpath=openrefine to the Binder URL) and the container will open into the OpenRefine application.

Once again, we need to launch the OpenRefine app before we can connect to it from the Python client.

master branch – traitlet definition, Notebook and JupyterLab Support

The master branch (repo) builds on the traitlet definition branch and demonstrates:

  • using serverproxy (traitlet definition) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment so that we can work with the client package.
  • a button is also enabled and added to the JupyterLab launcher.

OpenRefine can now be started and launched from the notebook homepage New menu or from the JupyterLab launcher, via a ?urlpath=openrefine MyBinder launch invocation, or by navigating directly to the proxied path openrefine.

Open to Notebook homepage: Binder

Open to OpenRefine: Binder

Open to Jupyterlab: Binder

In this case, we need to enable the JupyterLab extension with the following addition to the binder/postBuild file:

#Enable the OpenRefine icon in the JupyterLab desktop launcher
jupyter labextension install jupyterlab-server-proxy

This will enable a default start button in the JupyterLab launcher.

We can also provide an icon for the start button. Further modify the binder/postBuild file to copy the logo to a desired location:

#Although located in binder/, this bash file runs in $HOME rather than $HOME/binder
mv open-refine-logo.svg $HOME/.jupyter/

and modify the jupyter_notebook_config.py file by adding a path to the start logo, also ensuring that the launcher entry is enabled:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'enabled': True,
            'icon_path': '/home/jovyan/.jupyter/open-refine-logo.svg',
            'title': 'OpenRefine',
        },
    },
}

We should now see a start button for OpenRefine in the JupyterLab launcher.

Clicking on the button will autostart the server and open a browser tab onto the OpenRefine application GUI.

Summary

Running OpenRefine in MyBinder using JupyterServerProxy allows us to use OpenRefine as part of a shareable, on demand, serverless, Jupyter mediated workbench, defined via a public Github repository.

As well as being accessed as a GUI application, and via the Python API client, OpenRefine can also connect to a PostgreSQL server running inside the MyBinder container. For running PostgreSQL inside MyBinder, see Running a PostgreSQL Server in a MyBinder Container; for connecting OpenRefine to a Postgres server running in the same MyBinder container, see OpenRefine Database Connections in MyBinder.

OpenRefine Database Connections in MyBinder

With the version 3.0 release of OpenRefine last year, database integration was introduced that allows data to be imported into OpenRefine from a connected database, or exported to a downloadable SQL datadump. (It doesn’t look like you can save/export data to a new database table in the connected database, or upsert the contents of a cleaned table.) This was the release of the OpenRefine Database Import Extension and the SqlDump export mentioned in this earlier post.

If you want to try it out, I’ve created a MyBinder / repo2docker configuration repo that will launch a MyBinder repo containing both a running OpenRefine server and a running PostgreSQL server, although the test table is very small…

For how to run Postgres in a MyBinder container, see Running a PostgreSQL Server in a MyBinder Container.

Start in OpenRefine client: Binder

Details are:

  • Host: localhost
  • Port: 5432
  • User: testuser
  • Password: testpass
  • Database: testdb

There’s also a tiny seeded table in the database called quickdemo from which we can import data into OpenRefine:

I said it was a small table!
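
If you want to poke at the database directly, you can also query the seeded table from a notebook terminal, assuming the psql command line client is available in the container (PGPASSWORD avoids the interactive password prompt):

#Query the demo table using the connection details listed above
PGPASSWORD=testpass psql -h localhost -p 5432 -U testuser -d testdb -c 'SELECT * FROM quickdemo;'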

The rest of the db integration — SQL export — is described in the aforementioned post on OpenRefine’s SQL integration.

I have to admit I’m not sure what the workflow is? You’d typically want to put clean data into a database, rather than pull data from a database into OpenRefine for cleaning.

If you are using OpenRefine as a data cleaning tool, it would be useful to be able to export the data directly back into the connected database, either as an upserted table (as well as perhaps some row deletions) or as a new ..._clean table (“Upsert to database…”).

If you’re using OpenRefine as a data enrichment tool, being able to create a new, enriched table back in the connected database (“Export to database…”) would also make sense.

One of the things I’ll add to the to-do list is an example of how to export data from OpenRefine and then import it into the database using a simple Jupyter notebook script (a Jupyter notebook server is also running in the MyBinder container: just delete the openrefine/ path from the MyBinder URL).
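
In the meantime, here’s a rough command line sketch of the sort of thing such a step might do, assuming a CSV file has been exported from OpenRefine; the export.csv filename and quickdemo_clean table name are made up for the example:

#Create a table to receive the cleaned data (the columns are illustrative)
PGPASSWORD=testpass psql -h localhost -U testuser -d testdb -c 'CREATE TABLE quickdemo_clean (a INT, b VARCHAR(12), c INT);'
#Bulk load the CSV file exported from OpenRefine into the new table
PGPASSWORD=testpass psql -h localhost -U testuser -d testdb -c "\copy quickdemo_clean FROM 'export.csv' CSV HEADER"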

One of the new (to me) things I’ve spotted in OpenRefine 3 is the ability to export a Project Data Package. I mistakenly thought this might be something like a Frictionless Data data package format, but it looks to just be an export format for the OpenRefine project data? There are fields for import settings as well as descriptive metadata, but I don’t see any dialogues in the UI where you’d enter things like creator, contributors or description?

{
  "name": "clipboard",
  "tags": [],
  "created": "2019-02-09T22:13:42Z",
  "modified": "2019-02-09T22:14:31Z",
  "creator": "",
  "contributors": "",
  "subject": "",
  "description": "",
  "rowCount": 2,
  "title": "",
  "homepage": "",
  "image": "",
  "license": "",
  "version": "",
  "customMetadata": {},
  "importOptionMetadata": [
    {
      "guessCellValueTypes": false,
      "projectTags": [
        ""
      ],
      "ignoreLines": -1,
      "processQuotes": true,
      "fileSource": "(clipboard)",
      "encoding": "",
      "separator": ",",
      "storeBlankCellsAsNulls": true,
      "storeBlankRows": true,
      "skipDataLines": 0,
      "includeFileSources": false,
      "headerLines": 1,
      "limit": -1,
      "quoteCharacter": "\"",
      "projectName": "clipboard"
    }
  ],
}

One of the column operations you can perform in OpenRefine is to cast columns to text, dates or numerics, but I don’t think that is saved as metadata anywhere? You can also define column types in the SQL exporter, but again, I’m not sure that then becomes project metadata. It’d be good to see these things unified a bit, and framing such a process in terms of supporting a tabular data package (with things like column typing specified) could be useful.

Another foil for this might be supporting a SQLite export format?

I have to admit I’m a bit confused as to where OpenRefine sits in different workflows, particularly with data that is managed, and as such is most likely to be stored in some sort of database? (Lots of the OpenRefine tooling still harkens to a Linked Data future, so maybe it fits better in Linked Data workflows?). I also get the feeling that it shares a possible overlap with query engine tools such as Apache Drill, and maybe even document data extraction tools such as Apache Tika or Tabula. Again, seeing demonstrated toolchains and workflows in this area could be interesting.

Note to self: there are several other PDF table extractor tools out there alongside Tabula (Java) that I haven’t played with; eg R/pdftools, Python/Camelot and Python/pdfplumber.

Simon Willison is doing all sorts of useful stuff framing datasette as a datasette / SQLite ecosystem play. It could be useful to think a bit more about OpenRefine in terms of how it integrates with other data tools. For example, the X-to-sqlite tools help you start to structure variously formatted data sources in terms of a common SQLite representation, which can naturally incorporate things like column typing, but also the notion of database primary and foreign key columns. In a sense, OpenRefine provides a similar “import from anything-export to one format (CSV)” with a data cleaning step in the middle, but CSV is really informally structured in terms of its self-descriptive representation.

One of the insights I had when revising our TM351 relational database notebooks was that database table constraints can play a really useful role when helping clean a dataset by automatically identifying things that are wrong with it… I’ll maybe try to demonstrate an OpenRefine / Jupyter notebook hybrid workflow around that too…

By the by, I noticed this post the other day Exploring the dystopian future of a Javascript Gephi. Gephi, like OpenRefine, is a Java app, and like OpenRefine is one I’ve never been tempted to doodle with code wise for a couple of reasons: a) Java doesn’t appeal to me as a language; b) I don’t have a Java environment to hand, and the thought of trying to set up an environment, and all the build tools, as a novice, for a complex legacy project just leaves me cold. As the Gephi developers see it, “[w]e have to face it: the multiplatform is moving from Java to web technologies. Oracle wants a Java that powers backends, not a user interface framework.”

I’ve dabbled with OpenRefine off and on for years now, and while its browser accessibility is really handy, the docs could do with some attention (I guess that’s something I could make a positive contribution to). Also, if it was a Python, rather than Java, application, I’d be more comfortable with it and would possibly start to poke around inside it a bit….

I guess one of the things it can do (though I’ve never really had to push it) is scale with larger datasets, although the memory overhead may then become an issue? I think the R/Pandas crossover folk have been doing a lot of work on efficient datatable representations and scalable tabular data interchange formats, and I’m not sure if OpenRefine is/will draw on any of that work?

It’s also been some time since I looked at Workbench (indeed, I haven’t really looked at it since I posted an early review), but a quick peek at the repo shows a fair amount of activity. Maybe I should look at it again…?

How to Create a Simple Dockerfile for Building an OpenRefine Docker Image

Over the last few weeks, I’ve been exploring serving OpenRefine in various ways, such as on a vanilla Digital Ocean Linux server or using Docker, as well as using MyBinder (blog post to come…).

So picking up on the last post (OpenRefine on Digital Ocean using Docker), here’s a quick walkthrough of how we can go about creating a Dockerfile, the script used to create a Docker container, for OpenRefine.

First up, an annotated recipe for building OpenRefine from scratch from the current repo from Thad Guidry (via):

#Bring in a base container
#Alpine is quite lite, and we can get a build with JDK-8 already installed
FROM maven:3.6.0-jdk-8-alpine
MAINTAINER thadguidry@gmail.com

#We need to install git so we can clone the OpenRefine repo
RUN apk add --no-cache git

#Clone the current repo
RUN git clone https://github.com/OpenRefine/OpenRefine.git 

#Build the OpenRefine application
RUN OpenRefine/refine build

#Create a directory we can save OpenRefine user project files into
RUN mkdir /mnt/refine

#Mount a Docker volume against that directory.
#This means we can save data to another volume and persist it
#if we get rid of the current container.
VOLUME /mnt/refine

#Expose the OpenRefine server port outside the container
EXPOSE 3333

#Command to start the OpenRefine server when the container starts
CMD ["OpenRefine/refine", "-i", "0.0.0.0", "-d", "/mnt/refine"]

You can build the container from that Dockerfile by cd-ing into the same directory as the Dockerfile and running something like:

docker build -t psychemedia/openrefine .

The -t flag tags the image (that is, names it); the . says look to the current directory for the dockerfile.

You could then run the container using something like:

docker run --rm -d --name openrefine -p 3334:3333 psychemedia/openrefine

One of the disadvantages of the above build process is that it produces a container that still contains the build files, and the tooling required to build it, as well as the application files. This means that the container is larger than it needs to be. It’s also not quite a release?

I think we can also add RUN OpenRefine/refine dist RELEASEVERSION to then create a release, but there is a downside that this step will fail if a test fails.

We’d then have to tidy up a bit, which we could do with a multistage build. Simon Willison has written a really neat sketch around this on building smaller Python Docker images that provides a handy crib. In our case, we could FROM the same base container (or maybe a JRE, rather than JDK, populated version, if OpenRefine can run just with a JRE?) and copy across the distribution file created from the distribution build step; from that, we could then install the application.

So let’s go to that other extreme and look at a Dockerfile for building a container from a specific release/distribution.

The OpenRefine releases page lists all the OpenRefine releases. Looking at the download links for the Linux distribution, the URLs take the form:

https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz.

So how do we install an OpenRefine server from a distribution file?

#We can use the smaller JRE rather than the JDK
FROM openjdk:8-jre-alpine as builder

MAINTAINER tony.hirst@gmail.com

#Download a couple of required packages
RUN apk update && apk add --no-cache wget bash

#We can pass variables into the build process via --build-arg variables
#We name them inside the Dockerfile using ARG, optionally setting a default value
ARG RELEASE=3.1

#ENV vars are environment variables that get baked into the image
#We can pass an ARG value into a final image by assigning it to an ENV variable
ENV RELEASE=$RELEASE

#There's a handy discussion of ARG versus ENV here:
#https://vsupalov.com/docker-arg-vs-env/

#Download a distribution archive file
RUN wget --no-check-certificate https://github.com/OpenRefine/OpenRefine/releases/download/$RELEASE/openrefine-linux-$RELEASE.tar.gz

#Unpack the archive file and clear away the original download file
RUN tar -xzf openrefine-linux-$RELEASE.tar.gz  && rm openrefine-linux-$RELEASE.tar.gz

#Create an OpenRefine project directory
RUN mkdir /mnt/refine

#Mount a Docker volume against the project directory
VOLUME /mnt/refine

#Expose the server port
EXPOSE 3333

#Create the start command.
#Note that the application is in a directory named after the release
#We use the environment variable to set the path correctly
CMD openrefine-$RELEASE/refine -i 0.0.0.0 -d /mnt/refine

We can now build an image of the default version as baked into the Dockerfile:

docker build -t psychemedia/openrefinedemo .

Or we can build against a specific version:

docker build -t psychemedia/openrefinedemo --build-arg RELEASE=3.1-beta .

To peek inside the container, we run it and jump into a bash shell inside it:

docker run --rm -i -t psychemedia/openrefinedemo /bin/bash

We run the container as before:

docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo

Useful?

PS Note that when running an OpenRefine container on something like Digital Ocean using the default OpenRefine memory settings, you may have trouble starting OpenRefine on machines smaller than 3GB. (I’ve had some trouble getting it started on a 2GB server.)
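
If you do need to run on a smaller machine, one workaround is to override the container’s start command and pass the refine launcher a lower memory limit via its -m switch. A hedged sketch, assuming the image was built with the default RELEASE=3.1 so that the application lives in openrefine-3.1/:

#Run the container, capping the OpenRefine JVM heap at 1024M
docker run --rm -d -p 3333:3333 --name openrefine \
  psychemedia/openrefinedemo \
  openrefine-3.1/refine -i 0.0.0.0 -m 1024M -d /mnt/refine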

Running OpenRefine in the Clear on Digital Ocean

In a couple of earlier posts, I’ve described how to get OpenRefine up and running remotely over the web by installing the OpenRefine server onto a Digital Ocean Linux server and running it there behind a simple authenticating proxy (Running OpenRefine On Digital Ocean Using Simple Auth and the more automated Authenticated OpenRefine Server on Digital Ocean, Redux).

In this post I’ll show how to set up a simple OpenRefine server, without authentication, using Docker (I’ll show how to add in the authenticating nginx proxy in a follow on post).

Docker is a virtualisation technology that heavily draws on the idea of “containers”, isolated computational environments that provide just enough operating system to run a particular application within them.

As well as hosting raw Linux servers, Digital Ocean also provides Linux-servers-with-docker as a one-click application.

Here’s how to start a docker machine on Digital Ocean.

Creating a Digital Ocean Docker Droplet

First up, create a new droplet as a one-click app, selecting docker as the one-click application type:

To give ourselves some space to work with, I’m going to choose the 3GB server (it may work with default settings in a 2GB server, or it may ruin your day…). It’s metered by the hour, but it’ll still only cost a few pennies for a quick demo. (You can also get $100 free credit as a new user if you sign up here.)


Select a data center region (I typically go for a local one):

If you want to, add your SSH key (recipe here, but it’s not really necessary: the ssh key just makes it easier for you to login to the server from your own computer if you need to. If you haven’t heard of ssh keys before, ignore this step!)

Hit the big green button to create your droplet (if you want to, give the server a nicer hostname first…).

Accessing the Digital Ocean Droplet Server Terminal

Your one-click docker server will now start up. Once it’s there (it should take less than a minute) click through to its admin page. Assuming you haven’t added ssh keys, you’ll need to log in through the console. The login details for your server should have been emailed to the email address associated with your Digital Ocean account. Use them to log in.

On first login, you’ll be prompted to change the password (it was emailed to you in plain text after all!)

If you choose a really simple replacement password, you may need to choose another one. Also note that the (current) UNIX password was the one you were emailed, so you’ll essentially be providing this password twice in quick succession (once for the first login, then again to authorise the enforced password change). Copy and pasting the password into the console from your email should work…

Once you’ve changed your password, you’ll be logged out and you’ll have to log back in again with your new password. (Isn’t security a faff?! That’s why ssh keys…!;-)

Now you get to install and launch OpenRefine. I’ve got an example image here, and the recipe for creating it here, but you don’t really need to look at that if you trust me…

What you do need to do is run:

docker run -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo

What this command does is download and run the container psychemedia/openrefinedemo, naming it (purely for our convenience) as openrefine.

You can learn how to create an OpenRefine docker image here: How to Create a Simple Dockerfile for Building an OpenRefine Docker Image.

The -d flag runs the container in “detached”, standalone mode (in the background, essentially). The -p 3333:3333 is read as -p PUBLICPORT:INTERNALPORT. The OpenRefine server is started on INTERNALPORT=3333 and we’re also going to view it on public port 3333.

The container will take a few seconds to download if this is the first time you’ve called for it:

and then it’ll print out a long id number once it’s launched and running in the background.

(You can check it’s running by running the command docker ps.)

In both the terminal and the droplet admin pages, as well as the droplet status line in the current droplet listing pages, you should see the public IP address associated with the droplet. Copy that address into your browser and add the port mapping (:3333). You should now be able to see a running version of OpenRefine. (And so should anyone else who wanders by that URL:PORT combination…)

Let’s now move the application to another port. We could do this by launching another container with a new unique name (container names, when we assign them, need to be unique) and assigning it to another port. (The OpenRefine internal service port will remain the same.) For example:

docker run -d -p 3334:3333 --name openrefine2 psychemedia/openrefinedemo

This creates a new container running a fresh instance of the OpenRefine server. You should see it on IPADDRESS:3334.

(Alternatively, we can omit the name and a random one will be assigned, for example, docker run -d -p 3335:3333 psychemedia/openrefinedemo)

Note that the docker image does not need to be downloaded again. We simply reuse the one we downloaded previously, and spawn a new instance of it as a new container.

Each container does take up memory though, so kill the original container:

docker kill openrefine

and remove it:

docker rm openrefine.
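
Alternatively, you can force-remove a running container in a single step:

docker rm -f openrefine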

For a last quick demo, let’s create a new instance of the container, once again called openrefine (assuming we’ve removed the one previously called that) and run it on port 80, the default http port, which means we should be able to see it directly by going to just the IPADDRESS (with no port specified) in our browser:

docker run -d -p 80:3333 --name openrefine psychemedia/openrefinedemo

When you’re done, you can halt the droplet (in which case, you’ll keep on paying rent for it) or destroy it (which means you won’t be billed for any additional hours, or parts thereof, on top of the time you’ve already been running the droplet):

You don’t need to tidy up around the docker containers, they’ll die with the droplet.

So, not all that hard, is it? Probably a darn sight easier than trying to get anything out of your local IT unit?!

In the next post, I’ll show how to combine the container with another one containing nginx to provide some simple authentication. (There are lots of prebuilt containers out there we can just take “off-the-shelf”, and nginx is one of them.) I’ll maybe also have a look at how you might persist projects in hibernating container / droplet, perhaps look at how we might be able to upload files that OpenRefine can work on, and maybe even try to figure out a way to simply synch your project files from Digital Ocean to your own file storage location somewhere. Maybe…

PS third party nginx proxy example: https://github.com/beevelop/docker-nginx-basic-auth

Authenticated OpenRefine Server on Digital Ocean, Redux

Following on from yesterday’s recipe showing how to run OpenRefine on Digital Ocean using simple auth, here’s an even easier way that doesn’t require ssh and doesn’t require console access: just copy and paste some set-up info into a form on the Digital Ocean droplet creation page.

Every little everything helps… this link will set new users up with $100 of free credit on Digital Ocean; somewhere down the line I may get a small amount of affiliate link powered Digital Ocean credit to help keep my own server costs covered…

The Digital Ocean droplet creator has an option to add a User data start-up script when the droplet is created (docs).

This means we can provide a script that will download and install everything we need, including a file to define a service that will run OpenRefine.

Copy and paste the script in the gist that can be found here into the user data area before you create the droplet. The code will be executed once the droplet is up and running and should install the nginx proxy and the OpenRefine application. A default user (test) and password (letmein) are defined in the script.

If you are trusting of the gist, you can install it more succinctly from the gist repository. You can also define your own user and password. To take this installation route, use the code that follows below in the userdata area:

Here’s the code…:

#!/bin/bash

#We can over-ride the default credentials
USER_NAME=testuser
USER_PWD=testpwd

#Get the latest version of installer script and run it

source <( curl -s https://gist.githubusercontent.com/psychemedia/f256960c112347dd410c2beec8ce05e3/raw/ )

(There should be no spaces between the less than and open bracket characters in the source command.)

For an explicit link to a particular version of the script, go to the gist, and copy the link for the raw version of the latest file or the version you want.

Note that this route requires you to place considerable trust in the publisher of the remote installation script. Do you really trust a file with such a dodgy filename?

It will probably take two or three minutes to download and install everything, so don’t be too worried if you don’t see anything at the provided IP address immediately. Just keep refreshing the page…

The first sign of life you should see is the nginx default page…

Wait a few moments and reload the page… Next up is a holding page as the OpenRefine application is installed and the service started…

This page should automatically refresh itself every 5 seconds or so. When the service is up and running, you should get a page like this:

Click the link and you should be prompted with the user/password authenticator challenge. Provide the username/password and you should be taken to your OpenRefine server.

Running OpenRefine On Digital Ocean Using Simple Auth

As a cloud host, Digital Ocean provides a really easy way in to getting services up and running on the web.

Here’s a quick recipe for getting Open Refine up and running behind a simple authentication scheme.

Creating a Server With Digital Ocean

First up, create a Digital Ocean account if you don’t already have one (this link will get you started with $100 free credit).

Creating and launching a server is easy… Select Create then Droplet, choose the server type you want — let’s use a simple Ubuntu box — and choose the server size. For lots of quick tests, I use the smallest box, but from experience I think OpenRefine prefers more than 2GB.

Really… If you pick a 2GB server, you may find that OpenRefine hangs on start and ruins your day when you end up trying to debug what you think are other problems…. Be warned, kids… stay unfrustrated out there…

The servers are charged at a metered rate, and you can stop them any time, so for a quick test, it’ll cost you pennies… (A $100 credit can go a long way…!)

Next, choose a data center region; I generally pick a local one…

You also have the option of adding an ssh key. This makes life much easier when trying to log in to the server from your own machine using ssh (you can just run ssh root@IP_ADDRESS and it’ll log you straight in; there’s a recipe for setting up an SSH key here).

If you don’t want to set up an SSH, a root password will be emailed to you. You can use this password to log in to your server via a web terminal, which means you can do everything via a web UI if you need to…

Create your server by clicking the big green button…

It should only take a few seconds to start up… And when it has, you’ll be presented with its public IP address.

If you need a web terminal, click through on the server name, and you should see a link to launch a web console.

Installing OpenRefine

From the console, we can install all we need to run OpenRefine. This is a minimum viable example — we should probably find a better place to install OpenRefine, and may want to run it as a particular user with limited permissions. Working as root with everything wide open makes life easier, though not necessarily safer…!

OpenRefine requires a Java environment, so we need to install that:

apt-get update && apt-get install -y openjdk-8-jre

We can download the OpenRefine application via the command-line using a command of the form wget -q -O DOWNLOADED_FILE_NAME URL; the download link for each release can be found on the OpenRefine releases page:

wget -q -O openrefine-2.8.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/2.8/openrefine-linux-2.8.tar.gz

The downloaded file is provided as a tar archive file, which we need to unpack:

tar xzf openrefine-2.8.tar.gz

This will unbundle the files into the directory ./openrefine-2.8.

Let’s define an environment variable for a working directory in which to place the OpenRefine project files:

OPENREFINE_DIR="$HOME/openrefine"

And create that directory:

mkdir -p $OPENREFINE_DIR

You should now be able to run OpenRefine in the background using the command:

nohup openrefine-2.8/refine -p 3333 -d $OPENREFINE_DIR > /dev/null 2>&1 &

This will run OpenRefine on port 3333. If you copy the IP address of your Digital Ocean server, and go to http://IP.ADDRESS:3333, you should see OpenRefine running there. (Note that if you’re in Chrome, Google may well tell you that the address is dangerous…)

Adding Simple Authentication

The OpenRefine server is running as a public service on the public web. If you want to add a simple layer of authentication, you can add a web server proxy to the server that will prompt for a password when a new visit is made to the server.

One of the easiest proxies to get up and running is nginx. Let’s install it, along with a simple Apache toolkit that will help us create a simple password:

apt-get install -y nginx apache2-utils

Now create a simple user/password combination. Ever secure, I’ll go with user test and password letmein:

htpasswd -b -c /etc/nginx/.htpasswd test letmein

Now we need to define the proxy. A Digital Ocean tutorial (How To Install Nginx on Ubuntu 18.04) describes how to set up a firewall – I’m selecting the 'Nginx Full' option because I’m working via SSH, but if you’re working in the web terminal, the more restrictive 'Nginx HTTP' may be more appropriate:

sudo ufw allow 'Nginx Full'

If you try loading the OpenRefine server on port 3333, you should now find that it’s blocked: the firewall is only letting web traffic through on port 80.
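
You can check exactly what the firewall is letting through with:

sudo ufw status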

We now need to open access back up to the OpenRefine server, albeit via a password challenge. The following will create a default nginx configuration file that will expose the OpenRefine service running on port 3333 via default http port 80, mediated by a simple authorisation challenge:

config='''
server {
  listen 80;
  auth_basic Protected...;
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://127.0.0.1:3333;
  }
}
'''

echo "$config" > /etc/nginx/sites-available/default
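
It’s worth checking that the generated configuration file parses cleanly before reloading the proxy:

nginx -t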

Restart the nginx proxy to put the new settings into effect:

nginx -s reload

If you now go to http://IPADDRESS you should be presented with a challenge. Enter the credentials you defined, and you should see your OpenRefine server:

Finishing Up

When you’ve finished your session, you can destroy the droplet. This will tear the server down and you won’t be billed for it anymore.

Alternatively, you can switch the droplet off, but keep it in a shutdown state that you can restart in the future.

However, as the above prompt suggests, you will continue to be billed, even if the service is not running, because it is still consuming Digital Ocean resources.

Quick Notes – OpenRefine – Working With Databases

It seems as if the latest version of OpenRefine – v. 3.0 – is out in beta ([repo]) and it looks as if some integration with databases is provided.

On the way in, you can connect to a database and import data from it into OpenRefine directly:

Looks interesting, though part of me wonders “why?”: if the data is in the database then presumably you’ve started to clean it anyway. One of the things I’ve realised working on some revisions for our TM351 data management and analysis course is that you can use table constraints as part of a data cleaning workflow to flag up errors in your data. It could be interesting to imagine how constraint violation errors might be used to support data cleaning activities in an application such as OpenRefine. (In a sense, you might think of applying facets to a column as the application of rules or constraints that can identify rows containing data that violates them.)

On the way out is the option to export a project as a SQL file. This includes defining column types:

and also configuring other aspects of the SQL export file, such as exporting table definitions, either with or without the data, and optionally including DROP and IF EXISTS statements:

Here’s an example of the sort of thing it can generate:

DROP TABLE IF EXISTS clipboard;
CREATE TABLE clipboard (
a INT NULL,
b VARCHAR(12) NULL,
c INT NULL
);
INSERT INTO clipboard (a,b,c) VALUES
( 1,'e',1 ),
( 2,'w',3 ),
( 2,'tg',56 )

Going back to the notion of constraints, if an exportable SQL table is defined with particular column types, it would be handy to be able to use the type definitions to check the data being exported. For example, having set the types, apply those types as exclusive facet rules on the relevant columns and then raise an alert and/or preview the violating rows if any rows are returned that break the type rules.

If such checking was possible, you could start to imagine round tripping on the data, loading in a table definition and using it to check some aspects of a dataset loaded alongside it (such as data types, uniqueness etc.) Which reminds me, the SQL exporter doesn’t seem to support UNIQUE or PRIMARY KEY definitions?

IBM DataScientistWorkBench = OpenRefine + RStudio + Jupyter Notebooks in the Cloud, Via Your Browser

One of the many things on my “to do” list is to put together a blogged script that wires together RStudio, Jupyter notebook server, Shiny server, OpenRefine, PostgreSQL and MongoDB containers, and perhaps data extraction services like Apache Tika or Tabula and a few OpenRefine style reconciliation services, along with a common shared data container, so the whole lot can be launched on Digital Ocean at a single click to provide a data wrangling playspace with all sorts of application goodness to hand.

(Actually, I think I had a script that was more or less there for chunks of that when I was looking at a docker solution for the databases courses, but that fell by the wayside and I suspect the Jupyter container (IPython notebook server, as was) probably needs a fair bit of updating by now. And I’ve no time or mental energy to look at it right now…:-(

Anyway, the IBM Data Scientist Workbench now sits alongside things like KMi’s longstanding Crunch Learning Analytics Environment (RStudio + MySQL), and the Australian ResBaz Cloud – Containerised Research Apps Service in my list of why the heck can’t we get our act together to offer this sort of SaaS thing to learners? And yes I know there are cost implications…. but, erm, sponsorship, cough… get-started tokens then PAYG, cough…

It currently offers access to personal persistent storage and the ability to launch OpenRefine, RStudio and Jupyter notebooks:


The toolbar also suggests that the ability to “discover” pre-identified data sources and run pre-configured modeling tools is also on the cards.

The applications themselves run off a subdomain tied to your account – and of course, they’re all available through the browser…


So what’s next? I’d quite like to see ‘data import packs’ that would allow me to easily pull in data from particular sources, such as the CDRC, and quickly get started working with the data. (And again: yes, I know, I could start doing that anyway… maybe when I get round to actually doing something with isleofdata.com ?!;-)

See also these recipes for running app containers on Digital Ocean via Tutum: RStudio, Shiny server, OpenRefine and OpenRefine reconciliation services, and these Seven Ways of Running IPython / Jupyter Notebooks.

Lazy Regular Expressions – Splitting Out Collapsed Columns

Via a tweet, and then an email, to myself and fellow OpenRefine evangelist, Owen Stephens (if you haven’t already done so, check out Owen’s wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was having to contend with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report. So something rather like this:

TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project

The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?

The fields that appeared in this combined column were variable (not all of them appeared in each row), but always in the same order. So for example, a total collapsed record might look like:

Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:

The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project

The pattern appeared to be Column Name: value, except for the Title, where there was no colon.

On occasion, a row would contain an exceptional item that did not conform to the pattern:

ROGUE CODE

One way to split out the columns is to use a regular expression. We can parse a column using the “Add column based on this column” action.

If all the columns always appeared in the same order, we could write something like the following GREL regular expression to match each column and its associated value:

value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)


To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.

value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)


However, as the above example shows, using the greedy .* operator means we match everything in the first group. So instead, we need to use a lazy evaluation to match items within a group: .+?

value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)


So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we also need to match anything at the start of the cell that is not the expected first pattern in our list:

value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)


Having matched the groups, how do we split the relevant items into new columns? One way is to introduce a column separator character sequence (such as ::) that we can split on:

forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')


This generates rows in which the matched values for each group are joined into a single string, separated by the :: sequence.

We can now split these cells into several columns using OpenRefine’s Split into several columns… operation, with the :: sequence as the separator.

Once split, the columns should be regularly arranged. For “rogue” items, they should appear in the first new column – any values appearing in the column might be used to help us identify any further tweaks required to our regular expression.


We now need to do a little more cleaning. For example, tidying up the column names.

And then cleaning down each new column to remove the column heading.


As a general pattern, match the column name followed by an optional colon at the start of each cell and replace it with an empty string (note: the expression should be :? rather than :+).

To reuse this pattern of operations on future datasets, we can export a description of the transformations applied. Future datasets can then be loaded into OpenRefine, the operation history pasted in, and the same steps applied.

As ever, writing up this post took as long as working out the recipe…

PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp foo is up to that just at the moment. Any offers?!;-)

OpenRefine Style Reconciliation Containers

Over the weekend, I rediscovered Michael Bauer/@mihi_tr’s (RIP) Reconcile CSV [code] service that builds an OpenRefine reconciliation service on top of a CSV file. One column in the CSV file contains a list of values that you want to reconcile (that is, fuzzy match) against; the other contains a set of key identifier values associated with the matched-against values.

Having already popped OpenRefine into a docker container, I thought I’d also explore dockerising Michael’s service: docker-reconciliation.

The default container uses a CSV file of UK MP names (current and previous) and returns their full title and an identifier used in the UK Parliament Members’ names data platform.

To run service in boot2docker:

  • docker run -p 8002:8000 --name mprecon -d psychemedia/docker-reconciliation
  • boot2docker ip to get the IP address the service is running on, eg 192.168.59.103
  • Test the service in your browser: http://192.168.59.103:8002/reconcile?query=David Cameroon

In OpenRefine, set the reconciliation service URL to http://192.168.59.103:8002/reconcile.

NOTE: I had thought I should be able to fire up linked OpenRefine and ReconcileCSV containers and address the reconciliation service more conveniently by container name, for example:

docker run --name openrefiner -p 3335:3333 --link mprecon:mprecon -d psychemedia/openrefine

and then setting something like http://mprecon:8000/reconcile as the reconciliation service endpoint, but that didn’t seem to work? Instead I had to use the endpoint routed to host (http://192.168.59.103:8002/reconcile).

I also added some command line parameters to the script so that you can fire up the container and reconcile against your own CSV file:

docker run -p 8003:8000 -v /path/to/:/tmp/import -e RECONFILE=myfile.csv -e SEARCHCOL=mysearchcol -e IDCOL=myidcol --name recon_mycsv -d psychemedia/docker-reconciliation

This loads in the file on your host computer at /path/to/myfile.csv using the column named mysearchcol for the search/fuzzy match values and the column named myidcol for the identifiers.

It struck me that I could then commit this customised container as a docker image, and push it to dockerhub as a tagged image. Permissions mean I can’t push to the original trusted/managed repository that builds containers from my github repo, but I can create a new dockerhub repository containing tagged images. For example:

docker commit recon_mycsv psychemedia/docker-reconciler:recon_mycsv
docker push psychemedia/docker-reconciler:recon_mycsv

This means I can collect a whole range of reconciliation services, each independently tagged, at psychemedia/docker-reconciler – tags.

So for example:

  • docker run --name reconcile_ukmps -p 8008:8000 -d psychemedia/docker-reconciler:ukmps_pastpresent runs a reconciliation service against UK past and present MPs on port 8008;
  • docker run --name reconcile_westminster -p 8009:8000 -d psychemedia/docker-reconciler:westminster_constituency runs a reconciliation service against Westminster constituencies on port 8009.

In practice the current reconciliation service only seems to work well on small datasets, up to a few thousand lines, but nonetheless it can still be useful to be able to reconcile against such datasets. For larger files – such as the UK Companies House register, where we might use registered name for the search column and company number for the ID – it seems to take a while…! (For this latter example, a reconciliation service already exists at OpenCorporates.)

One problem with the approach I have taken is that the data file is mounted within the reconciliation server container. It would probably make more sense to have the ReconcileCSV container mount a data volume containing the CSV file, so that we can then upgrade the reconciliation server container once and then just link it to data containers. As it is, with the current model, we would have to rebuild each tagged image separately to update the reconciliation server they use.
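
As a rough sketch of what the data volume container pattern might look like here (the container name, CSV file and column names below are made up for the example):

#Create a bare container that just owns a volume at the path the reconciliation server expects
docker create -v /tmp/import --name recon_data_westminster busybox /bin/true
#Copy the reference CSV file into that volume
docker cp westminster.csv recon_data_westminster:/tmp/import/westminster.csv
#Run the reconciliation server against the data volume rather than a host directory mount
docker run -d -p 8010:8000 --volumes-from recon_data_westminster \
  -e RECONFILE=westminster.csv -e SEARCHCOL=constituency -e IDCOL=id \
  --name recon_westminster psychemedia/docker-reconciliation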

Unfortunately, I don’t know of an easy way to package up data volume containers (an issue I’ve also come up against with database data stores). What I’d like to be able to do is have a simple “docker datahub” that I could push data volumes to, and then be able to say something like docker run -d --volumes-from psychemedia/reconciliation-data:westminster_constituency --name recon_constituencies psychemedia/reconciliation. Here, --volumes-from would look up data volume containers on something like registry.datahub.docker.com and psychemedia/reconciliation from registry.hub.docker.com.

So where’s all this going, and what next? First up, it would be useful to have an official Dockerfile that builds Michael’s Reconcile CSV server. (It would also be nice to see an example of a Python based reconciliation server – because I think I might be able to hack around with that! [UPDATE – there is one here that I forked here and dockerised here]) Secondly, I really need to find a way through the portable data volume container mess. Am I missing something obvious? Thirdly, the reconciliation server needs a bit of optimisation so it can work with larger files, a fast fuzzy match of some sort. (I also wonder whether a lite reconciliation wrapper for PostgreSQL would be useful that can leverage the PostgreSQL backend and fuzzy search plugin to publish a reconciliation service?)

And what’s the payoff? The ability to quickly fire up multiple reconciliation services against reference CSV documents.