Running OpenRefine in the Clear on Digital Ocean

In a couple of earlier posts, I’ve described how to get OpenRefine up and running remotely over the web by installing the OpenRefine server onto a Digital Ocean Linux server and running it there behind a simple authenticating proxy(Running OpenRefine On Digital Ocean Using Simple Auth and the more automated Authenticated OpenRefine Server on Digital Ocean, Redux).

In this post I’ll show how to set up a simple OpenRefine server, without authentication, using Docker (I’ll show how to add in the authenicating nginx proxy in a follow on post).

Docker is a virtualisation technology that heavily draws on the idea of “containers”, isolated computational environments that provide just enough operating system to run a particular application within them.

As well as hosting raw Linux servers, Digital Ocean also provides Linux-servers-with-docker as a one-click application.

Here’s how to start a docker machine on Digital Ocean.

Creating a Digital Ocean Docker Droplet

First up, create a new droplet as a one-click app, selecting docker as the one-click application type:

To give ourselves some space to work with, I’m going to choose the 3GB server (it may work with default settings in a 2GB server, or it may ruin your day…). It’s metered by the hour, but it’ll still only cost a few pennies for a quick demo. (You can also get $100 free credit as a new user if you sign up here.)


Select a data center region (I typically go for a local one):

If you want to, add your SSH key (recipe here, but it’s not really necessary: the ssh key just makes it easier for you to login to the server from your own computer if you need to. If you haven’t heard of ssh keys before, ignore this step!)

Hit the big green button to create your droplet (if you want to, give the sever a nicer hostname first…).

Accessing the Digital Ocean Droplet Server Terminal

Your one-click docker server will now start up. Once its there (it should take less than a minute) click through to its admin page. Assuming you haven’t added ssh keys, you’ll need to log in through the console. The login details for your server should have been emailed to the email address associated with your Digital Ocean account. Use them to login.

On first login, you’ll be prompted to change the password (it was emailed to you in plain text after all!)

If you choose a really simple replacement password, you may need to choose another one. Also note that the (current) UNIX password was the one you were emailed, so you’ll essentially be providing this password twice in quick succession (once for the first login, then again to authorise the enforced password change). Copy and pasting the password into the console from your email should work…

Once you’ve changed your password, you’ll be logged out and you’ll have to log back in again with your new password. (Isn’t security a faff?! That’s why ssh keys…!;-)

Now you get to install and launch OpenRefine. I’ve got an example image here, and the recipe for creating it here, but you don’t need really need to look at that if you trust me…

What you do need to do is run:

docker run -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo

What this command does is download and run the container psychemedia/openrefinedemo, naming it (purely for our convenience) as openrefine.

You can learn how to create an OpenRefine docker image here: How to Create a Simple Dockerfile for Building an OpenRefine Docker Image.

The -d flags runs the container in “detached”, standalone mode (in the background, essentially). The -p 3333:3333 is read as -p PUBLICPORT:INTERNALPORT. The OpenRefine server is started on INTERNALPORT=3333 and we’re also going to view it on a URL port 3333.

The container will take a few seconds to download if this is the first time you’ve called for it:

and then it’ll a print out a long id number once it’s launched and running the background.

(You can check it’s running by running the command docker ps.)

In both the terminal and the droplet admin pages, as well as the droplet status line in the current droplet listing pages, you should see the public IP address associated with the droplet. Copy that address into your browser and add the port mapping (:3333). You should now be able to see a running version of OpenRefine. (And so should anyone else who wanders by that URL:PORT combination…)

Let’s now move the application to another port. We could do this by launching another container, with a new unique name (container names, when we assign them, need to be unique) and assigned to another port. (The OpenRefine internal service port will remain the same). For example:

docker run -d -p 3334:3333 --name openrefine2 psychemedia/openrefinedemo

This creates a new container running a fresh instance of the OpenRefine server. You should see it on IPADDRESS:3334.

(Alternatively we can omit the name and a random one will be assigned, for example, docker run -d -p 3335:3333)

Note that the docker image does not need to be downloaded again. We simply reuse the one we downloaded previously, and spawn a new instance of it as a new container.

Each container does take up memory though, so kill the original container:

docker kill openrefine

and remove it:

docker rm openrefine.

For a last quick demo, let’s create a new instance of the contain, once again called openrefine (assuming we’ve removed the one previously called that) and run it on port 80, the default http port, which means we should be able to see it directly by going to just the IPADDRESS (with no port specified) in our browser:

docker run -d -p 80:3333 --name openrefine psychemedia/openrefinedemo

When you’re done, you can halt the droplet (in which case, you’ll keep on paying rent for it) or destroy it (which means you won’t be billed for any additional hours, or parts thereof, on top of the time you’ve already been running the droplet):

You don’t need to tidy up around the docker containers, they’ll die with the droplet.

So, not all that hard, is it? Probably a darn sight easier than trying to get anything out of your local IT unit?!

In the next post, I’ll show how to combine the container with another one containing nginx to provide some simple authentication. (There are lots of prebuilt containers out there we can just take “off-the-shelf”, and nginx is one of them.) I’ll maybe also have a look at how you might persist projects in hibernating container / droplet, perhaps look at how we might be able to upload files that OpenRefine can work on, and maybe even try to figure out a way to simply synch your project files from Digital Ocean to your own file storage location somewhere. Maybe…

PS third party nginx proxy example:

How to Create a Simple Dockerfile for Building an OpenRefine Docker Image

Over the last few weeks, I’ve been exploring serving OpenRefine in a various ways, such as on a vanilla Digital Ocean Linux server or using Docker, as well as using MyBinder (blog post to come…).

So picking up on the last post (OpenRefine on Digital Ocean using Docker), here’s a quick walkthrough of how we can go about creating a Dockerfile, the script used to create a Docker container, for OpenRefine.

First up, an annotated recipe for building OpenRefine from scratch from the current repo from Thad Guidry (via):

#Bring in a base container
#Alpine is quite lite, and we can get a build with JDK-8 already installed
FROM maven:3.6.0-jdk-8-alpine

#We need to install git so we can clone the OpenRefine repo
RUN apk add --no-cache git

#Clone the current repo
RUN git clone 

#Build the OpenRefine application
RUN OpenRefine/refine build

#Create a directory we can save OpenRefine user project files into
RUN mkdir /mnt/refine

#Mount a Docker volume against that directory.
#This means we can save data to another volume and persist it
#if we get rid of the current container.
VOLUME /mnt/refine

#Expose the OpenRefine server port outside the container

#Command to start the OpenRefine server when the container starts
CMD ["OpenRefine/refine", "-i", "", "-d", "/mnt/refine"]

You can build the container from that Dockerfile by cding into the same directory as the Dockerfile and running something like:

docker build -t psychemedia/openrefine .

The -t flag tags the image (that is, names it); the . says look to the current directory for the dockerfile.

You could then run the container using something like:

docker build --rm -d --name openrefine -p 3334:3333 psychemedia/openrefine

One of the disadvantages of the above build process is that it produces a container that still contains the build files, and tooling required to build it, as well as the application files. This means that the container is larger than it need be. it’s also not quite a release?

I think we can also add RUN OpenRefine/refine dist RELEASEVERSION to then create a release, but there is a downside that this step will fail if a test fails.

We’d then have to tidy up a bit, which we could do with a multistage build. Simon Willison has written a really neat sketch around this on building smaller Python Docker images that provides a handy crib. In our case, we could FROM the same base container (or maybe a JRE, rather than JDK, populated version, if OpenRefine can run just with a JRE?) and copy across the distribution file create from the distribution build step; from that, we could then install the application.

So let’s go to that other extreme and look at a Dockerfile for building a container from a specific release/distribution.

The OpenRefine releases page lists all the OpenRefine releases. Looking at the download links for the the Linux distribution, the URLs take the form:$RELEASE/openrefine-linux-$RELEASE.tar.gz.

So how do we install an OpenRefine server from a distribution file?

#We can use the smaller JRE rather than the JDK
FROM openjdk:8-jre-alpine as builder


#Download a couple of required packages
RUN apk update && apk add --no-cache wget bash

#We can pass variables into the build process via --build-arg variables
#We name them inside the Dockerfile using ARG, optionally setting a default value

#ENV vars are environment variables that get baked into the image
#We can pass an ARG value into a final image by assigning it to an ENV variable

#There's a handy discussion of ARG versus ENV here:

#Download a distribution archive file
RUN wget --no-check-certificate$RELEASE/openrefine-linux-$RELEASE.tar.gz

#Unpack the archive file and clear away the original download file
RUN tar -xzf openrefine-linux-$RELEASE.tar.gz  && rm openrefine-linux-$RELEASE.tar.gz

#Create an OpenRefine project directory
RUN mkdir /mnt/refine

#Mount a Docker volume against the project directory
VOLUME /mnt/refine

#Expose the server port

#Create the state command.
#Note that the application is in a directory named after the release
#We use the environment variable to set the path correctly
CMD openrefine-$RELEASE/refine -i -d /mnt/refine

We can now build an image of the default version as baked into the Dockerfile:

docker build -t psychemedia/openrefinedemo .

Or we can build against a specific version:

docker build -t psychemedia/openrefinedemo --build-arg RELEASE=3.1-beta .

To peek inside the container, we run it and jump into a bash shell inside it:

docker run --rm -i -t psychemedia/openrefinedemo /bin/bash

We run the container as before:

docker run --rm -d -p 3333:3333 --name openrefine psychemedia/openrefinedemo


PS Note that when running an OpenRefine container on something like Digital Ocean using the default OpenRefine memory settings, you may have trouble starting OpenRefine on machines smaller that 3GB. (I’ve had some trouble getting it started on a 2GB server.)

OpenRefine Database Connections in MyBinder

With the version 3.0 release of OpenRefine last year, database integration was introduced that allows data to be imported into OpenRefine from a connected database, or exported to a downloadable SQL datadump. (It doesn’t look like you can save/export data to a new database table in the connected database, or upsert the contents of a cleaned table). This was the release of the  OpenRefine Database Import Extension and the SqlDump export mentioned in this earlier post.

If you want to try it out, I’ve created a MyBinder / repo2docker configuration repo that will launch a MyBinder repo containing both a running OpenRefine server and a running PostgreSQL server, although the test table is very small…

For how to run Postgres in a MyBinder container, see Running a PostgreSQL Server in a MyBinder Container.

Start in OpenRefine client: Binder

Details are:

  • host: localhost
  • Port: 5432
  • User: testuser
  • Password: testpass
  • Database: testdb

There’s also a tiny seeded table in the database called quickdemo from which we can import data into OpenRefine:

I said it was a small table!

The rest of the db integration — SQL export — is described in the aforementioned post on OpenRefine’s SQL integration.

I have to admit I’m not sure what the workflow is? You’d typically want to put clean data into a database, rather than pull data from a database into OpenRefine for cleaning.

If you are using OpenRefine as a data cleaning tool, it would be useful to be able to export the data directly back into the connected database, either as an upserted table (as well as perhaps some row deletions) or as a new ..._clean table (“Upsert to database…”).

If you’re using OpenRefine as a data enrichment tool, being able to create a new, enriched table back in the connected database (“Export to database…”) would also make sense.

One of the things I’ll add to the to-do list is an example of how to export data from OpenRefine and then import it into the database using a simple Jupyter notebook script (a Jupyter notebook server is also running in the  MyBinder container (just delete the openrefine/ path from the MyBinder URL).

One of the new (to me) things I’ve spotted in OpenRefine 3 is the ability to export a Project Data Package. I mistakenly thought this might be something like a Frictionless Data data package format, but it looks to just be an export format for the OpenRefine project data? There are fields for import settings as well as descriptive metadata, but I don’t see any dialogues in the UI where you’d enter things like creator, contributors or description?

  "name": "clipboard",
  "tags": [],
  "created": "2019-02-09T22:13:42Z",
  "modified": "2019-02-09T22:14:31Z",
  "creator": "",
  "contributors": "",
  "subject": "",
  "description": "",
  "rowCount": 2,
  "title": "",
  "homepage": "",
  "image": "",
  "license": "",
  "version": "",
  "customMetadata": {},
  "importOptionMetadata": [
      "guessCellValueTypes": false,
      "projectTags": [
      "ignoreLines": -1,
      "processQuotes": true,
      "fileSource": "(clipboard)",
      "encoding": "",
      "separator": ",",
      "storeBlankCellsAsNulls": true,
      "storeBlankRows": true,
      "skipDataLines": 0,
      "includeFileSources": false,
      "headerLines": 1,
      "limit": -1,
      "quoteCharacter": "\"",
      "projectName": "clipboard"

One of the column operations you can perform inOpenRefine is to cast columns to text, dates or numerics, but I don’t think that is saved as metadata anywhere? You can also define column types in the SQL exporter, but again, I’m not sure that then becomes project metadata. It’d be good to see these things unified a bit, and framing such a process in terms of supporting a tabular data package (with things like column typing specified) could be useful.

Another foil for this might be supporting a SQLite export format?

I have to admit I’m a bit confused as to how OpenRefine sits where in different workflows, particularly with data that is managed, and as such is most likely to be stored in some sort of database? (Lots of the OpenRefine tooling still harkens to a Linked Data future, so maybe it fits better in Linked Data workflows?). I also get the feeling that it shares a possible overlap with query engine tools such as Apache Drill, and maybe even document data extraction tools such as Apache Tika or Tabula. Again, seeing demonstrated toolchains and workflows in this area could be interesting.

Note to self: there are several other PDF table extractor tools out there alongside Tabula (Java) that I haven’t played with; eg R/pdftools, Python/Camelot and Python/pdfplumber.

Simon Willison is doing all sorts of useful stuff framing datasette as a datasette / SQLite ecosystem play. It could be useful to think a bit more about OpenRefine in terms of how it integrates with other data tools. For example, the X-to-sqlite tools help you start to structure variously formatted data sources in terms of a common SQLite representation, which can naturally incorporate things like column typing, but also the notion of database primary and foreign key columns. In a sense, OpenRefine provides a similar “import from anything-export to one format (CSV)” with a data cleaning step in the middle, but CSV is really informally structured in terms of its self-descriptive representation.

One of the insights I had when revising our TM351 relational database notebooks was that database table constraints can play a really useful role when helping clean a dataset by automatically identifying things that are wrong with it… I’ll maybe try to demonstrate an OpenRefine / Jupyter notebook hybrid workflow around that too…

By the by, I noticed this post the other day Exploring the dystopian future of a Javascript Gephi. Gephi, like OpenRefine, is a Java app, and like OpenRefine is one I’ve never been tempted to doodle with code wise for a couple of reasons: a) Java doesn’t appeal to me as a language; b) I don’t have a Java environment to hand, and the thought of trying to set up an environment, and all the build tools, as a novice, for a complex legacy project just leaves me cold. As the Gephi developers see it, “[w]e have to face it: the multiplatform is moving from Java to web technologies. Oracle wants a Java that powers backends, not a user interface framework.”

I’ve dabbled with OpenRefine off and on for years now, and whiles its browser accessibility is really handy, the docs could do with some attention (I guess that’s something I could make a positive contribution to). Also, if it was a Python, rather than Java, application, I’d be more comfortable with it and would possibly start to poke around inside it a bit….

I guess one of the things can do (though I’ve never really had to push it) is scale with larger datasets, although the memory overhead may then become an issue? I the the R/Pandas crossover folk have been doing a lot of work on efficient datatable representations and scaleable tabular data interchange formats, and I’m not sure if OpenRefine is/will draw on any of that work?

It’s also been some time since I looked at Workbench, (indeed, I haven’t really looked at it since I posted an early review), but a quick peek at the repo shows a fair amount of activity. Maybe I should look at it again…?

OpenRefine Running in MyBinder, Several Ways…

Python packages such as the Jupyter Server  Proxy allow you to use a Jupyter notebook server as a proxy for other services running in the same environment, such as a MyBinder container.

The jupyter-server-proxy package represents a generalisation of earlier demonstrations that showed how to proxy RStudio and OpenRefine in a MyBinder container.

In this post, I’ll serve up several ways of getting OpenRefine running in a MyBinder container. You can find all the examples in branches off psychemedia/jupyterserverproxy-openrefine.

Early original work on getting OpenRefine running in MyBinder was done by @betatim (betatim/openrefineder) using an earlier package, nbserverproxy; @yuvipanda helped me get my head round various bits of jupyterhub/jupyter-server-proxy/ which is key to proxying web services via Jupyter (jupyter-server-proxy docs). @manics provided the jupyter-server-proxy PR for handling predefined, rather than allocated, port mappings, which also made life much easier…

Common Installation Requirements

The following steps are pretty much common to all the recipes, and are responsibly for installing OpenRefine and its dependencies.

First, in a binder/apt.txt file, the Java dependency:


A binder/postBuild step to install a specific version of OpenRefine:

set -e

wget -q -O openrefine-$VERSION.tar.gz$VERSION/openrefine-linux-$VERSION.tar.gz
mkdir -p $HOME/.openrefine
tar xzf openrefine-$VERSION.tar.gz -C $HOME/.openrefine
rm openrefine-$VERSION.tar.gz

mkdir -p $HOME/openrefine

A binder/requirements.txt file to install the Python OpenRefine API client:


Note that this a fork of the original client that supports Python 3. It works with OpenRefine 2.8 but I’m not sure if it works properly with OpenRefine 3. There are multiple forks of the client and from what I can tell they are differently broken. It would be great of OpenRefine repo took on one fork as the official client that everyone could contribute to.

start definition – Autostarting headless OpenRefine Server


This start branch (repo) demonstrates:

  • using a binder/start file to auto start OpenRefine;
  • a notebook/client demo; this essentially runs in a headless mode.

The binder/start file extend the MyBinder start CMD to run the commands included in the file in addition to the default command to start the Jupyter notebook server. (The binder/start file can also be used to run things like the setting of environment variables. I’m not sure how to make available an environment variable defined in binder/postBuild inside binder/start?)


#Start OpenRefine
nohup $HOME/.openrefine/openrefine-2.8/refine -p 3333 -d OPENREFINE_DIR > /dev/null 2>&1 &

exec "$@"

In this demo, you won’t be able to see the OpenRefine GUI using this demo. Instead, you can access it via its API using an OpenRefine python client. An included notebook gives a worked example (note that at the moment you can’t run the first few parts of the demo because they assume the presence of a pre-existing OpenRefine project. Instructions appear further down the notebook for creating a project and working with it using the API client; I’ll do a separate post on the OpenRefine Python client at some point…)

simpleproxy definition

The simpleproxy branch (repo) extends the start branch with a proxy that can be used to render the OpenRefine GUI.

The binder/requirements.txt needs an additional package — the jupyter-server-proxy package. (I’m using the repo version because at the time of writing the PyPi released version doesn’t include all the features we need…)


If you launch the Binder, it uses the serverproxy to proxy the OpenRefine port to proxy/3333/; note that the trailing slash is important. Without it, the static files (CSS etc) required to render the page are not resolved correctly.

traitlet-nolab definition

The traitlet-nolab branch (repo) uses the traitlet method (docs) to add a menu option to the Jupyter notebook homepage that allows OpenRefine to be started and launched from the notebook home New menu.

OpenRefine will also be started automatically if you start the MyBinder container with ?urlpath=openrefine or navigate directly to http://MYBINDERURL/openrefine.

Start on Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

In this case, the binder/start invocation is not required.

Once started, OpenRefine will appear on a named proxy path, openrefine (the slash may be omitted in this case).

The traitlet is defined in a file:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'title': 'OpenRefine'

This is copied into the correct location by an additional binder/postBuild step:

mkdir -p $HOME/.jupyter/

#Although located in binder/,
# this bash file runs in $HOME rather than $HOME/binder
mv $HOME/.jupyter/

The traitlet definition file is loaded in as a notebook server configuration file prior to starting the notebook server.

Note that the definition file uses the port: 3333 attribute to explicitly set the port that the server will be served against. If this is omitted, then a port will be dynamically allocated by the proxy server. In the case of OpenRefine, I am defining a port explicitly so that the Python API client can connect to it directly on the assumed default port 3333.

Note that if we try to use the Python client without starting the OpenRefine server by launching it, the connection will fail because there will be no running OpenRefine server for the client to connect to.

Python package setup definition

The setup branch (repo) demonstrates:

  • using serverproxy (setup definition (docs)) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment once again so that we can work with the client package using default port settings.

Start in Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

For this build, we go back to the base setup (no binder/start, not traitlet definition files) and add a file:

import setuptools

  # py_modules rather than packages, since we only have 1 file
      'jupyter_serverproxy_servers': [
          # name = packagename:function_name
          'openrefine = openrefine:setup_openrefine',

This calls on an file to define the configuration:

import os

def setup_openrefine():
  path = os.path.join(os.environ['HOME'], 'openrefine')
  return {
    'command': ['$HOME/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d',path],
    'port': 3333,
    'launcher_entry': {
        'title': 'OpenRefine',

As before, an OpenRefine option is added to the start menu and can be used to start the OpenRefine server and launch the UI client on the path openrefine. (As we started the server on a known port, can also find it explictly at proxy/3333.)

Calling the aliased URL directly will also start the server. This means we can tell MyBinder to open on the openrefine path (or add ?urlpath=openrefine to the Binder URL) and the container will open into the OpenRefine application.

Once again, we need to launch the OpenRefine app before we can connect to it from the Python client.

master branch – traitlet definition, Notebook and JupyterLab Support

The master branch (repo) builds on the traitlet definition branch and demonstrates:

  • using serverproxy (traitlet definition) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assigment so that we can work with the client package.
  • a button is also enabled and added to the JupyterLab launcher.

OpenRefine can now be started and launched from the notebook homepage New menu or from the JupyterLab launcher, via a ?urlpath=openrefine MyBinder luanch invocation, or by navigating directly to the proxied path openrefine.

Open to Notebook homepage: Binder

Open to OpenRefine: Binder

Open to Jupyterlab: Binder

In this case, we need to enable the JupyterLab extension with the following addition to the binder/postBuild file:

#Enable the OpenRefine icon in JuptyerLab desktop launcher
jupyter labextension install jupyterlab-server-proxy

This will enable a default start button in the JupyterLab launcher.

We can also provide an icon for the start button. Further modify the binder/postBuild file to copy the logo to a desired location:

#Although located in binder/,
mv open-refine-logo.svg $HOME/.jupyter/

and modify the by with the addition of a path to the start logo, also ensuring that the launcher entry is enabled:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'enabled': True,
            'icon_path': '/home/jovyan/.jupyter/open-refine-logo.svg',
            'title': 'OpenRefine',

We should now see a start button for OpenRefine in the JupyterLab launcher.

Clicking on the button will autostart the server an open a browser tab onto the OpenRefine application GUI.


Running OpenRefine in MyBinder using JupyterServerProxy allows us to use OpenRefine as part of a shareable, on demand, serverless, Jupyter mediated workbench, defined via a public Github repository.

As well as being access as a GUI application, and via the Python API client, OpenRefine can also connect to a PostgreSQL server running inside the MyBinder container. For running PostgreSQL inside MyBinder, see Running a PostgreSQL Server in a MyBinder Container; for connecting OpenRefine to a Postgres server running in the same MyBinder container, see OpenRefine Database Connections in MyBinder.