OpenRefine Running in MyBinder, Several Ways…

Python packages such as the Jupyter Server  Proxy allow you to use a Jupyter notebook server as a proxy for other services running in the same environment, such as a MyBinder container.

The jupyter-server-proxy package represents a generalisation of earlier demonstrations that showed how to proxy RStudio and OpenRefine in a MyBinder container.

In this post, I’ll serve up several ways of getting OpenRefine running in a MyBinder container. You can find all the examples in branches off psychemedia/jupyterserverproxy-openrefine.

Early original work on getting OpenRefine running in MyBinder was done by @betatim (betatim/openrefineder) using an earlier package, nbserverproxy; @yuvipanda helped me get my head round various bits of jupyterhub/jupyter-server-proxy/ which is key to proxying web services via Jupyter (jupyter-server-proxy docs). @manics provided the jupyter-server-proxy PR for handling predefined, rather than allocated, port mappings, which also made life much easier…

Common Installation Requirements

The following steps are pretty much common to all the recipes, and are responsibly for installing OpenRefine and its dependencies.

First, in a binder/apt.txt file, the Java dependency:

openjdk-8-jre

A binder/postBuild step to install a specific version of OpenRefine:

#!/bin/bash
set -e

VERSION=2.8
wget -q -O openrefine-$VERSION.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/$VERSION/openrefine-linux-$VERSION.tar.gz
mkdir -p $HOME/.openrefine
tar xzf openrefine-$VERSION.tar.gz -C $HOME/.openrefine
rm openrefine-$VERSION.tar.gz

mkdir -p $HOME/openrefine

A binder/requirements.txt file to install the Python OpenRefine API client:

git+https://github.com/dbutlerdb/refine-client-py

Note that this a fork of the original client that supports Python 3. It works with OpenRefine 2.8 but I’m not sure if it works properly with OpenRefine 3. There are multiple forks of the client and from what I can tell they are differently broken. It would be great of OpenRefine repo took on one fork as the official client that everyone could contribute to.

start definition – Autostarting headless OpenRefine Server

Binder

This start branch (repo) demonstrates:

  • using a binder/start file to auto start OpenRefine;
  • a notebook/client demo; this essentially runs in a headless mode.

The binder/start file extend the MyBinder start CMD to run the commands included in the file in addition to the default command to start the Jupyter notebook server. (The binder/start file can also be used to run things like the setting of environment variables. I’m not sure how to make available an environment variable defined in binder/postBuild inside binder/start?)

#!/bin/bash

#Start OpenRefine
OPENREFINE_DIR="$HOME/openrefine"
mkdir -p $OPENREFINE_DIR
nohup $HOME/.openrefine/openrefine-2.8/refine -p 3333 -d OPENREFINE_DIR > /dev/null 2>&1 &

exec "$@"

In this demo, you won’t be able to see the OpenRefine GUI using this demo. Instead, you can access it via its API using an OpenRefine python client. An included notebook gives a worked example (note that at the moment you can’t run the first few parts of the demo because they assume the presence of a pre-existing OpenRefine project. Instructions appear further down the notebook for creating a project and working with it using the API client; I’ll do a separate post on the OpenRefine Python client at some point…)

simpleproxy definition

The simpleproxy branch (repo) extends the start branch with a proxy that can be used to render the OpenRefine GUI.

The binder/requirements.txt needs an additional package — the jupyter-server-proxy package. (I’m using the repo version because at the time of writing the PyPi released version doesn’t include all the features we need…)

git+https://github.com/dbutlerdb/refine-client-py
git+https://github.com/jupyterhub/jupyter-server-proxy

If you launch the Binder, it uses the serverproxy to proxy the OpenRefine port to proxy/3333/; note that the trailing slash is important. Without it, the static files (CSS etc) required to render the page are not resolved correctly.

traitlet-nolab definition

The traitlet-nolab branch (repo) uses the traitlet method (docs) to add a menu option to the Jupyter notebook homepage that allows OpenRefine to be started and launched from the notebook home New menu.

OpenRefine will also be started automatically if you start the MyBinder container with ?urlpath=openrefine or navigate directly to http://MYBINDERURL/openrefine.

Start on Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

In this case, the binder/start invocation is not required.

Once started, OpenRefine will appear on a named proxy path, openrefine (the slash may be omitted in this case).

The traitlet is defined in a jupyter_notebook_config.py file:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'title': 'OpenRefine'
        },
    },
}

This is copied into the correct location by an additional binder/postBuild step:

mkdir -p $HOME/.jupyter/

#Although located in binder/,
# this bash file runs in $HOME rather than $HOME/binder
mv jupyter_notebook_config.py $HOME/.jupyter/

The traitlet definition file is loaded in as a notebook server configuration file prior to starting the notebook server.

Note that the definition file uses the port: 3333 attribute to explicitly set the port that the server will be served against. If this is omitted, then a port will be dynamically allocated by the proxy server. In the case of OpenRefine, I am defining a port explicitly so that the Python API client can connect to it directly on the assumed default port 3333.

Note that if we try to use the Python client without starting the OpenRefine server by launching it, the connection will fail because there will be no running OpenRefine server for the client to connect to.

Python package setup definition

The setup branch (repo) demonstrates:

  • using serverproxy (setup definition (docs)) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment once again so that we can work with the client package using default port settings.

Start in Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

For this build, we go back to the base setup (no binder/start, not traitlet definition files) and add a setup.py file:

import setuptools

setuptools.setup(
  name="jupyter-openrefine-server",
  # py_modules rather than packages, since we only have 1 file
  py_modules=['openrefine'],
  entry_points={
      'jupyter_serverproxy_servers': [
          # name = packagename:function_name
          'openrefine = openrefine:setup_openrefine',
      ]
  },
)

This calls on an openrefine.py file to define the configuration:

import os

def setup_openrefine():
  path = os.path.join(os.environ['HOME'], 'openrefine')
  return {
    'command': ['$HOME/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d',path],
    'port': 3333,
    'launcher_entry': {
        'title': 'OpenRefine',
    },
  }

As before, an OpenRefine option is added to the start menu and can be used to start the OpenRefine server and launch the UI client on the path openrefine. (As we started the server on a known port, can also find it explictly at proxy/3333.)

Calling the aliased URL directly will also start the server. This means we can tell MyBinder to open on the openrefine path (or add ?urlpath=openrefine to the Binder URL) and the container will open into the OpenRefine application.

Once again, we need to launch the OpenRefine app before we can connect to it from the Python client.

master branch – traitlet definition, Notebook and JupyterLab Support

The master branch (repo) builds on the traitlet definition branch and demonstrates:

  • using serverproxy (traitlet definition) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assigment so that we can work with the client package.
  • a button is also enabled and added to the JupyterLab launcher.

OpenRefine can now be started and launched from the notebook homepage New menu or from the JupyterLab launcher, via a ?urlpath=openrefine MyBinder luanch invocation, or by navigating directly to the proxied path openrefine.

Open to Notebook homepage: Binder

Open to OpenRefine: Binder

Open to Jupyterlab: Binder

In this case, we need to enable the JupyterLab extension with the following addition to the binder/postBuild file:

#Enable the OpenRefine icon in JuptyerLab desktop launcher
jupyter labextension install jupyterlab-server-proxy

This will enable a default start button in the JupyterLab launcher.

We can also provide an icon for the start button. Further modify the binder/postBuild file to copy the logo to a desired location:

#Although located in binder/,
mv open-refine-logo.svg $HOME/.jupyter/

and modify the jupyter_notebook_config.py by with the addition of a path to the start logo, also ensuring that the launcher entry is enabled:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'enabled': True,
            'icon_path': '/home/jovyan/.jupyter/open-refine-logo.svg',
            'title': 'OpenRefine',
        },
    },
}

We should now see a start button for OpenRefine in the JupyterLab launcher.

Clicking on the button will autostart the server an open a browser tab onto the OpenRefine application GUI.

Summary

Running OpenRefine in MyBinder using JupyterServerProxy allows us to use OpenRefine as part of a shareable, on demand, serverless, Jupyter mediated workbench, defined via a public Github repository.

As well as being access as a GUI application, and via the Python API client, OpenRefine can also connect to a PostgreSQL server running inside the MyBinder container. For running PostgreSQL inside MyBinder, see Running a PostgreSQL Server in a MyBinder Container; for connecting OpenRefine to a Postgres server running in the same MyBinder container, see OpenRefine Database Connections in MyBinder.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...