OpenRefine Database Connections in MyBinder

With the version 3.0 release of OpenRefine last year, database integration was introduced that allows data to be imported into OpenRefine from a connected database, or exported as a downloadable SQL data dump. (It doesn’t look like you can save/export data to a new table in the connected database, or upsert the contents of a cleaned table.) This was the release of the OpenRefine Database Import Extension and the SqlDump exporter mentioned in this earlier post.

If you want to try it out, I’ve created a MyBinder / repo2docker configuration repo that will launch a MyBinder container running both an OpenRefine server and a PostgreSQL server, although the test table is very small…

For how to run Postgres in a MyBinder container, see Running a PostgreSQL Server in a MyBinder Container.

Start in OpenRefine client: Binder

Details are:

  • Host: localhost
  • Port: 5432
  • User: testuser
  • Password: testpass
  • Database: testdb

There’s also a tiny seeded table in the database called quickdemo from which we can import data into OpenRefine:

I said it was a small table!
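
For reference, you can sanity-check the connection details from the Jupyter notebook server that’s also running in the container. A minimal sketch, assuming the psycopg2 package is available in the notebook environment:

# Minimal connection check against the seeded database (assumes psycopg2 is installed)
import psycopg2

conn = psycopg2.connect(host='localhost', port=5432,
                        user='testuser', password='testpass',
                        dbname='testdb')

with conn, conn.cursor() as cursor:
    # Pull back the contents of the tiny demo table
    cursor.execute('SELECT * FROM quickdemo;')
    for row in cursor.fetchall():
        print(row)

conn.close()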

The rest of the db integration — SQL export — is described in the aforementioned post on OpenRefine’s SQL integration.

I have to admit I’m not sure what the workflow is? You’d typically want to put clean data into a database, rather than pull data from a database into OpenRefine for cleaning.

If you are using OpenRefine as a data cleaning tool, it would be useful to be able to export the data directly back into the connected database, either as an upserted table (as well as perhaps some row deletions) or as a new ..._clean table (“Upsert to database…”).

If you’re using OpenRefine as a data enrichment tool, being able to create a new, enriched table back in the connected database (“Export to database…”) would also make sense.

One of the things I’ll add to the to-do list is an example of how to export data from OpenRefine and then import it into the database using a simple Jupyter notebook script (a Jupyter notebook server is also running in the MyBinder container; just delete the openrefine/ path from the MyBinder URL).
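
By way of a placeholder, here’s the sort of thing I have in mind, sketched using pandas and SQLAlchemy (both assumed to be installed in the notebook environment): take a CSV file exported from OpenRefine and write it back into the connected database as a new table.

# Push an OpenRefine CSV export back into the Postgres database as a new table
# (assumes pandas, sqlalchemy and a Postgres driver such as psycopg2 are installed)
import pandas as pd
from sqlalchemy import create_engine

# CSV file previously exported from OpenRefine (hypothetical filename)
df = pd.read_csv('quickdemo-cleaned.csv')

engine = create_engine('postgresql://testuser:testpass@localhost:5432/testdb')

# Write the cleaned data to a new ..._clean table, replacing it if it already exists
df.to_sql('quickdemo_clean', engine, if_exists='replace', index=False)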

One of the new (to me) things I’ve spotted in OpenRefine 3 is the ability to export a Project Data Package. I mistakenly thought this might be something like a Frictionless Data data package format, but it looks to just be an export format for the OpenRefine project data? There are fields for import settings as well as descriptive metadata, but I don’t see any dialogues in the UI where you’d enter things like creator, contributors or description?

{
  "name": "clipboard",
  "tags": [],
  "created": "2019-02-09T22:13:42Z",
  "modified": "2019-02-09T22:14:31Z",
  "creator": "",
  "contributors": "",
  "subject": "",
  "description": "",
  "rowCount": 2,
  "title": "",
  "homepage": "",
  "image": "",
  "license": "",
  "version": "",
  "customMetadata": {},
  "importOptionMetadata": [
    {
      "guessCellValueTypes": false,
      "projectTags": [
        ""
      ],
      "ignoreLines": -1,
      "processQuotes": true,
      "fileSource": "(clipboard)",
      "encoding": "",
      "separator": ",",
      "storeBlankCellsAsNulls": true,
      "storeBlankRows": true,
      "skipDataLines": 0,
      "includeFileSources": false,
      "headerLines": 1,
      "limit": -1,
      "quoteCharacter": "\"",
      "projectName": "clipboard"
    }
  ]
}

One of the column operations you can perform in OpenRefine is to cast columns to text, dates or numerics, but I don’t think that is saved as metadata anywhere? You can also define column types in the SQL exporter, but again, I’m not sure that then becomes project metadata. It’d be good to see these things unified a bit, and framing such a process in terms of supporting a tabular data package (with things like column typing specified) could be useful.
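
For example, a tabular data package describes column typing with a simple table schema. Something along these lines (sketched here as a Python dict rather than the JSON file it would normally live in, with made-up column names) is the sort of thing OpenRefine could plausibly round-trip column types through:

# A Frictionless-style table schema fragment
# (illustrative; the column names are assumptions, not the actual quickdemo columns)
quickdemo_schema = {
    'fields': [
        {'name': 'id', 'type': 'integer'},
        {'name': 'name', 'type': 'string'},
        {'name': 'value', 'type': 'number'},
    ],
    'primaryKey': 'id',
}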

Another foil for this might be supporting a SQLite export format?

I have to admit I’m a bit confused as to where OpenRefine sits in different workflows, particularly with data that is managed, and as such is most likely to be stored in some sort of database? (Lots of the OpenRefine tooling still harkens to a Linked Data future, so maybe it fits better in Linked Data workflows?) I also get the feeling that it shares a possible overlap with query engine tools such as Apache Drill, and maybe even document data extraction tools such as Apache Tika or Tabula. Again, seeing demonstrated toolchains and workflows in this area could be interesting.

Note to self: there are several other PDF table extractor tools out there alongside Tabula (Java) that I haven’t played with; eg R/pdftools, Python/Camelot and Python/pdfplumber.

Simon Willison is doing all sorts of useful stuff framing datasette as a datasette / SQLite ecosystem play. It could be useful to think a bit more about OpenRefine in terms of how it integrates with other data tools. For example, the X-to-sqlite tools help you start to structure variously formatted data sources in terms of a common SQLite representation, which can naturally incorporate things like column typing, but also the notion of database primary and foreign key columns. In a sense, OpenRefine provides a similar “import from anything-export to one format (CSV)” with a data cleaning step in the middle, but CSV is really informally structured in terms of its self-descriptive representation.
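
As a trivial example of what that more structured representation buys you, here’s a sketch (using the standard library csv and sqlite3 modules; the filename and column names are made up) of loading an OpenRefine CSV export into SQLite with explicit column typing and a primary key, neither of which the CSV itself can express:

# Load an OpenRefine CSV export into SQLite with explicit types and a primary key
# (hypothetical filename and column names)
import csv
import sqlite3

conn = sqlite3.connect('quickdemo.sqlite')

# Declare the structure the CSV can't carry: types, NOT NULL, a primary key
conn.execute('''CREATE TABLE IF NOT EXISTS quickdemo (
                    id INTEGER PRIMARY KEY,
                    name TEXT NOT NULL,
                    value REAL
                )''')

with open('quickdemo-cleaned.csv') as f:
    rows = [(r['id'], r['name'], r['value']) for r in csv.DictReader(f)]

conn.executemany('INSERT INTO quickdemo (id, name, value) VALUES (?, ?, ?)', rows)
conn.commit()
conn.close()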

One of the insights I had when revising our TM351 relational database notebooks was that database table constraints can play a really useful role when helping clean a dataset by automatically identifying things that are wrong with it… I’ll maybe try to demonstrate an OpenRefine / Jupyter notebook hybrid workflow around that too…
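
To give a flavour of what I mean, here’s a sketch (SQLite for brevity; the table and columns are made up): if the target table declares constraints, dirty rows announce themselves as integrity errors at load time rather than lurking silently in the data.

# Use table constraints to flag dirty rows as they are loaded
import sqlite3

conn = sqlite3.connect(':memory:')
conn.execute('''CREATE TABLE members (
                    member_id INTEGER PRIMARY KEY,
                    name TEXT NOT NULL,
                    age INTEGER CHECK (age BETWEEN 0 AND 120)
                )''')

rows = [(1, 'Jo', 34),
        (2, None, 27),      # missing name -> NOT NULL violation
        (3, 'Sam', 999)]    # implausible age -> CHECK violation

for row in rows:
    try:
        conn.execute('INSERT INTO members VALUES (?, ?, ?)', row)
    except sqlite3.IntegrityError as e:
        # The constraint tells us what's wrong with the row
        print(f'Rejected {row}: {e}')

conn.commit()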

By the by, I noticed this post the other day: Exploring the dystopian future of a Javascript Gephi. Gephi, like OpenRefine, is a Java app, and like OpenRefine it’s one I’ve never been tempted to doodle with, code-wise, for a couple of reasons: a) Java doesn’t appeal to me as a language; b) I don’t have a Java environment to hand, and the thought of trying to set up an environment, and all the build tools, as a novice, for a complex legacy project just leaves me cold. As the Gephi developers see it, “[w]e have to face it: the multiplatform is moving from Java to web technologies. Oracle wants a Java that powers backends, not a user interface framework.”

I’ve dabbled with OpenRefine off and on for years now, and while its browser accessibility is really handy, the docs could do with some attention (I guess that’s something I could make a positive contribution to). Also, if it was a Python, rather than Java, application, I’d be more comfortable with it and would possibly start to poke around inside it a bit…

I guess one of the things it can do (though I’ve never really had to push it) is scale with larger datasets, although the memory overhead may then become an issue? I think the R/Pandas crossover folk have been doing a lot of work on efficient datatable representations and scalable tabular data interchange formats, and I’m not sure if OpenRefine is/will draw on any of that work?

It’s also been some time since I looked at Workbench (indeed, I haven’t really looked at it since I posted an early review), but a quick peek at the repo shows a fair amount of activity. Maybe I should look at it again…?

OpenRefine Running in MyBinder, Several Ways…

Python packages such as Jupyter Server Proxy allow you to use a Jupyter notebook server as a proxy for other services running in the same environment, such as a MyBinder container.

The jupyter-server-proxy package represents a generalisation of earlier demonstrations that showed how to proxy RStudio and OpenRefine in a MyBinder container.

In this post, I’ll serve up several ways of getting OpenRefine running in a MyBinder container. You can find all the examples in branches off psychemedia/jupyterserverproxy-openrefine.

Early original work on getting OpenRefine running in MyBinder was done by @betatim (betatim/openrefineder) using an earlier package, nbserverproxy; @yuvipanda helped me get my head round various bits of jupyterhub/jupyter-server-proxy/ which is key to proxying web services via Jupyter (jupyter-server-proxy docs). @manics provided the jupyter-server-proxy PR for handling predefined, rather than allocated, port mappings, which also made life much easier…

Common Installation Requirements

The following steps are pretty much common to all the recipes, and are responsible for installing OpenRefine and its dependencies.

First, in a binder/apt.txt file, the Java dependency:

openjdk-8-jre

A binder/postBuild step to install a specific version of OpenRefine:

#!/bin/bash
set -e

#Download and unpack the OpenRefine application into a hidden directory
VERSION=2.8
wget -q -O openrefine-$VERSION.tar.gz https://github.com/OpenRefine/OpenRefine/releases/download/$VERSION/openrefine-linux-$VERSION.tar.gz
mkdir -p $HOME/.openrefine
tar xzf openrefine-$VERSION.tar.gz -C $HOME/.openrefine
rm openrefine-$VERSION.tar.gz

#Create a directory for OpenRefine project files
mkdir -p $HOME/openrefine

A binder/requirements.txt file to install the Python OpenRefine API client:

git+https://github.com/dbutlerdb/refine-client-py

Note that this is a fork of the original client that supports Python 3. It works with OpenRefine 2.8 but I’m not sure if it works properly with OpenRefine 3. There are multiple forks of the client and from what I can tell they are differently broken. It would be great if the OpenRefine repo adopted one fork as the official client that everyone could contribute to.

start definition – Autostarting headless OpenRefine Server

Binder

This start branch (repo) demonstrates:

  • using a binder/start file to auto start OpenRefine;
  • a notebook/client demo; this essentially runs in a headless mode.

The binder/start file extends the MyBinder start CMD to run the commands included in the file in addition to the default command that starts the Jupyter notebook server. (The binder/start file can also be used to do things like set environment variables. I’m not sure how to make an environment variable defined in binder/postBuild available inside binder/start?)

#!/bin/bash

#Start OpenRefine
OPENREFINE_DIR="$HOME/openrefine"
mkdir -p $OPENREFINE_DIR
nohup $HOME/.openrefine/openrefine-2.8/refine -p 3333 -d $OPENREFINE_DIR > /dev/null 2>&1 &

exec "$@"

In this demo, you won’t be able to see the OpenRefine GUI. Instead, you can access the server via its API using an OpenRefine Python client. An included notebook gives a worked example (note that at the moment you can’t run the first few parts of the demo because they assume the presence of a pre-existing OpenRefine project; instructions appear further down the notebook for creating a project and working with it using the API client. I’ll do a separate post on the OpenRefine Python client at some point…)
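
For orientation, working with the client looks something like the following; I’m recalling the API from the client docs, so treat the exact calls, and the CSV filename, as assumptions rather than gospel:

# Connect to the headless OpenRefine server via the Python client
# (API recalled from the refine-client-py docs; treat as a sketch)
from google.refine import refine

server = refine.RefineServer('http://localhost:3333')
orefine = refine.Refine(server)

# List any projects already known to the server
print(orefine.list_projects())

# Create a new project from a local CSV file (hypothetical filename)
project = orefine.new_project(project_file='quickdemo.csv',
                              project_name='quickdemo')

# ...apply operations, then export the (cleaned) data
print(project.export(export_format='csv').read())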

simpleproxy definition

The simpleproxy branch (repo) extends the start branch with a proxy that can be used to render the OpenRefine GUI.

The binder/requirements.txt needs an additional package — the jupyter-server-proxy package. (I’m using the repo version because at the time of writing the PyPi released version doesn’t include all the features we need…)

git+https://github.com/dbutlerdb/refine-client-py
git+https://github.com/jupyterhub/jupyter-server-proxy

If you launch the Binder, it uses the server proxy to proxy the OpenRefine port to proxy/3333/; note that the trailing slash is important. Without it, the static files (CSS etc.) required to render the page are not resolved correctly.

traitlet-nolab definition

The traitlet-nolab branch (repo) uses the traitlet method (docs) to add a menu option to the Jupyter notebook homepage that allows OpenRefine to be started and launched from the notebook home New menu.

OpenRefine will also be started automatically if you start the MyBinder container with ?urlpath=openrefine or navigate directly to http://MYBINDERURL/openrefine.

Start on Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

In this case, the binder/start invocation is not required.

Once started, OpenRefine will appear on a named proxy path, openrefine (the trailing slash may be omitted in this case).

The traitlet is defined in a jupyter_notebook_config.py file:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'title': 'OpenRefine'
        },
    },
}

This is copied into the correct location by an additional binder/postBuild step:

mkdir -p $HOME/.jupyter/

#Although located in binder/,
# this bash file runs in $HOME rather than $HOME/binder
mv jupyter_notebook_config.py $HOME/.jupyter/

The traitlet definition file is loaded in as a notebook server configuration file prior to starting the notebook server.

Note that the definition file uses the port: 3333 attribute to explicitly set the port that the server will be served against. If this is omitted, then a port will be dynamically allocated by the proxy server. In the case of OpenRefine, I am defining a port explicitly so that the Python API client can connect to it directly on the assumed default port 3333.

Note that if we try to use the Python client without first launching the OpenRefine server, the connection will fail because there will be no running OpenRefine server for the client to connect to.

Python package setup definition

The setup branch (repo) demonstrates:

  • using serverproxy (setup definition (docs)) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment once again so that we can work with the client package using default port settings.

Start in Jupyter notebook homepage: Binder

Start in OpenRefine client: Binder

For this build, we go back to the base setup (no binder/start, no traitlet definition file) and add a setup.py file:

import setuptools

setuptools.setup(
  name="jupyter-openrefine-server",
  # py_modules rather than packages, since we only have 1 file
  py_modules=['openrefine'],
  entry_points={
      'jupyter_serverproxy_servers': [
          # name = packagename:function_name
          'openrefine = openrefine:setup_openrefine',
      ]
  },
)

This calls on an openrefine.py file to define the configuration:

import os

def setup_openrefine():
  path = os.path.join(os.environ['HOME'], 'openrefine')
  return {
    'command': [os.path.join(os.environ['HOME'], '.openrefine/openrefine-2.8/refine'), '-p', '{port}', '-d', path],
    'port': 3333,
    'launcher_entry': {
        'title': 'OpenRefine',
    },
  }

As before, an OpenRefine option is added to the start menu and can be used to start the OpenRefine server and launch the UI client on the path openrefine. (As we started the server on a known port, we can also find it explicitly at proxy/3333.)

Calling the aliased URL directly will also start the server. This means we can tell MyBinder to open on the openrefine path (or add ?urlpath=openrefine to the Binder URL) and the container will open into the OpenRefine application.

Once again, we need to launch the OpenRefine app before we can connect to it from the Python client.

master branch – traitlet definition, Notebook and JupyterLab Support

The master branch (repo) builds on the traitlet definition branch and demonstrates:

  • using serverproxy (traitlet definition) to add an OpenRefine menu option to the notebook start menu. The configuration uses a fixed port assignment so that we can work with the client package.
  • a button is also enabled and added to the JupyterLab launcher.

OpenRefine can now be started and launched from the notebook homepage New menu or from the JupyterLab launcher, via a ?urlpath=openrefine MyBinder launch invocation, or by navigating directly to the proxied path openrefine.

Open to Notebook homepage: Binder

Open to OpenRefine: Binder

Open to JupyterLab: Binder

In this case, we need to enable the JupyterLab extension with the following addition to the binder/postBuild file:

#Enable the OpenRefine icon in the JupyterLab desktop launcher
jupyter labextension install jupyterlab-server-proxy

This will enable a default start button in the JupyterLab launcher.

We can also provide an icon for the start button. Further modify the binder/postBuild file to copy the logo to a desired location:

#Although located in binder/,
# this bash file runs in $HOME rather than $HOME/binder
mv open-refine-logo.svg $HOME/.jupyter/

and modify the jupyter_notebook_config.py with the addition of a path to the start logo, also ensuring that the launcher entry is enabled:

# Traitlet configuration file for jupyter-notebook.

c.ServerProxy.servers = {
    'openrefine': {
        'command': ['/home/jovyan/.openrefine/openrefine-2.8/refine', '-p', '{port}','-d','/home/jovyan/openrefine'],
        'port': 3333,
        'timeout': 120,
        'launcher_entry': {
            'enabled': True,
            'icon_path': '/home/jovyan/.jupyter/open-refine-logo.svg',
            'title': 'OpenRefine',
        },
    },
}

We should now see a start button for OpenRefine in the JupyterLab launcher.

Clicking on the button will autostart the server and open a browser tab onto the OpenRefine application GUI.

Summary

Running OpenRefine in MyBinder using JupyterServerProxy allows us to use OpenRefine as part of a shareable, on demand, serverless, Jupyter mediated workbench, defined via a public Github repository.

As well as being accessed as a GUI application, and via the Python API client, OpenRefine can also connect to a PostgreSQL server running inside the MyBinder container. For running PostgreSQL inside MyBinder, see Running a PostgreSQL Server in a MyBinder Container; for connecting OpenRefine to a Postgres server running in the same MyBinder container, see OpenRefine Database Connections in MyBinder.

OpenRefine Hangs On Start…

If you’re testing OpenRefine on things like Digital Ocean, use at least a 3GB server, ideally more.

If you use a 2GB server and default OpenRefine start settings, you may find it stalls on start and just hangs, particularly if you are running it via a Docker container.

(I just wasted four, going on five, hours trying to debug what I thought were other issues, when all the time it was a poxy memory issue.)

So be warned: when testing Java apps in Docker containers / Docker Compose configurations, use max spec, not min spec machines.

I waste hours of my evenings and weekends on this sort of crap so you don’t have to… #ffs

Quick Review – Jupyter Multi Outputs Notebook Extension

This post represents a quick review of the Jupyter multi-outputs Jupyter notebook extension.

The extension is one of a series of extensions developed by the Japanese National Institute of Informatics (NII) Literate Computing for Reproducible Infrastructure project.

My feeling is that some of these notebook extensions may also be useful in an educational context for supporting teaching and learning activities within Jupyter notebooks, and I’ll try to post additional reviews of some of the other extensions.

So what does the multi-outputs extension offer?

Running cells highlights a running cell in light blue, a successfully run cell (or one that runs to completion with a warning) in green, and one that fails to run to completion due to an error in pink/red.

If you reload the notebook in the browser, the colour highlighting is lost. Similarly if you close the notebook and then open it again (whether or not the kernel is kept alive or restarted). This nudges you towards always shutting down the kernel when you close a notebook, or always restarting the kernel if you reload a notebook page in the browser, if you want to benefit fully from the semantics associated with the cell colour highlighting.

We can also save the output of a cell into a tab identified by the cell execution number. Once the cell is run, click on the pin item in the left hand margin to save that cell output:

The output is saved into a tab numbered according to the cell execution count number. You can now run the cell again:

and click on the previously saved output tab. You may notice that when you select a previous output tab, a left/right arrow “show differences” icon appears:

Click on that output to compare the current and previous outputs:

(I find the graphic display a little confusing, but that’s typical of many differs! If you look closely, you may see green (addition) and red (deletion) highlighting.)

The differ display also supports simple search (you need to hit Return to register the search term as such.)

The saved output is actually saved as notebook metadata associated with the cell, which means it will persist when the notebook is closed and restarted at a later date.
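
You can see this for yourself by poking at the saved notebook file, for example with nbformat; the notebook filename is made up, and I haven’t checked exactly which metadata key the extension uses, so the sketch just lists whatever per-cell metadata keys it finds:

# Inspect per-cell metadata in a saved notebook to find the pinned outputs
# (hypothetical notebook filename; the extension's metadata key is not assumed)
import nbformat

nb = nbformat.read('multi-outputs-demo.ipynb', as_version=4)

for i, cell in enumerate(nb.cells):
    if cell.cell_type == 'code' and cell.metadata:
        # The extension stashes its saved outputs somewhere under cell.metadata
        print(i, list(cell.metadata.keys()))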

One of the hacky tools I’ve got in tm351_utils (which really needs some example notebooks…) is a simple differencing display. I’m not sure if any of the TM351 notebooks I suggested during the last round of revisions that used the differ made it into the finally released notebooks, but it might be worth comparing that approach, of diffing across the outputs of two cells, with this approach, of diffing between two outputs from the same cell run at different times/with different parameters/state.

Config settings appear to be limited to the maximum number of saved / historical tabs per cell:

So, useful? I’ll try to work up some education related examples. (If you have any ideas for some, or have already identified and/or demonstrated some, please let me know via the comments.)

Connecting to a Remote Jupyter Notebook Server Running on Digital Ocean from Microsoft VS Code

Despite seeing talk of Jupyter notebook integration in Microsoft Visual Studio (VS) Code, I didn’t do much more than pass it on (via the Tracking Jupyter newsletter) because I thought it was part of a heavyweight Visual Studio IDE.

Not so.

Microsoft Visual Studio Code is an Electron app, reminiscent-ish of the Atom editor (maybe?), that’s available as a quite compact download across Windows, Mac and Linux platforms.

Navigating the VS Code UI is probably the hardest part of connecting it to a Jupyter kernel, remote or local, so let’s see what’s involved.

If you haven’t got VS Code installed, you’ll need to download and install it.

Install the Python extension and reload…

Now let’s go hunting for the connection dialogue…

From the Command Palette, search for Python: Specify Jupyter server URI (there may be an easier way: I’ve spent all of five minutes with this environment!):

You’ll be prompted with another dialogue. Select the Type in the URI to connect to a running Jupyter server option:

and you’ll be prompted for a URI. But what URI?

Let’s launch a Digital Ocean server.

If you don’t have a Digital Ocean account you can create one here and get $100 free credit, which is way more than enough for this demo.

Creating a server is quite straightforward. There’s an example recipe here — you’ll need to create a Docker server as a one-click app, select your region and server size (a cheap 2GB server will be plenty), and then enter the following into the User data area:

#!/bin/bash

docker run -d --rm -p 80:8888 -e JUPYTER_TOKEN=letmein jupyter/minimal-notebook

You can now create your server (optionally naming it for convenience):

The server will be launched and after a moment or two it will be assigned a public IP address. Copy this address and paste it into a browser location bar — this is just to help us monitor when the Jupyter server is ready (it will probably take a minute or two to download and install the notebook container into the server).

When you see the notebook server (no need to log in, unless you want to; the token is letmein, or whatever you set it to in the User data form), you can enter the following into the VS Code server URI form using the IP address of your server:

http://IPADDRESS?token=letmein

In VS Code, raise the Command Palette… again and start to search for Python: Show Python Interactive window.

When you select it, a new interactive Python tab will be opened, connected to the remote server.

You should now be able to interact with your remote IPython kernel running on a Digital Ocean server.

See Working with Jupyter Notebooks in Visual Studio Code for some ideas of what to do next… (I should probably work through this too…)

If you want to change the remote Jupyter kernel URL, you either need to quit VS Code, restart it, and go through the adding a connection URI process again, or dip into the preferences (h/t Nick H. in the TM351 forums for that spot):

When you’re done, go back to the Digital Ocean control panel and destroy the droplet you created. If you don’t, you’ll continue to be billed at its hourly rate for each hour, or part thereof, that you keep it around (switched on or not, there’s still a rental charge… If you treat the servers as temporary servers, and destroy them when you’re done, your $100 can go a long way…)

Adding Zip Files to Github Gists

Over the years, I’ve regularly used Github gists as a convenient place to post code fragments. Using the web UI, it’s easy enough to create new text files. But how do you add images, or zip files to a gist..?

…because there’s no way I can see of doing it using the web UI?

But a gist is just a git repo, so we should be able to commit binary files to it.

Ish… Via this gist — How to add an image to a gist — comes this simple recipe. It requires that you have git installed on your computer…

#GISTID is in the gist URL: https://gist.github.com/USERNAME/GISTID
GIST=GISTID

git clone https://gist.github.com/$GIST.git

#MYBINARYFILENAME is something like mydata.zip or myimage.png
cp MYBINARYPATH/MYBINARYFILENAME $GIST/

cd $GIST
git add MYBINARYFILENAME

git commit -m "MY COMMIT MESSAGE"

git push origin master
#If prompted, provide your Github credentials associated with the gist

Handy…

Not Quite Serverless Reproducible Environments With Digital Ocean

I’ve been on something of a code rollercoaster over the last few days, fumbling with Jupyter server proxy settings in MyBinder, and fighting with OpenRefine, but I think I’ve stumbled on a new pattern that could be quite handy. In fact, I think it’s frickin’ ace, even though it’s just another riff, another iteration, on what we’ve been able to do before (e.g. along the lines of deploy to tutum, Heroku or zeit.now).

(It’s probably nothing that wasn’t out there already, but I have seen the light that issues forth from it.)

One of the great things about MyBinder is that it helps make you work reproducible. There’s a good, practical, review of why in this presentation: Reproducible, Reliable, Reusable Analyses with BinderHub on Cloud. One of the presentation claims is that it costs about $5000 a month to run MyBinder, covered at the moment by the project, but following the reference link I couldn’t see that number anywhere. What I did see, though, was something worth bearing in mind: “[MyBinder] users are guaranteed at least 1GB of RAM, with a maximum of 2GB”.

For OpenRefine running in MyBinder, along with other services, this shows why it may struggle at times…

So, how can we help?

And how can we get around not knowing what other stuff repo2docker, the build agent for MyBinder, might be putting into the server, or not being able to use Docker Compose to link across several services in the Binderised environment, or having to run MyBinder containers in public (although it looks as though even Binder auth in general may now be available)?

One way would be for institutions to chip in the readies to help keep the public MyBinder service free. Another way could be a sprint on a federated Binderhub in which institutions could chip in server resource. Another would be for institutions to host their own Binderhub instances, either publicly available or just available to registered users. (In the latter case, it would be good if the institutions also contributed developer effort, code, docs or community development back to the Jupyter project as a whole.)

Alternatively, we can roll our own server. But setting up a Binderhub instance is not necessarily the easiest of tasks (I’ve yet to try it…) and isn’t really the sort of thing your average postgrad or data journalist who wants to run a Binderised environment should be expected to have to do.

So what’s the alternative?

To my mind, Binderhub offers a serverless sort of experience, though that’s not to say no servers are involved. The model is that I can pick a repo, click a button, a server is started, my image built and a container launched, and I get to use the resulting environment. Find repo. Click button. Use environment. The servery stuff in the middle — the provisioning of a physical web server and the building of the environment — that’s nothing I have to worry about. It’s seamless and serverless.

Another thing to note is that MyBinder use cases are temporary / ephemeral. Launch Binderised app. Use it. Kill it.

This stands in contrast to setting services running for extended periods of time, and managing them over that period, which is probably what you’d want to do if you ran your own Binderhub instance. I’m not really interested in that. I want to: launch my environment; use it; kill it. (I keep trying to find a way of explaining this “personal application server” position clearly, but not with much success so far…)

So here’s where I’m at now: a nearly, or not quite, serverless solution; a bring your own server approach, in fact, using Digital Ocean, which is the easiest cloud hosting solution I’ve found for the sorts of things I want to do.

It’s based around the User data text box in a Digital Ocean droplet creation page:

If you pop a shell script in there, it will run the code that appears in that box once the server is started.

But what code?

That’s the pattern I’ve started exploring.

Something like this:

#!/bin/bash

#Optionally:
export JUPYTER_TOKEN=myOwnPA5%w0rD

#Optionally:
#export REFINEVERSION=2.8

GIST=d67e7de29a2d012183681778662ef4b6
git clone https://gist.github.com/$GIST.git
cd $GIST
docker-compose up -d

which will grab a script (saved as a set of files in a public gist) to download and install an OpenRefine server inside a token-protected Jupyter notebook server. The OpenRefine server runs via a Jupyter server proxy (see also OpenRefine Running in MyBinder, Several Ways… for various ways of running OpenRefine behind a Jupyter server proxy in MyBinder).

Or this (original gist):

#!/bin/bash

#Optionally:
#export JUPYTER_TOKEN=myOwnPA5%w0rD

GIST=8fa117e34c62b7f80b6c595b8ba4f488

git clone https://gist.github.com/$GIST.git
cd $GIST

docker-compose up -d

that will download and install a docker-compose set of elements:

  • a Jupyter notebook server, seeded with a demo notebook and various SQL magics;
  • a Postgres server (empty; I really need to add a fragment showing how to seed it with data, though a minimal sketch follows this list, or you should be able to figure it out from here: Running a PostgreSQL Server in a MyBinder Container);
  • an AgensGraph server; AgensGraph is a graph database built on Postgres. The demo notebook currently uses the first part of the AgensGraph tutorial to show how to load data into it.
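
Pending that fragment, a minimal sketch of the sort of thing I mean, run from a notebook in the linked Jupyter container; the service hostname, credentials and availability of psycopg2 all depend on how the docker-compose.yml is set up, so treat them as placeholders:

# Seed the linked Postgres service from a notebook (assumes psycopg2 is installed;
# hostname/credentials are placeholders - use whatever the docker-compose.yml defines)
import psycopg2

conn = psycopg2.connect(host='postgres', port=5432,
                        user='postgres', password='postgres',
                        dbname='postgres')

with conn, conn.cursor() as cursor:
    cursor.execute('''CREATE TABLE IF NOT EXISTS quickdemo (
                          id INTEGER PRIMARY KEY,
                          name TEXT,
                          value REAL
                      )''')
    cursor.executemany('INSERT INTO quickdemo VALUES (%s, %s, %s)',
                       [(1, 'this', 12.3), (2, 'that', 45.6)])

conn.close()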

(The latter example includes a zip file that you can’t upload via the Gist web interface; so here’s a recipe for adding binary files (zip files, image files) to a Github gist.)

So what do you need to do to get the above environments up and running?

  • go to Digital Ocean at https://www.digitalocean.com/ (this link will get you $100 credit if you need to create an account);
  • create a droplet;
  • select ‘one-click’ type, and the docker flavour;
  • select a server size (get the cheap 3GB server and the demos will be fine);
  • select a region (or don’t); I generally go for London cos I figure it’s locallest;
  • check the User data check box and paste in one of the above recipes (make sure the hashbang (#!/bin/bash) is on the first line, with no blank lines above it);
  • optionally name the image (for your convenience and lack of admin panel eyesoreness);
  • click create;
  • copy the IP address of the server that’s created;
  • after 3 or 4 minutes (it may take some time to download the app containers into the server), paste the IP address into a browser location bar;
  • when presented with the Jupyter notebook login page, enter the default token (letmein; or the token you added in the User data script, if you did), or use it to set a different password at the bottom of the login page;
  • use your apps…
  • destroy the droplet (so you don’t continue to pay for it).

If that sounds too hard / too many steps, there are some pictures to show you what to do in the Creating a Digital Ocean Docker Droplet section of this post.

It’s really not that hard…

Though it could be easier. For example, if we had a “deploy to Digital Ocean” button that took something like the form http://deploygist.digitalocean.com/GIST and that looked for user_data and maybe other metadata files (region, server size, etc.) to set a server running on your account and then redirect you to the appropriate webpage.
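
In the meantime, the Digital Ocean API gets you most of the way there already. A sketch using the Python requests package; the size and image slugs, the user_data.sh filename and the droplet name are all placeholders, and you’d need a personal access token:

# Script the 'deploy to Digital Ocean' step via the DO v2 API
# (size/image slugs and user_data.sh are placeholders; requires an API token)
import os
import requests

headers = {'Authorization': f"Bearer {os.environ['DIGITALOCEAN_TOKEN']}"}

with open('user_data.sh') as f:
    user_data = f.read()

droplet = {
    'name': 'binderish-demo',
    'region': 'lon1',
    'size': 's-1vcpu-3gb',      # check the current size slugs
    'image': 'docker-18-04',    # the Docker one-click image slug (placeholder)
    'user_data': user_data,
}

r = requests.post('https://api.digitalocean.com/v2/droplets',
                  headers=headers, json=droplet)
print(r.json())

# ...and when you're done, destroy it so the meter stops running:
# requests.delete(f'https://api.digitalocean.com/v2/droplets/{droplet_id}', headers=headers)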

We don’t need to rely on just web clients either. For example, here’s a recipe for Connecting to a Remote Jupyter Notebook Server Running on Digital Ocean from Microsoft VS Code.

The next thing I need to think about is preserving state. This looks like it may be promising in that regard? Filestash [docs]. This might also be worth looking at: Pydio. (Or this: Graphite h/t @simonperry.)

For anyone still reading and still interested, here are some more reasons why I think this post is a useful one…

The linked gists are both based around Docker deployments (it makes sense to use Docker, I think, because a lot of hard-to-install software is already packaged in Docker containers), although they demonstrate different techniques:

  • the first (OpenRefine) demo extends a Jupyter container so that it includes the OpenRefine server; OpenRefine then hides behind the Jupyter notebook auth and is proxied using Jupyter server proxy;
  • the second (AgensGraph) demo uses Docker compose. The database services are headless and not exposed.

What I have tried to do in the docker-compose.yml and Dockerfile files is show a variety of techniques for getting stuff done. I’ll comment them more liberally, or write a post about them, when I get a chance. One thing I still need to do is a demo using nginx as a reverse proxy, with and without simple http auth. One thing I’m not sure how to do, if indeed it’s doable, is proxy services from a separate container using Jupyter server proxy; nginx would provide a way around that.