Not Quite Serverless Reproducible Environments With Digital Ocean

I’ve been on something of a code rollercoaster over the last few days, fumbling with Jupyter server proxy settings in MyBinder, and fighting with OpenRefine, but I think I’ve stumbled on a new pattern that could be quite handy. In fact, I think it’s frickin’ ace, even though it’s just another riff, another iteration, on what we’ve been able to do before (e.g. along the lines of deploy to tutum, Heroku or zeit.now).

(It’s probably nothing that wasn’t out there already, but I have seen the light that issues forth from it.)

One of the great things about MyBinder is that it helps make your work reproducible. There’s a good, practical review of why in this presentation: Reproducible, Reliable, Reusable Analyses with BinderHub on Cloud. One claim in the presentation is that it costs about $5000 a month to run MyBinder, covered at the moment by the project, though following the reference link I couldn’t see that number anywhere. What I did see, though, was something worth bearing in mind: “[MyBinder] users are guaranteed at least 1GB of RAM, with a maximum of 2GB”.

For OpenRefine running in MyBinder alongside other services, this goes some way to explaining why it may struggle at times…

So, how can we help?

And how can we get around the fact that we don’t know what other stuff repo2docker, the build agent for MyBinder, might be putting into the server; that we can’t use Docker Compose to link several services together in the Binderised environment; or that we have to run MyBinder containers in public (although it looks as though evil Binder auth in general may now be available)?

One way would be for institutions to chip in the readies to help keep the public MyBinder service free. Another way could be a sprint on a federated Binderhub in which institutions could chip in server resource. Another would be for institutions to host their own Binderhub instances, either publicly available or just available to registered users. (In the latter case, it would be good if the institutions also contributed developer effort, code, docs or community development back to the Jupyter project as a whole.)

Alternatively, we can roll our own server. But setting up a Binderhub instance is not necessarily the easiest of tasks (I’ve yet to try it…) and isn’t really the sort of thing your average postgrad or data journalist who wants to run a Binderised environment should be expected to have to do.

So what’s the alternative?

To my mind, Binderhub offers a serverless sort of experience, though that’s not to say no servers are involved. The model is that I can pick a repo, click a button, a server is started, my image built and a container launched, and I get to use the resulting environment. Find repo. Click button. Use environment. The servery stuff in the middle — the provisioning of a physical web server and the building of the environment — that’s nothing I have to worry about. It’s seamless and serverless.

Another thing to note is that MyBinder use cases are temporary / ephemeral. Launch Binderised app. Use it. Kill it.

This stands in contrast to setting services running for extended periods of time, and managing them over that period, which is probably what you’d want to do if you ran your own Binderhub instance. I’m not really interested in that. I want to: launch my environment; use it; kill it. (I keep trying to find a way of explaining this “personal application server” position clearly, but not with much success so far…)

So here’s where I’m at now: a nearly, or not quite, serverless solution; a bring your own server approach, in fact, using Digital Ocean, which is the easiest cloud hosting solution I’ve found for the sorts of things I want to do.

It’s based around the User data text box on the Digital Ocean droplet creation page:

If you pop a shell script in there, the code in that box will be run when the server is first booted.

But what code?

That’s the pattern I’ve started exploring.

Something like this:

#!/bin/bash

#Optionally, set your own notebook login token (the default is letmein):
export JUPYTER_TOKEN=myOwnPA5%w0rD

#Optionally, pin a particular OpenRefine version:
#export REFINEVERSION=2.8

#The public gist containing the docker-compose.yml and Dockerfile
GIST=d67e7de29a2d012183681778662ef4b6
git clone https://gist.github.com/$GIST.git
cd $GIST

#Build the image and launch the container in the background
docker-compose up -d

which will grab a set of files saved in a public gist that download and install an OpenRefine server inside a token protected Jupyter notebook server. (The OpenRefine server runs via Jupyter server proxy; see also OpenRefine Running in MyBinder, Several Ways… for various ways of running OpenRefine behind a Jupyter server proxy in MyBinder.)
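
As an aside on how those optional exports work: docker-compose substitutes ${VARIABLE} references in the docker-compose.yml file from the shell environment when it brings the services up, so a token or version number exported in the User data script flows through to the containers. The following is a sketch only (the exact lines in the gist’s compose file may differ), but it shows the idea, along with a handy way of previewing the resolved file:

#Sketch only: docker-compose resolves ${VARS} in docker-compose.yml from the
#environment, so a service entry along the lines of
#
#  environment:
#    - JUPYTER_TOKEN=${JUPYTER_TOKEN:-letmein}
#
#falls back to the default token (letmein) if nothing is exported.

export JUPYTER_TOKEN=myOwnPA5%w0rD

#Preview the compose file with all the variables resolved
docker-compose config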

Or this (original gist):

#!/bin/bash

#Optionally, set your own notebook login token (the default is letmein):
#export JUPYTER_TOKEN=myOwnPA5%w0rD

#The public gist containing the docker-compose.yml and associated files
GIST=8fa117e34c62b7f80b6c595b8ba4f488

git clone https://gist.github.com/$GIST.git
cd $GIST

#Build and launch the linked containers in the background
docker-compose up -d

that will download and launch a docker-compose managed set of services:

  • a Jupyter notebook server, seeded with a demo notebook and various SQL magics;
  • a Postgres server (empty; I really need to add a fragment showing how to seed it with data (a rough sketch follows below), or you should be able to figure it out from here: Running a PostgreSQL Server in a MyBinder Container);
  • an AgensGraph server; AgensGraph is a graph database built on Postgres. The demo notebook currently uses the first part of the AgensGraph tutorial to show how to load data into it.

(The latter example includes a zip file that you can’t upload via the Gist web interface; so here’s a recipe for adding binary files (zip files, image files) to a Github gist.)
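
On the seeding front, here’s a rough, untested sketch of one way of doing it once the containers are up, by piping a SQL file into the Postgres container; the service name (postgres) and user are assumptions, so check the gist’s docker-compose.yml for the actual values. (The official Postgres image will also run any .sql files placed in /docker-entrypoint-initdb.d the first time it starts, which offers another route in.)

#Write a trivial bit of demo SQL to a file
cat > seed.sql <<EOF
CREATE TABLE quickdemo (id INT, label TEXT);
INSERT INTO quickdemo VALUES (1, 'hello'), (2, 'world');
EOF

#Pipe the SQL into the Postgres container
#(-T disables TTY allocation so that piping stdin works)
docker-compose exec -T postgres psql -U postgres < seed.sql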

So what do you need to do to get the above environments up and running?

  • go to [Digital Ocean](https://www.digitalocean.com/) (this link will get you $100 credit if you need to create an account);
  • create a droplet;
  • select ‘one-click’ type, and the docker flavour;
  • select a server size (get the cheap 3GB server and the demos will be fine);
  • select a region (or don’t); I generally go for London cos I figure it’s locallest;
  • check the User data check box and paste in one of the above recipes (make sure the hashbang (#!/bin/bash) is on the very first line, with no blank lines above it);
  • optionally name the image (for your convenience and lack of admin panel eyesoreness);
  • click create;
  • copy the IP address of the server that’s created;
  • after 3 or 4 minutes (it may take some time to download the app containers into the server; a couple of optional progress checks are sketched just after this list), paste the IP address into a browser location bar;
  • when presented with the Jupyter notebook login page, enter the default token (letmein; or the token you added in the User data script, if you did), or use it to set a different password at the bottom of the login page;
  • use your apps…
  • destroy the droplet (so you don’t continue to pay for it).
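
If you’re impatient, and assuming you added an SSH key when creating the droplet, you can also peek at what the User data script is up to; the one-click Docker image is Ubuntu based, so the script’s output should end up in the cloud-init log:

#Optional progress checks over SSH (assumes an SSH key was added to the droplet);
#replace IP.ADDRESS.HERE with the droplet's IP address
ssh root@IP.ADDRESS.HERE 'tail -n 50 /var/log/cloud-init-output.log'

#Once the build has finished, the containers should show up as running
ssh root@IP.ADDRESS.HERE 'docker ps'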

If that sounds too hard / too many steps, there are some pictures to show you what to do in the Creating a Digital Ocean Docker Droplet section of this post.

It’s really not that hard…

Though it could be easier. For example, if we had a “deploy to Digital Ocean” button that took something like the form http://deploygist.digitalocean.com/GIST and that looked for a user_data file, and maybe other metadata files (region, server size, etc.), to set a server running on your account and then redirect you to the appropriate webpage.
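
In the meantime, something close to that is already scriptable from the command line using doctl, Digital Ocean’s own CLI client, which can pass the User data script in as a file when it creates a droplet. A sketch only: the image and size slugs are assumptions, so check them against doctl compute image list --public and doctl compute size list, and doctl needs authenticating first with doctl auth init.

#Sketch: create a Docker droplet from the command line, passing in the same
#User data script as a file; the image and size slugs are assumptions - check:
#  doctl compute image list --public
#  doctl compute size list
doctl compute droplet create mydemo \
  --image docker-18-04 \
  --size s-1vcpu-3gb \
  --region lon1 \
  --user-data-file ./user_data.sh \
  --wait

#...and destroy it again when you're done
doctl compute droplet delete mydemo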

We don’t need to rely on just web clients either. For example, here’s a recipe for Connecting to a Remote Jupyter Notebook Server Running on Digital Ocean from Microsoft VS Code.

The next thing I need to think about is preserving state. This looks like it may be promising in that regard: Filestash [docs]. This might also be worth looking at: Pydio. (Or this: Graphite, h/t @simonperry.)

For anyone still reading and still interested, here are some more reasons why I think this post is a useful one…

The linked gists are both based around Docker deployments (it makes sense to use Docker, I think, because a lot of hard to install software is already packaged in Docker containers), although they demonstrate different techniques:

  • the first (OpenRefine) demo extends a Jupyter container so that it includes the OpenRefine server; OpenRefine then hides behind the Jupyter notebook auth and is proxied using Jupyter server proxy;
  • the second (AgensGraph) demo uses Docker compose. The database services are headless and not exposed (see the note just after this list).
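
To be clear about what “not exposed” means here: the database services have no ports: mappings in the docker-compose.yml, so they’re only reachable by other containers on the compose network, via their service names, rather than from the public internet. A quick way of checking what is actually published on a droplet:

#List the running compose services and any published ports
docker-compose ps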

What I have tried to do in the docker-compose.yml and Dockerfile files is show a variety of techniques for getting stuff done. I’ll comment them more liberally, or write a post about them, when I get a chance. One thing I still need to do is a demo using nginx as a reverse proxy, with and without simple http auth (a first, rough sketch of the auth piece is below). One thing I’m not sure how to do, if indeed it’s doable, is proxy services from a separate container using Jupyter server proxy; nginx would provide a way around that.
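
Here’s the rough, untested shape I have in mind for that simple auth piece: generate an htpasswd file, plus a minimal nginx config to sit in front of whatever needs protecting. The upstream name and port (notebook:8888) are placeholders for a service that would be defined in the docker-compose.yml; the nginx service would then mount nginx.conf and .htpasswd as read-only volumes and publish port 80, with the other services left unexposed.

#Rough, untested sketch of the simple http auth piece for an nginx reverse proxy;
#the upstream name and port (notebook:8888) are placeholders.

#Generate an htpasswd file (the httpd image bundles the htpasswd utility)
docker run --rm httpd:alpine htpasswd -Bbn demo letmein > .htpasswd

#A minimal nginx config to mount into an nginx container alongside .htpasswd
cat > nginx.conf <<'EOF'
events {}
http {
  server {
    listen 80;
    location / {
      auth_basic "Restricted";
      auth_basic_user_file /etc/nginx/.htpasswd;
      proxy_pass http://notebook:8888;
      proxy_set_header Host $host;
      #Websocket support, needed for Jupyter notebooks
      proxy_http_version 1.1;
      proxy_set_header Upgrade $http_upgrade;
      proxy_set_header Connection "upgrade";
    }
  }
}
EOF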

