From Charts to Interactive Apps With Jupyter Notebooks and IPywidgets…

One of the nice things about Jupyter notebooks is that once you’ve got some script in place to generate a particular sort of graphic, you can very easily turn it into a parameterised, widgetised app that lets you generate chart views at will.

For example, here’s an interactive take on a WRC chartable(?!) I’ve been playing with today. Given a function that generates a table for a given stage, rebased against a specified driver, it takes only a few lines of code and some very straightforward widget definitions to create an interactive custom chart generating application around it:
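
To give a flavour, here’s a minimal sketch of the sort of wiring involved, assuming a hypothetical stage_table(stage, driver) function that returns a rebased table for a given stage and driver (the function body, stage labels and driver codes are all stand-ins for the example):

import pandas as pd
import ipywidgets as widgets
from ipywidgets import interact

# Stand-in for the real table generating function: given a stage and a driver
# to rebase against, return a (here, dummy) dataframe
def stage_table(stage, driver):
    return pd.DataFrame({'stage': [stage], 'rebased_on': [driver]})

stages = ['SS1', 'SS2', 'SS3', 'SS4']    # example stage labels
drivers = ['LAP', 'MIK', 'LOE']          # example driver codes

# interact() builds the dropdown widgets and reruns stage_table() on every change
interact(stage_table,
         stage=widgets.Dropdown(options=stages, description='Stage:'),
         driver=widgets.Dropdown(options=drivers, description='Rebase on:'));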

In this case, I am dynamically populating the drivers list based on which class is selected. (Unfortunately, it only seems to work for RC1 and RC4 at the moment. But I can select drivers and stages within that class…)
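
The dynamic population works along these lines (a sketch with a made-up class-to-entrants mapping; the real thing looks the entry lists up from the rally data):

import ipywidgets as widgets

# Made-up mapping from class to driver codes, purely for the example
class_drivers = {'RC1': ['LAP', 'MIK', 'LOE'],
                 'RC4': ['DRIVER_A', 'DRIVER_B']}

class_dropdown = widgets.Dropdown(options=list(class_drivers), description='Class:')
driver_dropdown = widgets.Dropdown(options=class_drivers['RC1'], description='Driver:')

# Repopulate the driver list whenever the class selection changes
def on_class_change(change):
    driver_dropdown.options = class_drivers[change['new']]

class_dropdown.observe(on_class_change, names='value')

widgets.VBox([class_dropdown, driver_dropdown])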

It also struck me that we can add further controls to select which columns are displayed in the output chart:
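
For the column selection, something like a SelectMultiple widget does the job (again, just a sketch; the column names are invented and the display function is a stub):

import ipywidgets as widgets
from ipywidgets import interact

# Columns we might want to toggle on or off in the displayed table
all_cols = ['roadPos', 'overallPosition', 'stagePosition', 'gapToLeader']

cols_selector = widgets.SelectMultiple(options=all_cols, value=tuple(all_cols),
                                       description='Columns:')

# Stub standing in for the chart/table generating function
def show_table(cols):
    print('Showing columns:', list(cols))

interact(show_table, cols=cols_selector);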

What this means is that we can easily create simple applications capable of producing a wide variety of customised chart outputs. Such a tool might be useful for a sports journalist wanting to use different sorts of table to illustrate different sorts of sports report.

Tinkering with Stage Charts for WRC Rally Sweden

Picking up on some doodles I did around the Dakar 2019 rally, a quick review of a couple of chart types I’ve been tweeting today…

First up is a chart showing the evolution of the rally over the course of the first day.

overall_lap_ss8

This chart mixes metaphors a little…

The graphics relate directly to the driver identified by the row. The numbers are rebased relative to a particular driver (Lappi, in this case).

The first column, the stepped line chart, tracks overall position over the stages, the vertical bar chart next to it identifying the gap to the overall leader at the end of each stage (that is, how far behind the overall leader each driver is at the end of each stage). The green dot highlights that the driver was in overall lead of the rally at the end of that stage.

The SS_N_overall numbers represent rebased overall times. So we see that at the end of SS2, MIK was 9s ahead of LAP overall, and LAP was 13.1 seconds ahead of LOE. The stagePosition stepped line shows how the driver specified by each row fared on each stage. The second vertical bar chart shows the time that driver lost compared to the stage winner; again, a green dot highlights a nominal first position, in this case stage wins. The SS_N numbers are once again rebased times, this time showing how much time the rebased driver gained (green) or lost (red) compared to the driver named on that row.
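
The rebasing itself is just a subtraction: take each driver’s cumulative time and subtract the corresponding time for the driver you’re rebasing on. A minimal pandas sketch of the idea, with times invented so the SS2 gaps match the ones quoted above (the sign convention, negative meaning ahead of the rebase driver, is just the one I’ve picked for the example):

import pandas as pd

# Made-up cumulative overall times (seconds) at the end of each stage, by driver
overall = pd.DataFrame({'SS1': [156.0, 158.2, 160.1],
                        'SS2': [310.5, 319.5, 332.6]},
                       index=['MIK', 'LAP', 'LOE'])

# Rebase relative to LAP: negative means ahead of LAP, positive means behind
rebased = overall.sub(overall.loc['LAP'], axis='columns')
print(rebased)
# gives something like:
#       SS1   SS2
# MIK  -2.2  -9.0
# LAP   0.0   0.0
# LOE   1.9  13.1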

I still need to add a few channels into the stage report. The ones I have for WRC are still the basic ones without any inline charts, but the tables are a bit more normalised and I’d need to sit down and think through what I need to pull from where to best generate the appropriate rows and from them the charts…

Here’s a reminder of what a rebased single stage chart looks like: the first column is road position, the second the overall gap at the end of the previous stage. The first numeric columns show how far the rebased driver was ahead (green) or behind (red) each other driver at each split. The Overall* column is the gap at the end of the stage (I should rename this, dropping the Overall* or maybe replacing it with Final); then come overall position and the overall rally time delta (i.e. the column that takes on the role of the Previous column in the next stage). The DN columns are the time gained/lost going between split points, which often highlights any particularly good or bad parts of the stage. For example, in the chart above, rebased on Lappi, the first split was dreadful, but then he was fastest going between splits 1 and 2, and fared well over 2–3 and 3–4.
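
Computing those between-split section times is a one-liner in pandas; a quick sketch with invented split times (the real data of course has per-driver split times for each stage):

import pandas as pd

# Made-up cumulative times (seconds) at each split point on a single stage
splits = pd.DataFrame({'split_1': [92.1, 90.3],
                       'split_2': [185.0, 184.9],
                       'split_3': [260.4, 261.0]},
                      index=['LAP', 'LOE'])

# Time taken going between consecutive split points; diff() leaves the first
# column as NaN, so fill it back in with the time to the first split
section_times = splits.diff(axis=1).fillna(splits)
print(section_times)

# Rebasing these section times against a chosen driver then works exactly as
# for the overall times: section_times.sub(section_times.loc['LAP'], axis='columns')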

Not Quite Serverless Reproducible Environments With Digital Ocean

I’ve been on something of a code rollercoaster over the last few days, fumbling with Jupyter server proxy settings in MyBinder, and fighting with OpenRefine, but I think I’ve stumbled on a new pattern that could be quite handy. In fact, I think it’s frickin’ ace, even though it’s just another riff, another iteration, on what we’ve been able to do before (e.g. along the lines of deploy to tutum, Heroku or zeit.now).

(It’s probably nothing that wasn’t out there already, but I have seen the light that issues forth from it.)

One of the great things about MyBinder is that it helps make your work reproducible. There’s a good, practical, review of why in this presentation: Reproducible, Reliable, Reusable Analyses with BinderHub on Cloud. One of the claims in the presentation is that it costs about $5000 a month to run MyBinder, covered at the moment by the project, though following the reference link I couldn’t see that number anywhere. What I did see, though, was something worth bearing in mind: “[MyBinder] users are guaranteed at least 1GB of RAM, with a maximum of 2GB”.

For OpenRefine running in MyBinder alongside other services, this goes some way to explaining why it may struggle at times…

So, how can we help?

And how can we get around not knowing what other stuff repo2docker, the build agent for MyBinder, might be putting into the server; or not being able to use Docker Compose to link across several services in the Binderised environment; or having to run MyBinder containers in public (although it looks as though even Binder auth in general may now be available)?

One way would be for institutions to chip in the readies to help keep the public MyBinder service free. Another way could be a sprint on a federated Binderhub in which institutions could chip in server resource. Another would be for institutions to host their own Binderhub instances, either publicly available or just available to registered users. (In the latter case, it would be good if the institutions also contributed developer effort, code, docs or community development back to the Jupyter project as a whole.)

Alternatively, we can roll our own server. But setting up a Binderhub instance is not necessarily the easiest of tasks (I’ve yet to try it…) and isn’t really the sort of thing your average postgrad or data journalist who wants to run a Binderised environment should be expected to have to do.

So what’s the alternative?

To my mind, Binderhub offers a serverless sort of experience, though that’s not to say no servers are involved. The model is that I can pick a repo, click a button, a server is started, my image built and a container launched, and I get to use the resulting environment. Find repo. Click button. Use environment. The servery stuff in the middle — the provisioning of a physical web server and the building of the environment — that’s nothing I have to worry about. It’s seamless and serverless.

Another thing to note is that MyBinder use cases are temporary / ephemeral. Launch Binderised app. Use it. Kill it.

This stands in contrast to setting services running for extended periods of time, and managing them over that period, which is probably what you’d want to do if you ran your own Binderhub instance. I’m not really interested in that. I want to: launch my environment; use it; kill it. (I keep trying to find a way of explaining this “personal application server” position clearly, but not with much success so far…)

So here’s where I’m at now: a nearly, or not quite, serverless solution; a bring your own server approach, in fact, using Digital Ocean, which is the easiest cloud hosting solution I’ve found for the sorts of things I want to do.

It’s based around the User data text box in a Digital Ocean droplet creation page:

If you pop a shell script in there, it will run the code that appears in that box once the server is started.

But what code?

That’s the pattern I’ve started exploring.

Something like this:

#!/bin/bash

#Optionally:
export JUPYTER_TOKEN=myOwnPA5%w0rD

#Optionally:
#export REFINEVERSION=2.8

GIST=d67e7de29a2d012183681778662ef4b6
git clone https://gist.github.com/$GIST.git
cd $GIST
docker-compose up -d

which will grab a script (saved as a set of files in a public gist) that downloads and installs an OpenRefine server inside a token protected Jupyter notebook server. (The OpenRefine server runs via a Jupyter server proxy; see also OpenRefine Running in MyBinder, Several Ways… for various ways of running OpenRefine behind a Jupyter server proxy in MyBinder.)
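
For reference, the server proxy part boils down to registering OpenRefine as a proxied service in the Jupyter notebook config, along these lines (a sketch based on the jupyter-server-proxy docs rather than the exact contents of the gist; the OpenRefine install path and data directory are assumptions):

# Fragment for jupyter_notebook_config.py: register OpenRefine with
# jupyter-server-proxy so it is served behind the notebook server's auth
c.ServerProxy.servers = {
    'openrefine': {
        # Command used to launch OpenRefine on the port the proxy allocates
        'command': ['/opt/openrefine/refine', '-p', '{port}',
                    '-d', '/home/jovyan/openrefine'],
        'timeout': 120,
        # Add an OpenRefine entry to the notebook launcher / New menu
        'launcher_entry': {'title': 'OpenRefine'},
    }
}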

Or this (original gist):

#!/bin/bash

#Optionally:
#export JUPYTER_TOKEN=myOwnPA5%w0rD

GIST=8fa117e34c62b7f80b6c595b8ba4f488

git clone https://gist.github.com/$GIST.git
cd $GIST

docker-compose up -d

that will download and launch a docker-compose defined set of services:

  • a Jupyter notebook server, seeded with a demo notebook and various SQL magics;
  • a Postgres server (empty; I really need to add a fragment showing how to seed it with data, though there’s a rough sketch just after this list; or you should be able to figure it out from here: Running a PostgreSQL Server in a MyBinder Container);
  • an AgensGraph server; AgensGraph is a graph database built on Postgres. The demo notebook currently uses the first part of the AgensGraph tutorial to show how to load data into it.
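
On the seeding front, one way (a sketch, not what’s in the gist; the connection string assumes a Postgres service named postgres on the compose network with default credentials, and that sqlalchemy and psycopg2 are available in the notebook container) is just to push a dataframe into the database from the demo notebook:

import pandas as pd
from sqlalchemy import create_engine

# Assumed connection details; change to match the docker-compose.yml settings
engine = create_engine('postgresql://postgres:postgres@postgres:5432/postgres')

# Load some data from a CSV file and write it into a table
df = pd.read_csv('mydata.csv')
df.to_sql('mytable', engine, if_exists='replace', index=False)

# Quick sanity check that the rows made it in
print(pd.read_sql('SELECT COUNT(*) FROM mytable', engine))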

(The latter example includes a zip file that you can’t upload via the Gist web interface; so here’s a recipe for adding binary files (zip files, image files) to a Github gist.)

So what do you need to do to get the above environments up and running?

  • go to Digital Ocean at https://www.digitalocean.com/ (this link will get you $100 credit if you need to create an account);
  • create a droplet;
  • select ‘one-click’ type, and the docker flavour;
  • select a server size (get the cheap 3GB server and the demos will be fine);
  • select a region (or don’t); I generally go for London cos I figure it’s locallest;
  • check the User data check box and paste in one of the above recipes (make sure the hashbang line (#!/bin/bash) is the very first line, with no blank lines above it);
  • optionally name the image (for your convenience and lack of admin panel eyesoreness);
  • click create;
  • copy the IP address of the server that’s created;
  • after 3 or 4 minutes (it may take some time to download the app containers into the server), paste the IP address into a browser location bar;
  • when presented with the Jupyter notebook login page, enter the default token (letmein; or the token you added in the User data script, if you did), or use it to set a different password at the bottom of the login page;
  • use your apps…
  • destroy the droplet (so you don’t continue to pay for it).

If that sounds too hard / too many steps, there are some pictures to show you what to do in the Creating a Digital Ocean Docker Droplet section of this post.

It’s really not that hard…

Though it could be easier. For example, imagine a “deploy to Digital Ocean” button that took something like the form http://deploygist.digitalocean.com/GIST, looked in the gist for a user_data file, and maybe other metadata files (region, server size, etc.), set a server running on your account and then redirected you to the appropriate webpage.
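
In the meantime, the Digital Ocean API will already let you script the droplet creation half of that, user data and all. A rough sketch (the size and image slugs are assumptions, and you need a personal access token from the Digital Ocean control panel):

import requests

TOKEN = 'YOUR_DIGITAL_OCEAN_API_TOKEN'   # personal access token from the DO control panel

# The same sort of script you'd paste into the User data box
user_data = '''#!/bin/bash
GIST=d67e7de29a2d012183681778662ef4b6
git clone https://gist.github.com/$GIST.git
cd $GIST
docker-compose up -d
'''

resp = requests.post(
    'https://api.digitalocean.com/v2/droplets',
    headers={'Authorization': 'Bearer {}'.format(TOKEN)},
    json={
        'name': 'binderish-demo',
        'region': 'lon1',           # London
        'size': 's-1vcpu-3gb',      # assumed slug for a cheap 3GB droplet
        'image': 'docker-18-04',    # assumed slug for the Docker one-click image
        'user_data': user_data,
    },
)
print(resp.status_code, resp.json())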

We don’t need to rely on just web clients either. For example, here’s a recipe for Connecting to a Remote Jupyter Notebook Server Running on Digital Ocean from Microsoft VS Code.

The next thing I need to think about is preserving state. This looks like it may be promising in that regard? Filestash [docs]. This might also be worth looking at: Pydio. (Or this: Graphite h/t @simonperry.)

For anyone still reading and still interested, here are some more reasons why I think this post is a useful one…

The linked gists are both based around Docker deployments (it makes sense to use Docker, I think, because a lot of hard to install software is already packaged in Docker containers), although they demonstrate different techniques:

  • the first (OpenRefine) demo extends a Jupyter container so that it includes the OpenRefine server; OpenRefine then hides behind the Jupyter notebook auth and is proxied using Jupyter server proxy;
  • the second (AgensGraph) demo uses Docker compose. The database services are headless and not exposed.

What I have tried to do in the docker-compose.yml and Dockerfile files is show a variety of techniques for getting stuff done. I’ll comment them more liberally, or write a post about them, when I get a chance. One thing I still need to do is a demo using nginx as a reverse proxy, with and without simple http auth. One thing I’m not sure how to do, if indeed it’s doable, is proxy services from a separate container using Jupyter server proxy; nginx would provide a way around that.

Adding Zip Files to Github Gists

Over the years, I’ve regularly used Github gists as a convenient place to post code fragments. Using the web UI, it’s easy enough to create new text files. But how do you add images or zip files to a gist…?

…because there’s no way I can see of doing it using the web UI?

But a gist is just a git repo, so we should be able to commit binary files to it.

Ish. Via this gist (How to add an image to a gist) comes this simple recipe; it requires that you have git installed on your computer…

#GISTID is the ID in the gist URL: https://gist.github.com/USERNAME/GISTID
GIST=GISTID

git clone https://gist.github.com/$GIST.git

#MYBINARYFILENAME is something like mydata.zip or myimage.png
cp MYBINARYPATH/MYBINARYFILENAME $GIST/

cd $GIST
git add MYBINARYFILENAME

git commit -m "MYCOMMITMESSAGE"

git push origin master
#If prompted, provide your Github credentials associated with the gist

Handy…

Connecting to a Remote Jupyter Notebook Server Running on Digital Ocean from Microsoft VS Code

Despite seeing talk of Jupyter notebook integration in Microsoft Visual Studio (VS) Code, I didn’t do much more than pass it on (via the Tracking Jupyter newsletter) because I thought it was part of a heavyweight Visual Studio IDE.

Not so.

Microsoft Visual Studio Code is an Electron app, reminiscent-ish of the Atom editor (maybe?), that’s available as a quite compact download across Windows, Mac and Linux platforms.

Navigating the VS Code UI is probably the hardest part of connecting it to a Jupyter kernel, remote or local, so let’s see what’s involved.

If you haven’t got VS Code installed, you’ll need to download and install it.

Install the Python extension and reload…

Now let’s go hunting for the connection dialogue…

From the Command Palette, search for Python: Specify Jupyter server URI (there may be an easier way: I’ve spent all of five minutes with this environment!):

You’ll be prompted with another dialogue. Select the Type in the URI to connect to a running Jupyter server option:

and you’ll be prompted for a URI. But what URI?

Let’s launch a Digital Ocean server.

If you don’t have a Digital Ocean account you can create one here and get $100 free credit, which is way more than enough for this demo.

Creating a server is quite straightforward. There’s an example recipe here: you’ll need to create a Docker server as a one-click app, select your region and server size (a cheap 2GB server will be plenty), and then enter the following into the User data area:

#!/bin/bash

docker run -d --rm -p 80:8888 -e JUPYTER_TOKEN=letmein jupyter/minimal-notebook

You can now create your server (optionally naming it for convenience):

The server will be launched and after a moment or two it will be assigned a public IP address. Copy this address and paste it into a browser location bar — this is just to help us monitor when the Jupyter server is ready (it will probably take a minute or two to download and install the notebook container into the server).
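
(If you’d rather poll than keep hitting refresh, a couple of lines of Python will do the watching for you; a sketch, with a made-up IP address:)

import time
import requests

IP = '203.0.113.10'   # replace with your droplet's IP address

# Keep knocking on the front door until the notebook server answers
while True:
    try:
        if requests.get('http://{}/'.format(IP), timeout=5).ok:
            print('Notebook server is up at http://{}/'.format(IP))
            break
    except requests.exceptions.RequestException:
        pass
    time.sleep(10)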

When you see the notebook server (no need to log in, unless you want to; the token is letmein, or whatever you set it to in the User data form), you can enter the following into the VS Code server URI form using the IP address of your server:

http://IPADDRESS?token=letmein

In VS Code, raise the Command Palette again and start to search for Python: Show Python Interactive window.

When you select it, a new interactive Python tab will be opened, connected to the remote server.

You should now be able to interact with your remote IPython kernel running on a Digital Ocean server.

See Working with Jupyter Notebooks in Visual Studio Code for some ideas of what to do next… (I should probably work through this too…)

If you want to change the remote Jupyter kernel URL, you either need to quit VS Code, restart it, and go through the adding a connection URI process again, or dip into the preferences (h/t Nick H. in the TM351 forums for that spot):

When you’re done, go back to the Digital Ocean control panel and destroy the droplet you created. If you don’t, you’ll continue to be billed at its hourly rate for each hour, or part thereof, that you keep it around (switched on or not, there’s still a rental charge). If you treat the servers as temporary servers, and destroy them when you’re done, your $100 can go a long way…

Quick Review – Jupyter Multi Outputs Notebook Extension

This post represents a quick review of the Jupyter multi-outputs Jupyter notebook extension.

The extension is one of a series of extensions developed by the Japanese National Institute of Informatics (NII) Literate Computing for Reproducible Infrastructure project.

My feeling is that some of these notebook extensions may also be useful in an educational context for supporting teaching and learning activities within Jupyter notebooks, and I’ll try to post additional reviews of some of the other extensions.

So what does the multi-outputs extension offer?

We can also save the output of a cell into a tab identified by the cell execution number. Once the cell is run, click on the pin item in the left hand margin to save that cell output:

The output is saved into a tab numbered according to the cell execution count number. You can now run the cell again:

and click on the previously saved output tab. You may notice that when you select a previous output tab, a left/right arrow “show differences” icon appears:

Click on that icon to compare the current and previous outputs:

(I find the graphical display a little confusing, but that’s typical of many differs! If you look closely, you may see green (addition) and red (deletion) highlighting.)

The differ display also supports simple search (you need to hit Return to register the search term as such.)

The saved output is actually saved as notebook metadata associated with the cell, which means it will persist when the notebook is closed and restarted at a later date.

One of the hacky tools I’ve got in tm351_utils (which really needs some example notebooks…) is a simple differencing display. I’m not sure whether any of the differ-using notebooks I suggested during the last round of TM351 revisions made it into the finally released set, but it might be worth comparing that approach (diffing across the outputs of two cells) with this one (diffing between two outputs from the same cell, run at different times or with different parameters/state).
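
(At its simplest, the diffing step is the sort of thing Python’s difflib gives you for free; this is just a bare sketch, not the tm351_utils display, comparing two made-up text outputs:)

import difflib

# Two text outputs captured from separate runs of the same cell (made up here)
out1 = 'apples,10\npears,7\nplums,3'
out2 = 'apples,10\npears,9\ncherries,4'

# Show line level deletions (-) and additions (+) between the two outputs
for line in difflib.unified_diff(out1.splitlines(), out2.splitlines(),
                                 fromfile='run 1', tofile='run 2', lineterm=''):
    print(line)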

Config settings appear to be limited to the maximum number of saved / historical tabs per cell:

So, useful? I’ll try to work up some education related examples. (If you have any ideas for some, or have already identified and/or demonstrated some, please let me know via the comments.)

OpenRefine Hangs On Start…

If you’re testing OpenRefine on things like Digital Ocean, use at least a 3GB server, ideally more.

If you use a 2GB server and default OpenRefine start settings, you may find it stalls on start and just hangs, particularly if you are running it via a Docker container.

(I just wasted four, going on five, hours trying to debug what I thought were other issues, when all the time it was a poxy memory issue.)

So be warned: when testing Java apps in Docker containers / Docker Compose configurations, use max spec, not min spec machines.

I waste hours of my evenings and weekends on this sort of crap so you don’t have to… #ffs