Fragment – Running Multiple Services, such as Jupyter Notebooks and a Postgres Database, in a Single Docker Container

Over the last couple of days, I’ve been fettling the build scripts for the TM351 VM, which typically uses vagrant to build a VirtualBox VM from a set of shell scripts, so they can be used to build a single Docker container that runs all the TM351 services, specifically Jupyter notebooks, OpenRefine, PostgreSQL and MongoDB.

Docker containers are typically constructed to a run a single service, with compositions of containers wired together using Docker Compose to create applications that deliver, or rely on, more than one running service. For example, in a previous post (Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API) I showed how to set up a couple of containers to work together, one running a MySQL database server, the other an http service that provided an API to the database.

So how to run multiple services in the same container? Docs on the Docker website suggest using supervisord to run multiple services in a single container, so here’s a fragment on how I’ve done that from my TM351 build.

To begin with, I’ve built the container up as a tiered set of containers, in a similar way to the way the stack of opinionated Jupyter notebook Docker containers are constructed:

#Define a stub to identify the images in this image stack
IMAGESTUB=psychemedia/tm361testm

# minimal
## Define a minimal container, eg a basic Linux container
## using whatever flavour of Linux we prefer
docker build --rm -t ${IMAGESTUB}-minimal-test ./minimal

# base
## The base container installs core packages
## The intention is to define a common build environment
## populated with packages likely to be common to many courses
docker build --rm --build-arg BASE=${IMAGESTUB}-minimal-test -t ${IMAGESTUB}-base-test ./base

#...

One of the things I’ve done to try to generalise the build steps is allow the name a base container to be used to bootstrap a new one by passing the name of the base image in via an optional variable (in the above case, --build-arg BASE=${IMAGESTUB}-minimal-test). Each Dockerfile in a build step directory uses the following construction to work out which image to use as the FROM basis:

#Set ARG values using --build-arg =
#Each ARG value can also have a default value
ARG BASE=psychemedia/ou-tm351-base-test
FROM ${BASE}

Using the same approach, I have used separate build tiers for the following components:

  • jupyter base: minimal Jupyter notebook install;
  • jupyter custom: add some customisation onto a pre-existing Jupyter notebook install;
  • openrefine: add the OpenRefine application; (note, we could just use BASE=ubuntu to create this a simple, standalone OpenRefine container);
  • postgres: create a seeded PostgreSQL database; note, this could be split into two: a base postgres tier and then a customisation that adds users, creates and seed databases etc;
  • mongodb: add in a seeded mongo database; again, the seeding could be added as an extra tier on a minimal database tier;
  • topup: a tier to add in anything I’ve missed without having to go back to rebuild from an earlier step…

The intention behind splitting out these tiers is that we might want to have a battle hardened OU postgres tier, for example, that could be shared between different courses. Alternatively, we might want to have tiers offering customisations for specific presentations of a course, whilst reusing several other fixed tiers intended to last out the life of the course.

By the by, it can be quite handy to poke inside an image once you’ve created it to check that everything is in the right place:

#Explore inside animage by entering it with a shell command
docker run -it --entrypoint=/bin/bash psychemedia/ou-tm351-jupyter-base-test -i

Once the services are in place, I add a final layer to the container that ensures supervisord is available and set up with an appropriate supervisord.conf configuration file:

##Dockerfile
#Final tier Dockerfile
ARG BASE=psychemedia/testpieces
FROM ${BASE}

USER root
RUN apt-get update && apt-get install -y supervisor

RUN mkdir -p /openrefine_projects  && chown oustudent:100 /openrefine_projects
VOLUME /openrefine_projects

RUN mkdir -p /notebooks  && chown oustudent:100 /notebooks
VOLUME /notebooks

RUN mkdir -p /var/log/supervisor
COPY monolithic_container_supervisord.conf /etc/supervisor/conf.d/supervisord.conf

EXPOSE 3334
EXPOSE 8888

CMD ["/usr/bin/supervisord"]

The supervisord.conf file is defined as follows:

##supervisord.conf
##We can check running processes under supervisord with: supervisorctl

[supervisord]
nodaemon=true
logfile=/dev/stdout
loglevel=trace
logfile_maxbytes=0
#The HOME envt needs setting to the correct USER
#otherwise jupyter throws: [Errno 13] Permission denied: '/root/.local'
#https://github.com/jupyter/notebook/issues/1719
environment=HOME=/home/oustudent

[program:jupyternotebook]
#Note the auth is a bit ropey on this atm!
command=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --y --log-level=WARN --no-browser --allow-root --NotebookApp.password= --NotebookApp.token=
#The directory we want to start in
#(replaces jupyter notebook parameter: --notebook-dir=/notebooks)
directory=/notebooks
autostart=true
autorestart=true
startsecs=5
user=oustudent
stdout_logfile=NONE
stderr_logfile=NONE

[program:postgresql]
command=/usr/lib/postgresql/9.5/bin/postgres -D /var/lib/postgresql/9.5/main -c config_file=/etc/postgresql/9.5/main/postgresql.conf
user=postgres
autostart=true
autorestart=true
startsecs=5

[program:mongodb]
command=/usr/bin/mongod --dbpath=/var/lib/mongodb --port=27351
user=mongodb
autostart=true
autorestart=true
startsecs=5

[program:openrefine]
command=/opt/openrefine-3.0-beta/refine -p 3334 -i 0.0.0.0 -d /vagrant/openrefine_projects
user=oustudent
autostart=true
autorestart=true
startsecs=5
stdout_logfile=NONE
stderr_logfile=NONE

One thing I need to do better is to find a way to stage the construction of the supervisord.conf file, bearing in mind that multiple tiers may relate to the same servicel for example, I have a jupyter-base tier to create a minimal Jupyter notebook server and then a jupyter-base-custom tier that adds in specific customisations, such as branding and course related notebook extensions.

When the final container is built, the supervisord command is run and the multiple services started.

One other thing to note: we’re hoping to run TM351 environments on an internal OpenStack cluster. The current cluster only allows students to expose a single port, and port 80 at that, from the VM (IP addresses are in scant supply, and network security lockdowns are in place all over the place). The current VM exposes at least two http services: Jupyter notebooks and OpenRefine, so we need a proxy in place if we are to expose them both via a single port. Helpfully, the nbserverproxy Jupyter extension (as described in Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy), allows us to do just that. One thing to note, though – I had to enable it via the same user that launches the notebook server in the suoervisord.conf settings:

##Dockerfile fragment

RUN $PIP install nbserverproxy

USER oustudent
RUN jupyter serverextension enable --py nbserverproxy
USER root

To run the VM, I can call something like:

docker run -p 8899:8888 -d psychemedia/tm351dockermonotest

and then to access the additional services, I can browse to e.g. localhost:8899/proxy/3334/ to see the OpenRefine application.

PS in case you’re wondering why I syndicated this through RBloggers too, the same recipe will work if you’re using Jupyter notebooks with an R kernel, rather than the default IPython one.

Fragment: On Reproducible Open Educational Resources

Via O’Reilly’s daily Four Short Links feed, I notice the Open Logic Project, “a collection of teaching materials on mathematical logic aimed at a non-mathematical audience, intended for use in advanced logic courses as taught in many philosophy departments”. In particular, “it is open-source: you can download the LaTeX code [from Github]”.

However:

the TeX source does mean you need a (La)TeX environment to run it (and the project does bundle some of the custom .sty style files you need in the repo, which is handy).

Compare this with the Simple Maths Equations and Notation notebook I’ve started sketching as part of a self-started, informal “reproducible OERs with Jupyter notebooks” project I’m dabbling with:

Here, a Jupyter notebook contains LaTeX code can then be rendered (in part?) through the notebook previewer – at least in so far as expressions are written in Mathjax parseable code – and also within a live / running Jupyter notebook. Not only do I share the reproducible source code (as a notebook), I also share a link to at least one environment capable of running it, and that allows it to be reused with modification. (Okay, in this case, not openly so because you have to have an Azure Notebooks account. But the notebook could equally run on Binderhub or a local install, perhaps with one or two additional requirements if you don’t already run a scientific Python environment.)

In short, for a reproducible OER that supports reuse with modification, sharing the means of production also means sharing the machinery of production.

To simplify the user experience, the notebook environment can be preinstalled with packages needed to render a wider range of TeX code, such as drawings rendered using TikZ. Alternatively, code cells can be populated with package installation commands to custom a more vanilla environment, as I do in several demo notebooks:

What the Open Logic Project highlights is that reproducible OERs not only provide ready access to the “source code” of a resource so that it can be easily reused with modification, but that access to an open environment capable of processing that source code and rendering the output document also needs to be provided. (Open / reproducible science researchers have known this for some time…)

Getting a Tex/LateX environment up and running can be a faff – and can also take up a lot of disk space – so the runtime environment requirements are not negligible.

In the case of Jupyter notebooks, LateX support is available, and container images capable of running on Binderhub, for example, relatively easily defined (see for example the Binder LateX example). (I’m not sure how rich Stencila support for LaTeX is too, and/or whether it requires an external LaTeX environment when running the Stencila desktop app?)

It also strikes me that another thing we should be doing is export a copy of the finished work, eg as a PDF or complete, self-standing HTML archive, in case the machinery does break. This is also important where third party services are called. It may actually make sense to use something like requests for all third party URL requests, and save a cached version of all requests (using requests-cache) to provide a local copy of whatever it was that was called when originally flowing the document.

See also: OER Methods – Generative Designs for Reuse-With-Modification

An Easier Approach to Electrical Circuit Diagram Generation – lcapy

Whilst I might succumb in my crazier evangelical moments to the idea that academic authors (other than those who speak LateX natively) and media developers might engage in the raw circuitikz authoring described Reproducible Diagram Generators – First Thoughts on Electrical Circuit Diagrams, the reality is that it’s probably just way too clunky, and a little bit too far removed from the everyday diagrams educators are likely to want to create, to get much, if any, take up.

However, understanding something of the capabilities of quite low level drawing packages, and reflecting (as in the last post) on some of the strategies we might adopt for creating reusable, maintainable, revisable with modification and extensible diagram scripts puts us in good stead for looking out for more usable approaches.

One such example is the Python lcapy package, a linear circuit analysis package that supports:

  • the description of simple electrical circuits at a sloghtly hoger level than the raw circuitikz circuit creation model;
  • the rendering of the circuits, with a few layout cues, using circuitikz;
  • numerical analysis of the circuits in terms of response in time and frequency domains, and the charting of the results of the analysis; and
  • various forms of symbolic analysis of circuit descriptions in various domains.

Here are some quick examples to give a taste of what’s possible.

You can run the notebook (albeit subject to significant changes) that contains the original working for examples used in this post on Binderhub: Binder

Here’s a simple circuit:

And here’s how we can create it in lcapy from a netlist, annotated with cues for the underlying circuitikz generator about how to lay out the diagram.

from lcapy import Circuit

cct = Circuit()
cct.add("""
Vi 1 0_1 step 20; down
C 1 2; right, size=1.5
R 2 0; down
W 0_1 0; right
W 0 0_2; right, size=0.5
P1 2_2 0_2; down
W 2 2_2;right, size=0.5""")

cct.draw(style='american')

The things to the right of the semicolon on each line are the optional layout elements – they’re not required when defining the actual circuit itself.

The display of nodes and numbered nodes are all controllable, and the symbol styles are selectable between american, british and european stylings.

The lcapy/schematic.py package describes the various stylings as composites of circuitikz regionalisations, and could be easily extended to support a named house style, or perhaps accommodate a regionalisation passed in as an explicit argument value.

if style == 'american':
    style_args = 'american currents, american voltages'
elif style == 'british':
    style_args = 'american currents, european voltages'
elif style == 'european':
    style_args = ('european currents, european voltages, european inductors, european resistors')

As well as constructing circuits from netlist descriptions, we can also create them from network style descriptions:

from lcapy import R, C, L

cct2= (R(1e6) + L(2e-3)) | C(3e-6)
#For some reason, the style= argument is not respected
cct2.draw()

The diagrams generated from networks are open linear circuits rather than loops, which may not be quite what we want. But these circuits are quicker to write, so we can use them to draft netlists for us that we may then want to tidy up a bit further.

print(cct2.netlist())

'''
W 1 2; right=0.5
W 2 4; up=0.4
W 3 5; up=0.4
R 4 6 1000000.0; right
W 6 7; right=0.5
L 7 5 0.002; right
W 2 8; down=0.4
W 3 9; down=0.4
C 8 9 3e-06; right
W 3 0; right=0.5
'''

Circuit descriptions can also be loaded in from a named text file, which is handy for course material maintenance as well as reuse of circuits across materials: it’s easy enough to imagine a library of circuit descriptions.

#Create a file containing a circuit netlist
sch='''
Vi 1 0_1 {sin(t)}; down
R1 1 2 22e3; right, size=1.5
R2 2 0 1e3; down
P1 2_2 0_2; down, v=V_{o}
W 2 2_2; right, size=1.5
W 0_1 0; right
W 0 0_2; right
'''

fn="voltageDivider.sch"
with open(fn, "w") as text_file:
    text_file.write(sch)

#Create a circuit from a netlist file
cct = Circuit(fn)

The ability to create – and share – circuit diagrams in a Python context that plays nicely with Jupyter notebooks is handy, but the lcapy approach becomes really useful if we want to produce other assets around the circuit we’ve just created.

For example, in the case of the above circuit, how do the various voltage levels across the resistors respond when we switch on the sinusoidal source?

import numpy as np
t = np.linspace(0, 5, 1000)
vr = cct.R2.v.evaluate(t)
from matplotlib.pyplot import figure, savefig
fig = figure()
ax = fig.add_subplot(111, title='Resistor R2 voltage')
ax.plot(t, vr, linewidth=2)
ax.plot(t, cct.Vi.v.evaluate(t), linewidth=2, color='red')
ax.plot(t, cct.R1.v.evaluate(t), linewidth=2, color='green')
ax.set_xlabel('Time (s)')
ax.set_ylabel('Resistor voltage (V)');

Not the best example, admittedly, but you get the idea!

Here’s another example, where I’ve created a simple interactive to let me see the effect of changing one of the component values on the response of a circuit to a step input:

(The nice plotting of the diagram gets messed up unfortunately, at least in the way I’ve set things up for this example…)

As the code below shows, the @interact decorator from ipywidgets makes it trivial to create a set of interactive controls based around the arguments passed into a function:

import numpy as np
from matplotlib.pyplot import figure, savefig

@interact(R=(1,10,1))
def response(R=1):
    cct = Circuit()

    cct.add('V 0_1 0 step 10;down')
    cct.add('L 0_1 0_2 1e-3;right')
    cct.add('C 0_2 1 1e-4;right')
    cct.add('R 1 0_4 {R};down'.format(R=R))
    cct.add('W 0_4 0; left')

    t = np.linspace(0, 0.01, 1000)
    vr = cct.R.v.evaluate(t)

    fig = figure()
    #Note that we can add Greek symbols from LaTex into the figure text
    ax = fig.add_subplot(111, title='Resistor voltage (R={}$\Omega$)'.format(R))
    ax.plot(t, vr, linewidth=2)
    ax.set_xlabel('Time (s)')
    ax.set_ylabel('Resistor voltage (V)')
    ax.grid(True)
    
    cct.draw()

Using the network description of a circuit, it only takes a couple of lines to define a circuit and then get the transient response to step function for it:

Again, it doesn’t take much more effort to create an interactive that lets us select component values and explore the effect they have on the damping:

As well as the numerical analysis, lcapy also supports a range of symbolic analysis functions. For example, given a parallel resistor circuit, defined using a network description, we can find the overall resistance in simplest terms:

Or for parallel capacitors:

Some other elementary transformations we can apply – providing expressions for the an input voltage in the time or Laplace/s domain:

We can also create pole-zero plots quite straightforwardly, directly from an expression in the s-domain:

This is just a quick skim through of some of what’s possible with lcapy. So how and why might it be useful as part of a reproducible educational resource production process?

One reason is that several of the functions can reduce the “production distance” between different likely components of a set of educational materials.

For example, given a particular circuit description as a netlist, we can annotate it with direction cribs in order to generate a visual circuit diagram, and we can use a circuit created from it directly (or from the direction annotated script) to generate time or frequency response charts. (We can also obtain symbolic transfer functions.)

When trying to plot things like pole zero charts, where it is important that the chart matches a particular s-domain expression, we can guarantee that the chart is correct by deriving it directly from the s-domain expression, and then rendering that expression in pretty LaTeX equation form in the materials.

The ability to simplify expressions  – as in the example of the simplified expressions for overall capacitance or resistance in the parallel circuit examples above – directly from a circuit description whilst at the same time using that circuit description to render the circuit diagram, also reduces the amount of separation between those two outputs to zero – they are both generated from the self-same source item.

You can run the notebook (albeit subject to significant changes) that contains the original working for examples used in this post on Binderhub: Binder

Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy

Over the last couple of weeks I’ve been circling, but failing to make much actual progress on using, OpenStack as a platform for making self-serve OU hosted VMs available to students. (I’m increasingly starting to think this is not sensible, but I’m struggling to find someone I can chat to about it… OpenStack is too enterprise, like a heavy “Java” thing where I need a just works “Python” thing…).

Anyway.

One of the issues with the OU Faculty OpenStack setup is the way the security model locks everything down. Not only is no API access available, there is also a limit on IP address allocation and open ports are limited to port 80 (and maybe port 22? Or maybe not.)

For the TM351 VM – which is what we’re looking to put onto OU OpenStack – we have been exposing services on at least two http ports, one for the Jupyter notebooks and one for OpenRefine. (The latest build also has a simple VM webserver, and I’m experimenting with a notebook search engine. Optionally, we have also allowed students to open up ports to the PostgreSQL and MongoDB services.)

If I do find a sensible way to get the VM running on OpenStack, finding a way to shove all the http services through port 80 looks like a necessary requirement. Previously, I’d noticed that @betatim’s openrefineder demo made use of a proxy to expose the OpenRefine service via the Jupyter notebook port, and looking at it again today I noticed that the nbopenrefineproxy package it was using is available as a Jupyterhub project package: jupyterhub/nbserverproxy.

In the current TM351 VM set-up, we have the following:

  • Jupyter notebook on guest port 8888, host port 35180
  • OpenRefine on guest port 3334, host port 35181

However, if I install and enable nbserverproxy, and restart the Jupyter notebook server, I can now find OpenRefine proxied as http://localhost:35180/proxy/3334/ as well as on http://localhost:35181.

One gotcha to note is that the OpenRefine page doesn’t render properly from that URL without the trailing slash because the OpenRefine HTML includes relative links to assets:

...
<link type="text/css" rel="stylesheet" href="externals/select2/select2.css" />
 <link type="text/css" rel="stylesheet" href="externals/tablesorter/theme.blue.css" />
...

which resolve as e.g. http://localhost:35180/proxy/externals/select2/select2.css (404).

However, with the trailing slash, the links do resolve correctly (e.g. as http://localhost:35180/proxy/3334/externals/select2/select2.css) when the trailing slash is added.

Handy… and the way to go if we do get this running on OpenStack.

PS if you know of a baby steps tutorial that shows how I can build a custom VM image on a Mac that I can upload to OpenStack, please let me know via the comments. Or otherwise get in touch if you can talk me through the various approaches.

First Class R Support in Binder / Binderhub – Shiny Apps As Well as R-Kernels and RStudio

I notice from the binder-examples/r repo that Binderhub now appears to offer all sorts of R goodness out of the can, if you specify a particular R build.

From the same repo root, you can get:

And from previously, here’s a workaround for displaying R/HTMLwidgets in a Jupyter notebook.

OpenRefine is also available from a simple URL – https://mybinder.org/v2/gh/betatim/openrefineder/master?urlpath=openrefine – courtesy of betatim/openrefineder:

Perhaps it’s time for me to try to get my head round what the Jupyter notebook proxy handlers are doing…

PS see also Scripted Forms for a simple markdown script way of building interactive Jupyter widget powered UIs.

More Thoughts On Jupyter Notebook Search

Following on from initial sketch of Searching Jupyter Notebooks Using lunr, here’s a quick first pass [gist] at pouring Jupyter notebook cell contents (code and markdown) into a SQLite database, running a query over it and then inspecting the results using a modified NLTK text concordancer to show the search phrase in the context of where it’s located in a document.

The concordancer means we can offer a results listing more in accordance with a traditional search engine, showing just the text in the immediate vicinity of a search term. (Hmm, I’d need to check what happens if the search term appears multiple times in the search result text.) This means we can offer a tidier display the dumping the contents of a complete cell into the results listing.

The table the notebook data is added to is created so that it supports full text search. However, I imagine that any stemming that we could apply is not best suited to indexing code.

Similarly, the NLTK tokeniser doesn’t handle code very well. For example, splits occur around # and % symbols, which means things like magics, such as %load_ext, aren’t recognised; instead, they’re split into separate tokens: % and load_ext.

A bigger issue for the db approach is that I need to find a way to update / clean the database as and when notebooks are saved, updated, deleted etc.

PS sqlbiter provides a way of ingesting – and unpacking – JUpyter notebooks into a sqlite database.

PPS Handy Python command line tool for searching notebooks: https://github.com/conery/nbscan

Install it into TM351 VM from a Jupyter notebook code cell by running the following command when connected to the internet:

!sudo pip install git+https://github.com/conery/nbscan.git

Search for things in notebooks using commands like:

  • search in code cells in notebooks in current directory (.) and all child directories for a phrase: !nbscan.py --dir . --grep 'import pandas' --code
  • search in all cells for the word ‘pandas’: !nbscan.py --dir . --grep pandas
  • search in markdown cells for the pattern 'data repr\w*' (that is, the phrase starting data repr…):!nbscan.py --dir . --grep 'data repr\w*' --markdown

Would be handy to make a simple magic for this?

It might also be useful to take nbscan as a quick real time search tool then run results through the concordancer when displaying them?

Jigsaw Pieces – Linux Service Indicators, Jupyter Kernel Monitoring and Environment Management

Something I’ve been pondering for some time is how to set up some simple Linux service monitoring so that I can display an in indicator light in a web page to show whether a Linux service is running or not.

For example, in the TM351 VM, it could be handy to display some indicator lights in a Jupyter notebook status bar showing whether the database services we connect to from the notebooks are running correctly,

So here are some pieces that may contribute to that:

My thinking is:

  • use monit to monitor a process; if the process is down, write to a service status file in my www server directory, eg service_servicename_status.txt. If a service is running the contents of this file are 1, otherwise 0;
  • use the JQuery fragment to poll the status file every few seconds;
  • if the status file returns 0, display a red indicator, otherwise green.

Here are some other monitoring / environment managing fragments I’m pondering:

  • something like ps_mem, a Python utility *to accurately report the in core memory usage for a program*. I’m wondering if I could use that to track how much memory each Jupyter notebook python kernel is taking up (or maybe monit can do that?) There’s an old extnesion that looks like ti shows reports: nbtop. Or perhaps use psutil (via this issue, which seems to offer a solution?);
  • a minimal example of setting up notebook homepage tab for a hello world webpage; Writing a notebook server extension looks like it has the ingredients, and nb_conda provides a fuller working example. Actually, that extension looks useful for *Jupyter-as-a-learning-environment* because it lets you select different conda environments, which could be handy for running different activities.

Any other examples out there of Jupyter monitoring / environment management?