Using Docker to Build Linked Container Course VMs

Having spent bits of last year tinkering with vagrant and puppet as part of a workflow for building and deploying course-related VMs in a scalable way for a distance education context (trying to be OUseful here…), I’ve more recently started pondering whether it makes more sense to create virtual machines from linked data containers.

Some advantages of the “all in one flat VM” approach seem to be that we can construct puppet files to build particular components and then compose the final machine configuration from a single Vagrant script pulling in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs: there is an overhead associated with running a VM which needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch-it-off-and-switch-it-on-again opportunities: a machine can be shut down and restarted/reprovisioned, or if necessary deleted and reinstalled (though this latter loses any state that was saved internally in the VM by the student). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted, we can just destroy the container and fire up a replacement (though we’d probably need to find ways of preserving – or deleting – state associated with a particular application container too). If applications run as services, and for example I have a notebook server connected to a database server, if I destroy the database server container, I should be able to point the notebook server to the new database server – if I know the address of the new database server…

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

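notebooks:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/notebooks:/notebooks"
data:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/data:/data"

devpostgresdata:
    command: echo created
    image: busybox
    volumes:
        - /var/lib/postgresql/data

devpostgres:
    image: postgres
    ports:
        - "5433:5432"
    volumes_from:
        - devpostgresdata

# service name assumed: it is not recoverable from the listing
notebook:
    environment:
        - PASSWORD
    image: calvingiles/data-science-environment
    links:
        - devpostgres:postgres
    ports:
        - "443:8888"
    volumes_from:
        - notebooks
        - data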

(WordPress code plugin & editor sucking atm wrt the way it keeps trying to escape stuff…)

(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers fired up with the desired password settings:

fig rm
PASSWORD=MyPass fig up -d

(The script also creates Google Drive folders in which copies of the notebooks are kept and shared between the VM containers and the host.)

The notebooks can then be accessed via a browser (you need to log in with the specified password, MyPass in the example above). The location of the notebooks is https://IP.ADDRESS:443 (note the https, which may require you to say “yes, really load the page” to Google Chrome, though it is possible to configure the server to use just http), where IP.ADDRESS can be found by running boot2docker ip.

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn’t see it on localhost). I found I needed to connect to the actual IP address within the VM of the database container.

I found this address (IPADDRESS) from the docker commandline using: fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS,port=5432,user='postgres',password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternatively, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon
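
For what it's worth, the same workaround can be sketched in plain Python rather than notebook shell magic. This is only a rough equivalent under some assumptions: the helper names default_gateway and postgres_url are made up for illustration, the sample routing table is invented, and the routing table text is assumed to look like typical Linux netstat -nr output. The password is URL-escaped so that characters such as @ don't break the connection string.

```python
from urllib.parse import quote_plus


def default_gateway(routing_table):
    """Return the gateway of the first default route (destination 0.0.0.0).

    `routing_table` is the text printed by `netstat -nr`; this roughly mirrors
    the grep/awk pipeline, which prints the second column of matching rows.
    """
    for line in routing_table.splitlines():
        fields = line.split()
        if len(fields) >= 2 and fields[0] == "0.0.0.0":
            return fields[1]
    return None


def postgres_url(host, port, user, password):
    """Build a postgresql:// URL, escaping the user name and password."""
    return "postgresql://{}:{}@{}:{}".format(
        quote_plus(user), quote_plus(password), host, port
    )


# Illustrative routing table only; the addresses are invented.
SAMPLE = """Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         172.17.42.1     0.0.0.0         UG        0 0          0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U         0 0          0 eth0
"""

host = default_gateway(SAMPLE)
print(postgres_url(host, 5433, "postgres", "PGPass"))
```

In a notebook, the routing table text could come from something like subprocess.check_output(["netstat", "-nr"]).decode(), and the resulting URL passed to the SQL magic in place of the hand-built string.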

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering: how would any particular student know how to connect to “their” personal database server(s)? [UPDATE: doh! Container linking passes name information into a container as an environment variable: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see other containers from their bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules and communication of where to find things be managed? Or am I overcomplicating?!
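
Following up on that UPDATE, here is a minimal sketch of how a notebook might look up “its” database from the link variables rather than a hard-coded address. It assumes the link alias postgres from the fig script above (devpostgres:postgres) and Docker's link naming pattern of ALIAS_PORT_<port>_<PROTO>_ADDR and ALIAS_PORT_<port>_<PROTO>_PORT; the helper name linked_service_address is hypothetical, and the address in the example is invented.

```python
import os


def linked_service_address(alias, port, env=None, proto="tcp"):
    """Return (host, port) for a linked container, read from link variables.

    Assumes variables named like POSTGRES_PORT_5432_TCP_ADDR, which Docker
    links set inside the container that declares the link.
    """
    env = os.environ if env is None else env
    prefix = "{}_PORT_{}_{}".format(alias.upper(), port, proto.upper())
    try:
        return env[prefix + "_ADDR"], int(env[prefix + "_PORT"])
    except KeyError:
        raise LookupError(
            "no link variables found for {!r} on port {}".format(alias, port)
        )


# Stand-in for os.environ inside the notebook container (invented address).
fake_env = {
    "POSTGRES_PORT_5432_TCP_ADDR": "172.17.0.2",
    "POSTGRES_PORT_5432_TCP_PORT": "5432",
}
print(linked_service_address("postgres", 5432, env=fake_env))
```

Inside the real notebook container the env argument would be left out, and the returned host and port handed to psycopg2.connect(host=..., port=..., user='postgres', password='PGPass'). If a student's database container is destroyed and replaced, restarting the linked notebook container should pick up the new address the same way.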

Anyway – this is a candidate way for constructing a VM out of containers in an automated way. So what are the relative merits, pros, cons etc of using the vagrant/puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (eg a trad-uni computer lab with VMs running on student desktops (if that’s allowed!); a distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud etc)?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES',port='5432',user='postgres', password="PGPass")

Check the other environment variables with:

import os
os.environ


  1. CraigM

    For tech training (hw/sw support/service engineer, partners and customers) we deployed a very similar solution many years ago utilising Solaris/illumos zones sitting atop ZFS as a filesystem. The significant advantage there was that the underlying filesystem brought snapshot/rollback capability to each of the containers’ filesystems, and therefore it was very easy not only to deploy known states, but also to record check states which could be rolled forward/back corresponding to class state. Most of the other advantages you cite from Docker use were available from zones, where our policy was, rather presciently, generally one app per zone.

    Made complicated training classes with many interacting components very easy to run/deploy. Particularly when someone got out of step with the class, moving them back on track was very easy.

    More recently I’ve deployed the same principle but utilising SmartOS/OmniOS, which has the advantage of simple deployment of x86 environments running inside the same zone architecture. Virtual switch configs and firewall-like setups are simple utilising the embedded networking, and deploying from a single image (particularly with SmartOS) has been great for class and remote runs.

    Can’t recommend it enough …

    • Tony Hirst

      That’s interesting – I need to ponder that (eg what we can learn from support/service training models in hardware/software support when it comes to trying to design ways of deploying VM configurations to students that helpdesk stand a chance of being able to support, or that we can help students debug via FAQs).

      It also makes me realise just how far away I am from knowing what goes on in the real world!;-)