Using Docker to Build Linked Container Course VMs

Having spent bits of last year tinkering with Vagrant and Puppet as part of a workflow for building and deploying course-related VMs in a scalable way for a distance education context (trying to be OUseful here…), I've more recently started pondering whether it makes more sense to create virtual machines from linked Docker containers.

Some advantages of the "all in one flat VM" approach seem to be that we can construct Puppet files to build particular components and then compose the final machine configuration from a single Vagrant script that pulls in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs: there is an overhead associated with running each VM that needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch-it-off-and-switch-it-on-again opportunities: a machine can be shut down and restarted/reprovisioned, or if necessary deleted and reinstalled (though this latter loses any state the student saved inside the VM). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a Docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted we can just destroy its container and fire up a replacement (though we'd probably need to find ways of preserving – or deleting – any state associated with a particular application container too). If applications run as services – for example, a notebook server connected to a database server – then if I destroy the database server container I should be able to point the notebook server at the replacement… as long as I know the address of the new database server.
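To make that concrete, here's a minimal sketch of the destroy-and-replace pattern (with made-up container name and password – not the actual configuration discussed below):

#Destroy a (hypothetical) database container...
docker rm -f dbserver

#...and fire up a replacement; any application container that was linked to it
#would then need pointing at the new container's address
docker run --name dbserver -e POSTGRES_PASSWORD=mypassword -d postgres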

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

notebooks:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/notebooks:/notebooks"
data:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/data:/data"

devpostgresdata:
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data

devpostgres:
    environment:
        - POSTGRES_PASSWORD
    image: postgres
    ports:
        - "5433:5432"
    volumes_from:
        - devpostgresdata

notebook:
    environment:
        - PASSWORD
    image: calvingiles/data-science-environment
    links:
        - devpostgres:postgres
    ports:
        - "443:8888"
    volumes_from:
        - notebooks
        - data

(WordPress code plugin & editor sucking atm wrt the way it keeps trying to escape stuff…)

(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L https://gist.githubusercontent.com/calvingiles/b6123c301954fe68e29a/raw/data-science-environment-fig.yml > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers is fired up with the desired password settings:

fig rm
PASSWORD=MyPass POSTGRES_PASSWORD=PGPass fig up -d

(The script also creates Google Drive folders in which the notebooks are saved and shared between the VM containers and the host.)

The notebooks can then be accessed via a browser at https://IP.ADDRESS:443, where IP.ADDRESS can be found by running boot2docker ip; you need to log in with the specified password (MyPass in the example above). Note the https, which may require you to tell Google Chrome "yes, really load the page" – though it is possible to configure the server to use plain http.
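For example (a quick check – the IP address will differ on your machine):

#Find the IP address of the boot2docker VM...
boot2docker ip
#...then point a browser at https://THAT.IP.ADDRESS:443 and log in with the notebook password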

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn't see it on localhost). I found I needed to connect to the database container's actual IP address within the VM.

I found this address (IPADDRESS) from the Docker command line using fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS,port=5432,user='postgres',password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Via http://blog.michaelhamrah.com/2014/06/accessing-the-docker-host-server-within-a-container/
#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternatively, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering – how would any particular student know how to connect to "their" personal database server(s)? [UPDATE: doh! Container linking passes name information into a container as environment variables: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see the other containers in their own bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules, and the communication of where to find things, be managed? Or am I overcomplicating?!

Anyway – this is a candidate way of constructing a VM out of containers in an automated fashion. So what are the relative merits, pros, cons etc. of using the Vagrant/Puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (e.g. a trad-uni computer lab with VMs running on student desktops (if that's allowed!); a distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud, etc.)?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES', port='5432', user='postgres', password='PGPass')

Check the other environment variables with:

import os
os.environ
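The same linking information can also be checked from the command line (a quick sanity check; the variable names follow Docker's container linking convention for the postgres alias, so inspect the output rather than taking my word for the exact names):

#List the environment variables the notebook container sees for the linked postgres alias -
#expect things along the lines of POSTGRES_PORT_5432_TCP_ADDR and POSTGRES_PORT_5432_TCP_PORT
fig run notebook env | grep -i postgres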

Dockerising Open Data Databases – First Fumblings

A new year, and I’m trying to reclaim some joy, fun and tech mojo by getting back into the swing of appropriating stuff and seeing what we can do with it. My current universal screwdriver is Docker, so I spent an hour or two yesterday afternoon and a large chunk of today poking around various repos looking for interesting things to make use of.

The goal I set myself was to find a piece of the data wrangling jigsaw puzzle that would let me easily load a previously exported SQL database into MySQL running in a container – the idea being that we should be able to write a single command that points to a directory containing some exported SQL stuff and loads that data straight into a container in which you can start querying it. (There’s possibly (probably?) a well known way of doing this for the folk who know how to do it, but what do I know….?!;-)

The solution I settled on was a set of Python scripts published by Luis Elizondo / luiselizondo [about] that wrap some docker commands to produce a command line tool for firing up and populating MySQL instances, as well as connecting to them: docker-mysql-scripts.

The data I thought I’d play with is the 2013 UK MOT results database, partly because I found a database config file (that I think may originally have been put together by Ian Hopkinson). My own archive copy of the relevant scripts (including the SQL file to create the necessary MOT database tables and boot load the MOT data) can be found here. Download the scripts and you should be able to run them (perhaps after setting permissions to make them executable) as follows:

#Scripts: https://gist.github.com/psychemedia/86980a9e86a6de87a23b
#Create a container mot2013 with MySQL pwd: mot
#/host/path/to/ should contain:
## mot_boot.sql #Define the database tables and populate them
## test_result_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_result_2013.txt.gz 
## test_item_2013.txt  #download and uncompress: http://data.dft.gov.uk/anonymised-mot-test/12-03/test_item_2013.txt.gz 
## mdr_test_lookup_tables/ #download and unzip: http://data.dft.gov.uk/anonymised-mot-test/mdr_test_lookup_tables.zip

#Create a container 'mot2013' with MYSQL password: mot
dmysql-server mot2013 mot

#Create a database: motdata
dmysql-create-database mot2013 motdata

#Populate the database using the mot_boot.sql script
dmysql-import-database mot2013 /host/path/to/mot_boot.sql --database motdata

I can then log in to the database with the command dmysql mot2013, connect to the appropriate database from the MySQL client with the SQL command USE motdata; and then run queries on the 2013 MOT data.
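By way of a quick sanity check, a session looks something like this (the table names are defined in the mot_boot.sql script, so check there for the exact schema):

dmysql mot2013
mysql> USE motdata;
mysql> SHOW TABLES;
#e.g. count the 2013 test results (table name assumed from mot_boot.sql)
mysql> SELECT COUNT(*) FROM test_result;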

The scripts also allow the database contents to be managed via a separate data container:

#Create a new container called 'test' with an external data container
dmysql-server --with-volume test test
#This creates a second container - TEST_DATA - that manages the database files
#The name is given by: upper($CONTAINERNAME)+'_DATA'

#Create a dummy db
dmysql-create-database test testdb

#We can connect to the database as before
dmysql test
mysql> exit

#Delete this container, removing any attached volumes with the -v flag
docker rm -v -f test

#Create a new container connected to the original TEST_DATA container
docker run --volumes-from TEST_DATA --name test4 -e MYSQL_ROOT_PASSWORD="test" -d mysql

#Connect to this new container
dmysql test4
mysql> SHOW DATABASES;
#We should see the testdb database there...

#NOTE - I think that only one database server container can be connected to the TEST_DATA container at a time

So far, so confusing… Here’s a brief description of the current state of my understanding/confusion:

What really threw me is that the database container (either the original mot2013 container, or the test database with the external data container) doesn’t appear to store the data inside the container itself. (So for example, the TEST_DATA container does not contain the database.) Instead, the data appears to be contained in a “hidden” volume that is mounted outside the container. I came a cropper with this originally by deleting containers using commands of the form docker rm [CONTAINER_NAME] and then finding that the docker VM was running out of disk space: this deletes the container, but leaves a mounted volume (associated with the deleted container name) hanging around. To remove those volumes automatically, containers should be removed with commands of the form docker rm -v [CONTAINER_NAME]. What makes things difficult to tidy up is that the mounted volumes can’t be seen using the normal docker ps or docker ps -a commands; instead you need to install docker-volumes to identify them and delete them. (There’s a possible fix that shows how to store the data inside the container, rather than in an externally mounted volume – I think?! – linked to from this post, but I haven’t tried it because I was trying to use root Dockerfile images.)

The use of external data volumes also means we can’t easily bundle up a data container using docker commit and then hope to create new containers from it. Such a model would allow you to spend an hour or two loading a large-ish data set into a database, commit the data container to an image and push it to dockerhub; other users could then pull down that image, create a container from it and immediately attach it to a MySQL container, without having to go through the pain of building the database themselves. This would provide a nifty way of sharing ready-to-query datasets such as the MOT database: you could just pull a data image, mount it as the data volume/container, and get started with running queries.
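For the record, the sort of workflow I was hoping for would look something like this (a sketch only, with a made-up image name, and assuming a MOT2013_DATA data container created via the --with-volume flag – and, as noted above, it doesn’t actually work, because docker commit does not capture the contents of externally mounted volumes):

#Snapshot the data container as an image - but the externally mounted volume contents
#are NOT included in the committed image, which is where this plan falls down
docker commit MOT2013_DATA someuser/mot2013-data
docker push someuser/mot2013-data

#What another user would then (ideally) be able to do:
docker pull someuser/mot2013-data
docker run --name mot2013_data someuser/mot2013-data echo created
docker run --volumes-from mot2013_data --name mot2013db -e MYSQL_ROOT_PASSWORD=mot -d mysql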

On the other hand, it is possible to mount a volume inside a container by running the container with a -v flag and specifying the mount point (docker volumes). Luis Elizondo’s scripts allow you to set up these data volume containers by running dmysql-server with the --with-volume flag, as shown in the code snippet above, but at the moment I can’t see how to connect to a pre-existing data container? (See issue here.)
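For reference, this is the sort of thing the -v flag looks like when used by hand (a sketch, with a made-up host path – not something the dmysql-* scripts do for you):

#Mount a directory from the docker host into the container as the MySQL data directory
docker run -v /host/path/to/mysql-data:/var/lib/mysql --name mot2013b -e MYSQL_ROOT_PASSWORD=mot -d mysql

#Or create a data-only container holding an anonymous volume at the same mount point
docker run -v /var/lib/mysql --name MOT2013B_DATA busybox echo created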

So given that we can mount volumes inside a linked-to data container, does this mean we have a route to portable data containers that could be shared via a docker datahub? It seems not… because, as it was, I wasted quite a bit more time learning that docker data container volumes can’t be saved as images! (To make a data container portable, I think we’d need to commit it as an image, share the image, then create a container from that image? Or do I misunderstand this aspect of the docker workflow too…?!)

That said, there does look to be a workaround in the form of flocker. However, I’m fed up with this whole thing for a bit now… What I hoped would be a quick demo – get data into a Dockerised MySQL container; package the container and put the image on dockerhub; pull down the image, create a container and start using the data immediately – turned into a saga of realising quite how much I don’t understand about docker, what it does and how it does it.

I hate computers…. :-(

MOOC Reflections

A trackback a week or two ago to my blog from this personal blog post – #SNAc week 1: what are networks and what use is it to study them? – alerted me to a MOOC currently running on Coursera on social network analysis. The link was contextualised in the post as follows: “The recommended readings look interesting, but it’s the curse of the netbook again – there’s no way I’m going to read a 20 page PDF on a screen. Some highlighted resources from Twitter and the forum look a bit more possible: … Some nice ‘how to’ posts: …” (my linked-to post was in the ‘how to’ section).

The whole MOOC hype thing at the moment seems to be dominated by references to the likes of Coursera, Udacity and edX (“xMOOCs”). Coursera in particular is a new sort of intermediary: a website that offers some sort of applied marketing platform to universities, allowing them to publish sample courses in a centralised, browsable location and, in a strange sense, legitimising them. I suspect there is some element of Emperor’s New Clothes thinking going on in the universities who have opted in and those who may be considering it: “is this for real?”; “can we afford not to be a part of it?”

Whilst Coursera has an obvious possible business model – charge the universities for hosting their marketing material courses – Udacity’s model appears more pragmatic: provide courses with the option of formal assessment via Pearson VUE assessment centres, and then advertise your achievements to employers on the Udacity site; presumably, the potential employers and recruiters (which got me thinking about what role LinkedIn might possibly play in this space?) are seen as the initial revenue stream for Udacity. Note that Udacity’s “credit” awarding powers are informal – in the first instance, credibility is based on the reputation of the academics who put together the course; in contrast, for courses on Coursera, and the rival edX partnership (which also offers assessment through Pearson VUE assessment centres), credibility comes from the institution that is responsible for putting together the course. (It’s not hard to imagine a model where institutions might even badge courses that someone else has put together…)

Note that Coursera, Udacity and edX are all making an offering based on quite a traditional course model and are born out of particular subject disciplines. Contrast this, in the first instance, with something like Khan Academy, which provides learning opportunities at a finer level of granularity/much smaller “learning chunks” in the form of short video tutorials. Khan Academy also provides the opportunity for Q&A based discussion around each video resource.

Also by way of contrast are the “cMOOC” style offerings inspired by the likes of George Siemens, Stephen Downes, et al., where a looser curriculum based around a set of topics and initially suggested resources is used to bootstrap a set of loosely co-ordinated personal learning journeys: learners are encouraged to discover, share and create resources and feed them into the course network in a far more organic way than the didactic, rigidly structured approach taken by the xMOOC platforms. The cMOOC style also offers the possibility of breaking down subject disciplines by accepting shared resources contributed because they are relevant to the topic being explored, rather than because they are part of the canon for a particular discipline.

The course-without-boundaries approach of Jim Groom’s ds106, as recently aided and abetted by Alan Levine, also softens the edges of a traditionally offered course with its problem-based syllabus and open assignment bank (participants are encouraged to submit their own assignment ideas), and turns learning into something of a lifestyle choice… (Disclaimer: regular readers will know that I count the cMOOC/ds106 “renegades” as key forces in developing my own thinking…;-)

Something worth considering about the evolution of open education – from early open content/open educational resource (OER) repositories and courseware into the “Massive Open Online Course” thing – is just what caused the recent upsurge in interest. Both MIT OpenCourseWare and the OU’s OpenLearn offerings provided “anytime start”, self-directed course units; but my recollection is that it was Thrun & Norvig’s first open course on AI (before Thrun launched Udacity) that captured the popular (i.e. media) imagination because of the huge number of students that enrolled. Rather than the on-demand offering of OpenLearn, it seems that the broadcast model and linear course schedule, along with the cachet of the instructors, were what appealed to a large population of demonstrably self-directed learners (i.e. geeks and programmers, who spend their time learning how to weave machines from ideas).

I also wonder whether the engagement of universities with intermediary online course delivery platforms will legitimise online courses run by other organisations; for example, the Knight Center Massive Open Online Courses portal (a Moodle environment) is currently advertising its first MOOC on infographics and data visualisation:

Similar to other Knight Center online courses, this MOOC is divided into weekly modules. But unlike regular offerings, there will be no application or selection process. Anyone can sign up online and, once registered, participants will receive instructions on how to enroll in the course. Enrollees will have immediate access to the syllabus and introductory information.

The course will include video lectures, tutorials, readings, exercises and quizzes. Forums will be available for discussion topics related to each module. Because of the “massive” aspect of the course, participants will be encouraged to provide feedback on classmates’ exercises while the instructor will provide general responses based on chosen exercises from a student or group of students.

Cairo will focus on how to work with graphics to communicate and analyze data. Previous experience in information graphics and visualization is not needed to take this course. With the readings, video lectures and tutorials available, participants will acquire enough skills to start producing compelling, simple infographics almost immediately. Participants can expect to spend 4-6 hours per week on the course.

Although the course will be free, if participants need to receive a certificate, there will be a $20 administrative fee, paid online via credit card, for those who meet the certificate requirements. The certificate will be issued only to students who actively participated in the course and who complied with most of the course requirements, such as quizzes and exercises. The certificates will be sent via email as a PDF document. No formal course credit of any kind is associated with the certificate.

Another of the things I’ve been pondering is the role that “content” may or may not play in this open course thing. Certainly, where participants are encouraged to discover and share resources, or where instructors seek to construct courses around “found resources” – an approach espoused by the OU’s new postgraduate strategy – it seems to me that there is an opportunity to contribute to the wider open learning idea by producing resources that can be “found”. For resources to be available as found resources, we need the following:

  1. Somebody needs to have already created them…
  2. They need to be discoverable by whoever is doing the finding
  3. They need to be appropriately licensed (if we have to go through a painful rights clearance and rights payment model, the cost benefits of drawing on and freely reusing those resources are severely curtailed).

Whilst the running of a one-shot MOOC may attract however many participants, the production of finer-grained (and branded) resources that can be used within those courses means that a provider can repeatedly, and effortlessly, contribute to other people’s courses through course participants pulling the resources into those course contexts. (It also strikes me that educators in one institution could sign up for a course offered by another, and then drop in links to their own applied marketing learning materials.)

One thing I’ve realised from looking at the Digital Worlds uncourse blog stats is that some of the posts attract consistent levels of traffic, possibly because they have been linked to from other course syllabuses. I also occasionally see flurries of downloads of tutorial files, which makes me wonder whether another course has linked to resources I originally produced. If we think of the web as having dynamic and static modes – static being the background links that are part of the long-term fabric of the web; dynamic being the conversation and link sharing that goes on in social networks, as well as the publication of “alerts” about new fabric (for example, the publication of a new blog post into the static fabric of the web is announced through RSS feeds and social sharing as part of the dynamic conversation) – then the MOOCs appear to be trying to run in a dynamic, broadcast mode. Whereas what interests me is how we can contribute to the static structure of the web, and how we can make better use of it in a learning context?

PS A final thought – running scheduled MOOCs is like a primetime broadcast; anytime independent start is like on-demand video. Or how about this: MOOCs are like blockbuster books, published to great fanfare and selling millions of first-day, pre-ordered copies. But there’s also long-tail consumption of the same books over time… and maybe also books that sell steadily over time without great fanfare. Running a course once is all well and good; but it feels too ephemeral, and too much like linear rather than networked thinking, to me.