Tagged: TM351

Rethinking the TM351 Virtual Machine Again, Again…

It’s getting to that time when we need to freeze the virtual machine build we’re going to use for the new (postponed) data course, which should hopefully go live to students in February, 2016, and I’ve been having a rethink about how to put it together.

The story so far has been documented in several blog posts and charts my learning journey from knowing nothing about virtual machines (not sure why I was given the task of putting it together?!) to knowing how little I know about building Linux administration, PostgreSQL, MongoDB, Linux networking, virtual machines and virtualisation (which is to say, knowing I don’t know enough to do any of this stuff properly…;-)

The original plan was to put everything into a single VM and wire all the bits together. One of the activities needed to fire up several containers as part of a mongo replica set, and I opted to use containers to do that.

Over the last few months, I started to wonder whether we should containerise everything separately, then deploy compositions of containers. The rationale behind this approach is that it means we could make use of a single VM to host applications for several users if we get as far as cloud hosting services/applications for out students. It also means students can start, stop or “reinstall” particular applications in isolation from the other VM applications they’re running.

I think I’ve got this working in part now, though it’s still very much tied to the single user – I’m doing things with permissions that would never be allowed (and that would possibly break things..) if we were running multiple users in the same VM.

So what’s the solution? I posted the first hints in Kiteflying Around Containers – A Better Alternative to Course VMs? where I proved to myself I could fire up an IPthyon notebook server on top of scientific distribution stack, and get the notebooks talking to a DBMS running in another container. (This was point and click easy, once you know what to click and what numbers to put where.)

The next step was to see if I could automate this in some way. As Kitematic is still short of a Windows client, and doesn’t (yet?) support Docker Compose, I thought I’d stick with vagrant (which I was using to build the original VM using a Puppet provision and puppet scripts for each app) and see if I could get it provision a VM to run containerised apps using docker. There are still a few bits to do – most notably trying to get the original dockerised mongodb stuff working, checking the mongo link works, working out where to try to persist the DBMS data files (possibly in a shared folder on host?) in a way that doesn’t trash them each time a DBMS container is started, and probably a load of other stuff – but the initial baby steps seem promising…

In the original VM, I wanted to expose a terminal through the browser, which meant pfaffing around with tty.js and node.js. The latest Jupyter server includes the ability to launch a browser based shell client, which meant I could get rid of tty.js. However, moving the IPython notebook into a container means that the terminal presumably has scope only within that container, rather than having access to the base VM command line? For various reasons, I intend to run the IPython/Jupyter notebook server container as a privileged container, which means it can reach outside the container (I think? The reason? eg to fire up containers for the mongo replica set activity) but I’m not sure if this applies to the command line/terminal app too? Though offhand, I can’t think why we might want to provide students with access to the base VM command line?

Anyway, the local set-up looks like this…

A simple Vagrantfile, called using vagrant up or vagrant reload. I have extended vagrant using the vagrant-docker-compose plugin that supports Docker Compose (fig, as was) and lets me fired up wired-together container configurations from a single script:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.network(:forwarded_port, guest: 9000, host: 9000)
  config.vm.network(:forwarded_port, guest: 8888, host: 8351,auto_correct: true)

  config.vm.provision :docker
  config.vm.provision :docker_compose, yml: "/vagrant/docker-compose.yml", rebuild: true, run: "always"

The YAML file identifies the containers I want to run and the composition rules between them:

  image: dockerui/dockerui
    - "9000:9000"
    - /var/run/docker.sock:/var/run/docker.sock
  privileged: true

  build: ./tm351_scipystacknserver
    - "8888:8888"
    - ./notebooks/:/notebooks/
    - devpostgres:postgres
  privileged: true
    command: echo created
    image: busybox
        - /var/lib/postgresql/data
        - POSTGRES_PASSWORD=whatever
    image: postgres
        - "5432:5432"
        - devpostgresdata

At the moment, Mongo is still missing and I haven’t properly worked out what to do with the PostgreSQL datastore – the idea is that students will be given a pre-populated, pre-indexed database, in part at least.

One additional component that sort of replaces the command line/terminal app requirement from the original VM is the dockerui app. This runs in its own container with privileged access to the docker environment and that provides a simple control panel over all the containers:


What else? The notebook stuff has a shared notebooks directory with host, and is built locally (from a Dockerfile in the local tm351_scipystacknserver directory) on top of the ipython/scipystack image; extensions include some additional package installations (requiring both apt-get and pip installs) and copying across and running a custom IPython notebook template configuration.

FROM ipython/scipystack


ADD build_tm351_stack.sh /tmp/build_tm351_stack.sh
RUN bash /tmp/build_tm351_stack.sh

ADD ipynb_style /tmp/ipynb_style
ADD ipynb_custom.sh /tmp/ipynb_custom.sh
RUN bash /tmp/ipynb_custom.sh

## Extremely basic test of install
RUN python2 -c "import psycopg2, sqlalchemy"
RUN python3 -c "import psycopg2, sqlalchemy"

# Clean up from build
RUN rm -f /tmp/build_tm351_stack.sh
RUN rm -f /tmp/ipynb_custom.sh
RUN rm -f -r /tmp/ipynb_style

VOLUME /notebooks
WORKDIR /notebooks


ADD notebook.sh /
RUN chmod u+x /notebook.sh

CMD ["/notebook.sh"]


If we need to extend the PostgreSQL build, that can be presumably done using a Dockerfile that pulls in the core image and then runs an additional configuration script over it?

So where am I at? No f****g idea. I thought that between the data course and the new web apps course we might be able to explore some interesting models of using virtual machines (originally) and containers (more recently) in a distance education setting, that could cope with single user home use, computer training room/lab use, cloud use, but, as ever, I have spectacularly failed to demonstrate any sort of “academic leadership” in developing these ideas within the OU, or even getting much of a conversation going in the first place. Not in my skill set, I guess!;-) Though perhaps not in the institution’s interests either. Recamp. Retrench. Lockdown. As per some of the sentiments in Reflections on the Closure of Yahoo Pipes, perhaps? Don’t Play Here.

Recreating a Node.js Installation – Package Versions

Rebuilding a fresh version of the TM351 VM from scratch yesterday, I got an error trying to install tty.js, a node.js app that provides a “terminal desktop in the browser”.


Looking into a copy of the VM where tty.js does work, I could discover the version of node I’d previously successfully used, as well as check all the installed package versions:

### Show nodejs version and packages
> node -v

> npm list -g
├─┬ npm@1.4.28
│ ├── abbrev@1.0.5
│ ├── ansi@0.3.0
│ ├── ansicolors@0.3.2
│ ├── ansistyles@0.1.3
│ ├── archy@0.0.2
│ ├── block-stream@0.0.7
│ └── which@1.0.5
└─┬ tty.js@0.2.13
  ├─┬ express@3.1.0
  │ ├── buffer-crc32@0.1.1

Using this information, I could then use nvm, a node.js version manager, installed via:

curl https://raw.githubusercontent.com/creationix/nvm/v0.23.3/install.sh | NVM_DIR=/usr/local/lib/ bash

to install, from a new shell, the version I knew worked:
nvm install 0.10.35
npm install tty.js

(I should probably add the tty.js version in there too? npm install tty.js@0.2.13 perhaps? )

The terminal can then be run as a demon from:

/usr/local/lib/node_modules/tty.js/bin/tty.js --port 3000 --daemonize

What this got me wondering was: are there any utilities that let you capture a nodejs configuration, for example, and the recreate it in a new machine. That is, export the node version number and versions of the installed packages, then create an installation script that will recreate that setup?

It would be handy if this approach could be extended further. For example, we can also look at the packages – and their version numbers – installed on the Linux box using:

### Show packages
dpkg -l

And we can get a list of Python packages – and their version numbers – using:

### Show Python packages
pip3 list

Surely there must be some simple tools/utilities out that support this sort of thing? Or even just cheatsheets that show you what commands to run to export the packages and versions into a file in a format that allows you to use that file as part of an installation script in a new machine to help rebuild the original one?

Notes on Data Quality…

You know how it goes – you start trying to track down a forward “forthcoming” reference and you end up wending your way through all manner of things until you get back to where you started none the wiser… So here’s a snapshot of several docs I found trying to source the original forward reference for following table, found in Improving information to support decision making: standards for better quality data (November 2007, first published October 2007) with the crib that several references to it mentioned the Audit Commission…


The first thing I came across was The Use of Information in Decision Making – Literature Review for the Audit Commission (2008), prepared by Dr Mike Kennerley and Dr Steve Mason, Centre for Business Performance Cranfield School of Management, but that wasn’t it… This document does mention a set of activities associated with the data-to-decision process: Which Data, Data Collection, Data Analysis, Data Interpretation, Communication, Decision making/planning.

The data and information definitions from the table do appear in a footnote – without reference – in Nothing but the truth? A discussion paper from the Audit Commission in Nov 2009, but that’s even later… The document does, however, identify several characteristics (cited from an earlier 2007 report (Improving Information, mentioned below…), and endorsed at the time by Audit Scotland, Northern Ireland Audit Office, Wales Audit Office and CIPFA, with the strong support of the National Audit Office), that contribute to a notion of “good quality” data:

Good quality data is accurate, valid, reliable, timely, relevant and complete. Based on existing guidance and good practice, these are the dimensions reflected in the voluntary data standards produced by the Audit Commission and the other UK audit agencies
* „Accuracy – data should be sufficiently accurate for the intended purposes.
* Validity – data should be recorded and used in compliance with relevant requirements.
„* Reliability – data should reflect stable and consistent data collection processes across collection points and over time.
„* Timeliness – data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period.
„* Relevance – data captured should be relevant to the purposes for which it is used.
„* Completeness – data requirements should be clearly specified based on the information needs of the body and data collection processes matched to these requirements.

The document also has some pretty pictures, such as this one of the data chain:


In the context of the data/information/knowledge definitions, the Audit Commission discussion document also references the 2008 HMG strategy document Information matters: building government’s capability in managing knowledge and information which includes the table in full; a citation link is provided, but 404s, but a source is given to the November 2008 version of Improving information, the one we originally started with. So the original reference forward refers the table to an unspecified report, but future reports in the area refer back to that “original” without making a claim to the actual table itself?

Just in passing, whilst searching for the Improving information report, I actually found another version of it… Improving information to support decision making: standards for better quality data, Audit Commission, first published March 2007.


The table and the definitions as cited in Information Matters do not seem to appear in this earlier version of the document?

PS Other tables do appear in both versions of the report. For example, both the March 2007 and November 2007 versions of the doc contain this table (here, taken from the 2008 doc) of stakeholders:


Anyway, aside from all that, several more documents for my reading list pile…

PS see also Audit Commission – “In the Know” from February 2008.

Using Docker to Build Linked Container Course VMs

Having spent bits of last year tinkering with vagrant and puppet as part of a workflow for building and deploying course related VMs in a scaleable way for a distance education context (trying to be OUseful here…) I’ve more recently started pondering whether it makes more sense to create virtual machines from linked data containers.

Some advantages of the “all in one flat VM” approach seem to be that we can construct puppet files to build particular components and then compose the final machine configuration from a single Vagrant script pulling in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs. There is an overhead associated with running a VM which needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch it off and switch it on again opportunities: a machine can be shutdown and restarted/reprovisioned, or if necessary can be deleted and reinstalled though this latter loses any state that was saved internally in the VM by the student). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted, we can just destroy the container and fire up a replacement (though we’d probably need to find ways of preserving – or deleting – state associated with a particular application container too). If applications run as services, and for example I have a notebook server connected to a database server, if I destroy the database server container, I should be able to point the notebook server to the new database server – if I know the address of the new database server…

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

    command: echo created
    image: busybox
        - "~/Google Drive/notebooks:/notebooks"
    command: echo created
    image: busybox
        - "~/Google Drive/data:/data"

    command: echo created
    image: busybox
        - /var/lib/postgresql/data

    image: postgres
        - "5433:5432"
        - devpostgresdata

        - PASSWORD
    image: calvingiles/data-science-environment
        - devpostgres:postgres
        - "443:8888"
        - notebooks
        - data

(WordPress code plugin & editor sucking atm wrt the way it keeps trying to escape stuff…)

(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L https://gist.githubusercontent.com/calvingiles/b6123c301954fe68e29a/raw/data-science-environment-fig.yml > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers fired up with the desired password settings:

fig rm

(The script also creates Google Drive folders into which copies of the notebooks will be located and shared between the VM containers and the host.)

The notebooks can then be accessed via browser, (you need to log in with the specified password – MyPass from the example above); the location of the notebooks is https//IP.ADDRESS:443 (note the https, which may require you saying “yes, really load the page” to Google Chrome – though it is possible to configure the server to use just http) where IP.ADDRESS can be found by running boot2docker ip.

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn’t see it on localhost). I found I needed to connect to the actual IP address within the VM of the database container.

I found this address (IPADDRESS) from the docker commandline using: fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS,port=5432,user='postgres',password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Via http://blog.michaelhamrah.com/2014/06/accessing-the-docker-host-server-within-a-container/
#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternativley, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering – how would any particular student know how to connect to “their” personal database server(s). [UPDATE: doh! Container linking passes name information into a container as an environmental variable: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see other containers from their bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules and communication of where to find things be managed? Or am I overcomplicating?!

Anyway – this is a candidate way for constructing a VM out of containers in an automated way. So what are the relative merits, pros, cons etc of using the vagrant/puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (eg trad-uni computer lab with VMs running in student desktops (if that’s allowed!); distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud etc?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES',port='5432',user='postgres', password="PGPass")

Check other environments with:

import os

Edtech and IPython Notebooks – Activities and Answer Reveals

A few months ago I posted about an interaction style that I’d been using – and that stopped working – in IPython notebooks: answer reveals.

An issue I raised on the relevant git account account turned up a solution that I’ve finally gotten round to trying out – and extending with a little bit of styling. I’ve also reused chunks from another extension (read only code cells) to help style other sorts of cell.

Before showing where I’m at with the notebooks, here’s where OU online course materials are at the moment.

Teaching text is delivered via the VLE (have a look on OpenLearn for lots of examples). Activities are distinguished from “reading” text by use of a coloured background.


The activity requires a student to do something, and then a hidden discussion or anser can be revealed so that the student can check their answer.


(It’s easy to be lazy, of course, and just click the button without really participating in the activity. In print materials, a frictional overhead was added by having answers in the back of the study guide that you would have to turn to. I often wonder whether we need a bit more friction in the browser based material, perhaps a time based one where the button can’t be clicked for x seconds after the button is first seen in the viewport (eg triggered using a test like this jQuery isOnScreen plugin)?!)

I drew on this design style to support the following UI in an IPython notebook:

Here’s a markdown cell in activity style and activity answer style.


The lighter blue background set into the activity is an invitation for students to type something into those cells. The code cell is identified as such by the code line In [ ] label. Whatever students type into those cells can be executed.

The heading is styled from a div element:


If we wanted to a slightly different header background style as in the browser materials, we could perhaps select a notebook heading style and then colour the background differently right across the width of the page. (Bah.. should have thought of that earlier!;-)

Markdown cells can also be styled to prompt students to make a text response (i.e. a response written in markdown, though I guess we could also use raw text cells). I don’t yet have a feeling for how much ownership students will take of notebooks and start to treat them as workbooks?

Answer reveal buttons can also be added in:


Clicking on the Answer button displays the answer.


At the moment, the answer background is the same colour as the invitation for student’s to type something, although I guess we could also interpret as part of the tutor-alongside dialogue, and the lighter signifies expected dialogic responses whether from the student or the “tutor” (i.e. the notebook author).

We might also want to make use of answer buttons after a code completion activity. I haven’t figured out the best way to do this as of yet.


At the moment, the answer button only reveals text – and even then the text needs to be styled as HTML (the markdown parsing doesn’t work:-(


I guess one approach might be to spawn a new code cell containing some code written in to the answer button div. Another might be to populate a code cell following the answer button with the answer code, hiding the cell and disabling it (so it can’t be executed/run), then revealing it when the answer button is clicked? I’m also not sure whether answer code should be runnable or not?

The mechanic for setting the cell state is currently a little clunky. There are two extensions, one for the answer button, one for setting the state other cells, that use different techniques for addressing the cells (and that really need to be rationalised). The extensions run styling automatically when the notebook is started, or when triggered. At the moment, I’m using icons from the orgininal code I “borrowed” – which aren’t ideal!


The cell state setter button toggles selected code cells from activity to not-activity states, and markdown cells from activity-description to activity-student-answer to not-activity. The answer button button adds an answer button at every answer div (even if there’s already an answer button rendered). Both extensions style/annotate restarted notebooks correctly.

The current, hacky, user model I have in mind is that authors have an extended notebook with buttons to set the cell styles, and students have an extended notebook without buttons that just styles the notebook when it’s opened.

FWIW, here’s the gist containing extensions code.

Comments/thoughts appreciated…

Assessing Data Wrangling Skills – Conversations With Data

By chance, a couple of days ago I stumbled across a spreadsheet summarising awarded PFI contracts as of 2013 (private finance initiative projects, 2013 summary data).

The spreadsheet has 800 or so rows, and a multitude of columns, some of which are essentially grouped (although multi-level, hierarchical headings are not explicitly used) – columns relating to (estimated) spend in different tax years, for example, or the equity partners involved in a particular project.

As part of the OU course we’re currently developing on “data”, we’re exploring an assessment framework based on students running small data investigations, documented using IPython notebooks, in which we will expect students to demonstrate a range of technical and analytical skills, as well as some understanding of data structures and shapes, data management technology and policy choices, and so on. In support of this, I’ve been trying to put together one or two notebooks a week over the course of a few hours to see what sorts of thing might be possible, tractable and appropriate for inclusion in such a data investigation report within the time constraints allowed.

To this end, the Quick Look at UK PFI Contracts Data notebook demonstrates some of the ways in which we might learn something about the data contained within the PFI summary data spreadsheet. The first thing to note is that it’s not really a data investigation: there is no specific question I set out to ask, and no specific theme I set out to explore. It’s more exploratory (that is, rambling!) than that. It has more of the form what I’ve started referring to as a conversation with data.

In a conversation with data, we develop an understanding of what the data may have to say, along with a set of tools (that is, recipes, or even functions) that allow us to talk to it more easily. These tools and recipes are reusable within the context of the data set or datasets selected for the conversation, and may be transferrable to other data conversations. For example, one recipe might be how to filter and summarise a dataset in a particular way, or generate a particular view or reshaping of it that makes it easy to ask a particular sort of question, or generate a particular sort of a chart (creating a chart is like asking a question where the surface answer is returned in a graphical form).

If we are going to use IPython notebook documented conversations with data as part of an assessment process, we need to tighten up a little more how we might expect to see them structured and what we might expect to see contained within them.

Quickly skimming through the PFI conversation, elements of the following skills are demonstrated:

– downloading a dataset from a remote location;
– loading it into an appropriate data representation;
– cleaning (and type casting) the data;
– spotting possibly erroneous data;
– subsetting/filtering the data by row and column, including the use of partial string matching;
– generating multiple views over the data;
– reshaping the data;
– using grouping as the basis for the generation of summary data;
– sorting the data;
– joining different data views;
– generating graphical charts from derived data using “native” matplotlib charts and a separate charting library (ggplot);
– generating text based interpretations of the data (“data2text” or “data textualisation”).

The notebook also demonstrates some elements of reflection about how the data might be used. For example, it might be used in association with other datasets to help people keep track of the performance of facilities funded under PFI contracts.

Note: the notebook I link to above does not include any database management system elements. As such, it represents elements that might be appropriate for inclusion in a report from the first third of our data course – which covers basic principles of data wrangling using pandas as the dominant toolkit. Conversations held later in the course should also demonstrate how to get data into an out of an appropriately chosen database management system, for example.

Anscombe’s Quartet – IPython Notebook

Anyone who’s seen one of my talks that even touches on data and visualisation will probably know how it like to use Anscombe’s Quartet as a demonstration of why it makes sense to look at data, as well as to illustrate the notion of a macroscope, albeit one applied to a case of N=all where all is small…

Some time ago I posted a small R demo – The Visual Difference – R and Anscombe’s Quartet. For the new OU course I’m working on (TM351 – “The Data Course”), our focus is on using IPython Notebooks. And as there’s a chunk in the course about dataviz, I feel more or less obliged to bring Anscombe’s Quartet in:-)

As we’re still finding our way about how to make use of IPython Notebooks as part of an online distance education course, I’m keen to collect feedback on some of the ways we’re considering using the notebooks.

The Anscombe’s Quartet notebook has quite a simple design – we’re essentially just using the cells as computed reveals – but I’m still be keen to hear any comments about how well folk think it might work as a piece of standalone teaching material, particularly in a distance education setting.

The notebook itself is on github (ou-tm351), along with sample data, and a preview of the unexecuted notebook can be viewed on nbviewer: Anscombe’s Quartet – IPython Notebook.

Just by the by, the notebook also demonstrates the use of pandas for reshaping the dataset (as well as linking out to a demonstration of how to reshape the data using OpenRefine) and the ŷhat ggplot python library (docs, code) for visualising the dataset.

Please feel free to post comments here or as issues on the github repo.