Getting Web Services Up and Running on MicroSoft Azure Using Vagrant and the Azure CLI

As well as Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI, we can also use Vagrant to provision machines on other web hosts, such as the Microsoft Azure cloud paltform. In this post, I’ll describe a command line based recipe for doing just that.

To start with, you’ll need to get a Microsoft Azure account.

When you’ve done that, install the Azure command line interface (CLI). On a Mac:

curl -L https://aka.ms/InstallAzureCli | bash

For me, this installed to ~/bin/az.

With the client installed, login:

~/bin/az login

This requires a token based handshake with a Microsoft authentication website.

List the range of machine images available (if you haven’t set the path to az, use the full ~/bin/az):

az vm image list

There was only one that looked suitable to me for my purposes: Canonical:UbuntuServer:16.04-LTS:latest.

To run the provisioner, we need a Subscription ID; this will be used to set the vagrant .subscription_id parameter. These are listed on the Azure Portal.

We also need to create an Active Directory Service Principal:

az ad sp create-for-rbac

This information will be used to configure the Vagrantfile: the appId sets the vagrant .client_id, the password the .client_secret, and the tenant the .tenant_id.

You can also inspect the application in the App Registrations area of the Azure Portal.

Now let’s set up Vagrant. We’ll use the vagrant-azure plugin:

vagrant plugin install vagrant-azure --plugin-version '2.0.0.pre6'

We need to add a dummy box:

vagrant box add azure https://github.com/azure/vagrant-azure/raw/v2.0/dummy.box

Now let’s set up the Vagrantfile:

config.vm.provider :azure do |azure, override|
    #The path to your ssh keys
    override.ssh.private_key_path = '~/.ssh/id_rsa'

    #The default box we added
    override.vm.box = 'azure'
    
    #Set a territory
    azure.location="uksouth"

    #Provide your own group and VM name
    azure.resource_group_name="tm351azuretest"
    azure.vm_name="tm351azurevmtest"

    # Set an appropriate image (the UbuntuServer is actually the current default value)
    azure.vm_image_urn="Canonical:UbuntuServer:16.04-LTS:latest"

    #Use a valid subscription ID
    #https://portal.azure.com/#blade/HubsExtension/MyAccessBlade/resourceId/
    azure.subscription_id = ENV['AZURE_SUBSCRIPTION_ID']

    # Using details from the Active Directory Service Principal setup
    azure.tenant_id = ENV['AZURE_TENANT_ID']
    azure.client_id = ENV['AZURE_CLIENT_ID']
    azure.client_secret = ENV['AZURE_CLIENT_SECRET']

end

With the Vagrantfile parameters in place, we should then be able call the Azure provider using the command:

vagrant up --provider=azure

But we’re still not quite done… If you’re running services on the VM, populated from elsewhere in the Vagranfile, you’ll need to add some security rules to make the ports accessible. I’m running services on ports 80,35180 and 35181 for example:

az network nsg rule create -g tm351azuretest --nsg-name tm351azurevmtest-vagrantNSG --priority 130 --direction Inbound --destination-port-ranges 8888 --access Allow --name port_8888
az network nsg rule create -g tm351azuretest --nsg-name tm351azurevmtest-vagrantNSG --priority 130 --direction Inbound --destination-port-ranges 35181 --access Allow --name port_35181

Now we can lookup the IP address of the server:

az vm list-ip-addresses

and see if our applications are there :-)

Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI

From past experience of trying to get things up and running with AWS (Amazon Web Services), it can be a bit of a faff trying to work out what to set where the first time. So here’s an example of how to get a browser based application up and running on EC2 using vagrant from the command line.

(If you want to work through sorting the settings out via the AWS online management console, try Oliver Veits’ tutorial AWS Automation based on Vagrant — Part 2: Installation and Usage of the Vagrant AWS Plugin; you might also need to refer to Part 1: Getting started with AWS.)

This post in part assumes you know how to provision your own virtual machine locally using Vagrant. Here are the steps you need to take to be able to run an AWS provisioner (on a Mac or Linux machine… not sure about Windows?).

First up – sign up for AWS (get credit via the Github Education Pack)…

Pick up some credentials via AWS root Security Credentials (Access Keys (Access Key ID and Secret Access Key)):

Ensure that the key is active (Make Active).

There’s quite a bit of set up to do to configure the provisioner script. This can be done on the command line using the Amazon Command Line Interface (AWS CLI):

pip install --upgrade --user awscli

Now you need to configure the AWS CLI:

aws configure

Use the security credentials you picked up to configure the client*.

When we launch the AWS machine, vagrant needs to be able to access it via ssh using the public IP address automatically assigned to the machine. In deployment too, if we’re building specific services we want to be able to access over the web, we need to open up access to the ports those services are listening on.

By default, the machine will be locked down, so we need to open up specific ports by setting security rules. These are assigned on the basis of a security group. So lets create one of those (mine is named after the course VM I’m building…):

aws ec2 create-security-group --group-name tm351cloud --description "Security group for tm351 services"

We’re going to use this group in the .security_groups parameter in the Vagrantfile.

Now we need to create the security group rules. In my case, I want to open up ssh (port 22) to allow incoming traffic from my IP address, and ports 80, 35180 and 35181 to allow http traffic from anywhere. (The /0 suffix in the rules allows any IP format.)

MYIP=$(curl http://checkip.amazonaws.com/)
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 22 --cidr ${MYIP}/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35180 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35181 --cidr 0.0.0.0/0

# Check the policies
aws ec2 describe-security-groups --group-names tm351cloud

Having opened up at least the ssh port 22, we need to set up some SSH keys with a particular name (vagrantaws) that we will use with the vagrant .keypair_name parameter, and save them to a local file (vagrantaws.pem) with the appropriate permissions.

aws ec2 create-key-pair --key-name vagrantaws --query 'KeyMaterial' --output text > vagrantaws.pem
chmod 400 vagrantaws.pem

The vagrant provisioner also requires specific access tokens (.access_key_id, .secret_access_key, .session_token) to access EC2. Create these tokens, entering your own duration (in seconds):

aws sts get-session-token --duration-seconds 129600

Now we can start to look at the Vagrant set up. Install the vagrant AWS provisioner:

vagrant plugin install vagrant-aws

After setting up the Vagrantfile, you will be able to provision your machine on AWS using:

vagrant up --provider=aws

Add a dummy box:

vagrant box add awsdummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box

Now let’s look at the Vagrantfile:

#Set up the provider block
config.vm.provider :aws do |aws, override|

    #Use the ec2 security group set previously
    aws.security_groups = ["tm351cloud"]

    #Whatever name we want
    override.vm.hostname = "tm351aws"

    #The name of the dummy box we added
    override.vm.box = "awsdummy"
    
    #Set up machine access using our keypair name and ssh key path
    override.ssh.username = "ubuntu"
    aws.keypair_name="vagrantaws"
    override.ssh.private_key_path = "vagrantaws.pem"

    #Use the values generated by the session token generator
    aws.access_key_id = "YOUR_KEY_ID"
    aws.secret_access_key = "YOUR_SECRET_ACCESS_KEY"
    aws.session_token = "YOUR_SESSION_TOKEN"

    #Specify a region and valid ami for that region, along with the desired instance size
    aws.region = "eu-west-1"
    aws.ami = "ami-971238f1" 
    aws.instance_type="t2.small"

  end

Running vagrant up --provider=aws should run the Vagrant provisioner with the AWS provider. Running vagrant destroy will tear down the machine (so you don’t keep paying for it… I think the users, security groups and keypairs are free?)

To check on the IP address of your instance, run:

aws ec2 describe-instances

or check on the AWS EC2 console. You can also check the machine is ripped down correctly when you have finished with it from there.

(I need to check what happens if you vagrant suspend and then vagrant resume. Presumably, the state is preserved, but you are billed for storage, if not running time?)


*Alternatively, we could create a specific user with more limited credentials.

Create a user we can use to help set up the credentials to use with the vagrant provisioner:

aws iam create-user --user-name vagrant

Now we need to give that user permissions to build our EC2 instance, by attaching an appropriate security policy (AmazonEC2FullAccess). In other words, the Vagrantfile will make use of the AWS vagrant user to provision the machine, so we need to give that AWS user the appropraite permissions on AWS:

aws iam attach-user-policy --user-name vagrant --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

Generate some keys:

aws iam create-access-key --user-name vagrant

Then run aws configure with the new keys.


Course Apps in the the Cloud – Experimenting With Open Refine on Digital Ocean, Linode and AWS / Amazon EC2 Web Services

With OUr data management and analysis course coming up to its third presentation start in October, various revisions and updates are currently being made to the materials, in part based on feedback from students, in part based the module team’s reflections on how the course material is performing.

We also have an opportunity to update the virtual machine supplied to students, so I’ve spent the last couple of days poking around in the various script rewrites I’ve toyed with over the last couple of years. When we started the course, Jupyter notebooks were still called IPython notebooks, and the ecosystem was still in its infancy. But whilst the module review process means changes are supposed to be kept to a minimum, there is still an opportunity to bake a few more tools into the VM that didn’t exist a couple of years ago when the VM was first gold mastered. (I’ll do a review of some of the Jupyter notebook features that I think should be released into the VM in another post.)

When the VM was first put together, I took it as an opportunity to explore automated build processes. The VM itself was built from Puppet scripts orchestrated from Vagrant, with another Vagrant script managing the machine we delivered to students (setting up shared folders, handling port forwarding, and giving the internal services a kick if required). I also explored a dockerised version, but Docker too was still in its infancy when we first looked at how to best virtualise the services and apps distributed as part of the course materials (IPython/Jupyter notebooks, PostgreSQL, MongoDB and OpenRefine). With Docker now having native versions for recent Macs and Windows platforms, I thought it might be worth exploring again; but OUr student computing policy means we have to build to lowest common denominator machines that are years old (though I’m ignoring the 32 bit hardware platform constraint and we’ll post an online workaround – or ship a Raspberry Pi version of the VM – if we have to!).

So… to demo where I’m at in terms of process, and keep a note to myself, the build has forsaken Puppet and I’ve gone back to simple shell scripts. As an example of most of the tricks I’ve had to invoke, I’ll post recipes for getting OpenRefine up and running on several virtual hosts in several different ways. Still to do is a dockerised version and and RPi version of the TM351 VM config, but I’m hoping the shell scripts will all be reusable (and if not, I’ll try to tweak them so they work as is as part of whatever build process is required…

To begin with, the builder shell scripts are as follows (.sh files all end up requiring execute permissions granted somehow…).

Structure is:

./quickbuild/quick_build.sh
./quickbuild/basepackages.sh
./quickbuild/openrefine/openrefine.sh
./quickbuild/openrefine/services/refine.service

The main build script calls a script to add in base packages, and scripts for each application (in their own folder). I really should have had the same invocation filename or filename pattern (e.g. reusing the directory name) in each build folder.

## ./quickbuild/quick_build.sh
#chmod ugo+x on this file

#!/usr/bin/env bash
#Set the base build directory to the one containing this script
THISDIR=$(dirname "$0")

chmod ugo+x $THISDIR/basepackages.sh
chmod ugo+x $THISDIR/openrefine/openrefine.sh

#Build script for building machine
$THISDIR/basepackages.sh

$THISDIR/openrefine/openrefine.sh

#tidy up
apt-get autoremove -y && apt-get clean && updatedb

The base packages script does some updating of package lists and then pulls in a range of essential utility packages, some of which are actually required for builds further down the line.

## ./quickbuild/basepackages.sh

#!/usr/bin/env bash

#Build script for building machine
apt-get clean && apt-get -y update && apt-get -y upgrade && apt-get install -y bash-completion vim curl zip unzip bzip2 && apt-get install -y build-essential gcc && apt-get install -y g++ gfortran && apt-get install -y libatlas-base-dev libfreetype6-dev libpng-dev libhdf5-serial-dev && apt-get install -y git python3 python3-dev python3-pip && pip3 install --upgrade pip

The application build files install additional packages specific to the application or its build process. We had some issues with service starts in the original VM (Ubuntu 14.04 LTS), but the service management in Ubuntu 16.04 LTS is much cleaner – and in my own testing so far, much more reliable.

# ./quickbuild/openrefine/openrefine.sh
#!/bin/bash

THISDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

apt-get -y update && apt-get install -y wget ant unzip openjdk-8-jre-headless && apt-get clean -y

echo "Setting up OpenRefine: "

#Prep for download
mkdir -p /opt
mkdir -p /root

if [ ! -f /opt/openrefine.done ]; then
	echo "Downloading OpenRefine..."
	wget -q --no-check-certificate  -P /root https://github.com/OpenRefine/OpenRefine/releases/download/2.7-rc.2/openrefine-linux-2.7-rc.2.tar.gz
	echo "...downloaded OpenRefine"

	echo "Unpacking OpenRefine..."
	tar -xzf /root/openrefine-linux-2.7-rc.2.tar.gz -C /opt  && rm /root/openrefine-linux-2.7-rc.2.tar.gz
	#Unpacks to: /opt/openrefine-2.7-rc.2
	touch /opt/openrefine.done
	echo "...unpacked OpenRefine"
else
	echo "...already downloaded and unpacked OpenRefine"
fi

cp $THISDIR/services/refine.service /lib/systemd/system/refine.service

# Enable autostart
sudo systemctl enable refine.service

# Refresh service config
sudo systemctl daemon-reload

#(Re)start service
sudo systemctl restart refine.service

Applications are run as services, where possible. If I get a chance – and space/resource requirements allow – I made add some service monitoring to try to ensure application services are always running when the VM is running.

## ./quickbuild/openrefine/services/refine.service
[Unit]
Description=Refine

#When to bring the service up
#via https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
#Wait for a network stack to appear
After=network.target
#If we actually need the network to have a routable IP address:
#After=network-online.target 

[Service]
Environment=REFINE_HOST=0.0.0.0
ExecStart=/opt/openrefine-2.7-rc.2/refine -p 3334 -d /vagrant/openrefine_projects
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Everything can be packaged up in a zip file with a command (tuned to omit Mac cruft, in part) of the form:

zip -r quickbuild.zip quickbuild -x *.vagrant* -x *.DS_Store -x *.git* -x *.ipynb_checkpoints*

So those are the files and the basic outline. Our initial plan is to run the VMs once again locally on a student’s own machine, using Virtualbox. I think we’ll stick with vagrant to manage this, not least because we can issue updates via new Vagrantfiles, not that we’ve done that to date…

By the by, I’m running vagrant with a handful of plugins:

#Speed up repeated builds
vagrant plugin install vagrant-cachier

#Use correct Virtualbox Guest Additions
vagrant plugin install vagrant-vbguest

#Help with provisioning to virtual hosts
vagrant plugin install vagrant-digitalocean
vagrant plugin install vagrant-linode
vagrant plugin install vagrant-aws

The following Vagrantfile builds the local Virtualbox instance by default. To build to DOgital Ocean or Linode, use the following:

  • vagrant up --provider=digital_ocean
  • vagrant up --provider=linode

I didn’t get the AWS vagrant provisioner to work (too many things to go wrong in terms of settings!)

The Linode build also required a hack to get the box to build correctly…

# ./quickbuild/Vagrantfile

#Vagrantfile for building machine from build scripts

Vagrant.configure("2") do |config|

#------------------------- PROVIDER: VIRTUALBOX (BUILD) ------------------------------

  config.vm.provider :virtualbox do |virtualbox|

      #ubuntu/xenial bug? https://bugs.launchpad.net/cloud-images/+bug/1569237
      config.vm.box = "bento/ubuntu-16.04"
      #Stick with the default key
      config.ssh.insert_key=false

      #For local testing:
      #config.vm.box = "tm351basebuild"
      #override.vm.box_url = "eg URL on dropbox"
      #config.vm.box_url = "../boxes/test.box"

      config.vm.hostname = "tm351base"

      virtualbox.name = "tm351basebuildbuild"
      #We need the memory to install scipy and build indexes on seeded mongodb
      #After the build it can be reduced back down to 1024
      virtualbox.memory = 2048
      #virtualbox.cpus = 1
      # virtualbox.gui = true

      #---- START PORT FORWARDING ----
      #Registered ports: https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
      #openrefine
      config.vm.network :forwarded_port, guest: 3334, host: 35101, auto_correct: true

      #---- END PORT FORWARDING ----
    end

#------------------------- END PROVIDER: VIRTUALBOX (BUILD) ------------------------------

#------------------------- PROVIDER: DIGITAL OCEAN ------------------------------

config.vm.provider :digital_ocean do |provider, override|
		override.ssh.insert_key=true
        override.ssh.private_key_path = '~/.ssh/id_rsa'
        override.vm.box = 'digital_ocean'
        override.vm.box_url = "https://github.com/devopsgroup-io/vagrant-digitalocean/raw/master/box/digital_ocean.box"
        provider.token = 'YOUR_TOKEN'
        provider.image = 'ubuntu-16-04-x64'
        provider.region = 'lon1'
        provider.size = '2gb'

  end

#------------------------- END PROVIDER: DIGITAL OCEAN ------------------------------

#------------------------- PROVIDER: LINODE ------------------------------

config.vm.provider :linode do |provider, override|
    override.ssh.insert_key=true
    override.ssh.private_key_path = '~/.ssh/id_rsa'
    override.vm.box = 'linode/ubuntu1604'

    provider.api_key = 'YOUR KEY'
    provider.distribution = 'Ubuntu 16.04 LTS'
    provider.datacenter = 'london'
    provider.plan = 'Linode 2048'
    provider.size=2048

    #grub needs updating - but want's to do it interactively
    #this bit of voodoo from Stack Overflow hacks a non-interactive install of it
    override.vm.provision :shell, :inline => <<-SH
    	apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y -o DPkg::options::="--force-confdef" -o DPkg::options::="--force-confold"  install grub-pc
	SH

  end

#------------------------- END PROVIDER: LINODE ------------------------------

#------------------------- PROVIDER: AWS ------------------------------

  #  I DIDN'T GET THIS TO WORK - MAYBE SEVERAL THINGS WRONG HERE - AND IN AWS SETTINGS ????

  config.vm.provider :aws do |aws, override|
  	config.vm.hostname = "tm351aws"
  	#vagrant box add dummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box
    override.vm.box = "dummy"
    aws.access_key_id = ""
    aws.secret_access_key = ""

    #https://github.com/mitchellh/vagrant-aws/issues/405#issuecomment-130342371
    #Download and install the Amazon Command Line Interface
    #http://docs.aws.amazon.com/cli/latest/userguide/installing.html
    #Configure the command line interface
    #http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
    #$aws configure
    #Request the session token
    #$aws sts get-session-token --duration-seconds 129600 (enter your own duration)
    aws.session_token = ""

    #Keypair also generated via AWS console?
    aws.keypair_name = "vagrantAWSkeypair"

    aws.region = "eu-west-2a"
    aws.ami = "ami-ed908589"
    aws.instance_type="t2.small"

    override.ssh.username = "ubuntu"
    override.ssh.private_key_path =  '~/.ssh/id_rsa'

  end

  # NOTE THAT RUNNING THIS PROVISIONER MAY LEAVE THINGS BILL INCURRING ON AWS... SO CHECK

#------------------------- END PROVIDER: AWS ------------------------------

#------------------------------

  config.vm.provision :shell, :inline => <<-SH
  	#Add build scripts here
  	cd /vagrant/build
  	source ./quick_build.sh
  SH

end

(The vagrant script can be tidied to hide keys by setting eg export DIGITAL_OCEAN_TOKEN="YOUR TOKEN HERE" from the command line you call vagrant from, and in the Vagrantfile setting provider.token = ENV['DIGITAL_OCEAN_TOKEN']).)

One of the nice things about the current version of vagrant is that you have to destroy a machine before launching another one of the same name with a different provisioners (though this looks set to change in forthcoming versions of vagrant). Why nice? Because the vagrant destroy command kills the node the machine is running on – so it won’t be left running and you won’t forget to turn it off (and won’t keep the meter running….)

Firing up the boxes on various hosts, go to port 3334 at the appropriate IP address and you should see OpenRefine running there…

Having failed to get the machine up and running on AWS, I though I’d try the simple route of packaging an AMI using Packer.

The build script was remarkably simple – once I got one that worked!

#awspacker.json

{
  "variables": {
    "aws_access_key": "",
    "aws_secret_key": ""
  },
  "builders": [{
    "type": "amazon-ebs",
    "access_key": "{{user `aws_access_key`}}",
    "secret_key": "{{user `aws_secret_key`}}",
    "region": "eu-west-1",
    "source_ami": "ami-971238f1",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "openrefine",
    "security_group_id": "OPTIONAL_YOUR_VAGRANT_GROUP"
  }],

  "provisioners": [

    {
      "destination": "/tmp/",
      "source": "./toupload/",
      "type": "file"
    },
    {
      "inline": [
        "cd /tmp && sudo apt-get update && sudo apt-get install unzip && sudo unzip /tmp/quickbuild.zip -d /tmp && sudo chmod ugo+x /tmp/quickbuild/quick_build.sh && sudo /tmp/quickbuild/quick_build.sh "
      ],
      "type": "shell"
    }
  ]

}

(The eu-west-2 (London) region wasn’t recognised by Packer for some reason…)

The machine can now be built on AWS and packaged as an AMI using Packer as follows (top level security tokens can be generated from the AWS Security Credentials console):

#Package the build files
mkdir -p toupload && zip -r toupload/quickbuild.zip quickbuild -x *.vagrant* -x *.DS_Store -x *.git* -x *.ipynb_checkpoints*

#Pack the machine
packer build -var 'aws_access_key=YOUR_KEY' -var 'aws_secret_key=YOUR_SECRET' awspacker.json

Launching an instance of this AMI, I found that I couldn’t connect to the OpenRefine port (it just hung). The fix was to amend the automatically created security group rules (which by default just allow ssh on port 22) with a a Custom TCP rule that allowed incoming traffic on port 3334 from All Domains.

Which meant success:

To simplify matters, I then copied this edited security group to my own “openrefine” security group that I could use as the basis of the AMI packaging.

Just one thing to note about creating an AMI – Amazon will start billing you for it… As the Packer Getting Started guide suggests:

After running the above example, your AWS account now has an AMI associated with it. AMIs are stored in S3 by Amazon, so unless you want to be charged about $0.01 per month, you’ll probably want to remove it. Remove the AMI by first deregistering it on the AWS AMI management page. Next, delete the associated snapshot on the AWS snapshot management page.

Next up, I need to try a full build of the TM351 VM on AWS (a full build without the Mongo shard activity (which I couldn’t get to work yesterday – though this looks like it could provide a handy helper script (and I maybe also need to work through this.) The fuller build seems fine from the vagrant script in Virtualbox, Digital Ocean and Linode.

After that (and fixing the Mongo sharding thing), I’ll see if I can weave the build scripts into a set of interconnected Docker containers, one Dockerfile per application and a docker-compose.yml to weave them together. (See the original test from way back when.)

And then there’ll just be the look-see to see whether we can get the machine built and running on a Raspberry Pi 3 model B.

I also started wondering about whether I should pop a simple Flask app into the VM on port 80, showing an OU splash screen and a “Welcome to TM351” message… If I can get that running, then we have a means of piping stuff into a web page on the students’ own machines that is completely out of the controlling hands of LTS:-)

PS for an example of how to set up authentication over these services, see: Simple Authenticated Access to VM Services Using NGINX and Vagrant Port Forwarding.

Running Docker Container Compositions in the Cloud on Digital Ocean

With TM351 about to start, I thought I’d revisit the docker container approach to delivering the services required by the course to see if I could get something akin to the course software running in the cloud.

Copies of the dockerfiles used to create the images can be found on Github, with prebuilt images available on dockerhub (https://hub.docker.com/r/psychemedia/ou-tm351-*).

A couple of handy ones out of the can are:

That said, at the current time, the images are not intended for use as part of the course

The following docker-compose.yml file will create a set of linked containers that resemble (ish!) the monolithic VM we distributed to students as a Virtualbox box.

dockerui:
    container_name: tm351-dockerui
    image: dockerui/dockerui
    ports:
        - "35183:9000"
    volumes:
        - /var/run/docker.sock:/var/run/docker.sock
    privileged: true

devmongodata:
    container_name: tm351-devmongodata
    command: echo mongodb_created
    #Share same layers as the container we want to link to?
    image: mongo:3.0.7
    volumes: 
        - /data/db

mongodb:
    container_name: tm351-mongodb
    image: mongo:3.0.7
    ports:
        - "27017:27017"
    volumes_from:
        - devmongodata
    command: --smallfiles

mongodb-seed:
    container_name: tm351-mongodb-seed
    image: psychemedia/ou-tm351-mongodb-simple-seed
    links:
        - mongodb

devpostgresdata:
    container_name: tm351-devpostgresdata
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data
 
postgres:
    container_name: tm351-postgres
    environment:
        - POSTGRES_PASSWORD=PGPass
    image: psychemedia/ou-tm351-postgres
    ports:
        - "5432:5432"

openrefine:
    container_name: tm351-openrefine
    image: psychemedia/tm351-openrefine
    ports:
        - "35181:3333"
    privileged: true
    
notebook:
    container_name: tm351-notebook
    #build: ./tm351_scipystacknserver
    image: psychemedia/ou-tm351-pystack
    ports:
        - "35180:8888"
    links:
        - postgres:postgres
        - mongodb:mongodb
        - openrefine:openrefine
    privileged: true

Place a copy of the docker-compose.yml YAML file somewhere, from Kitematic, open the command line, cd into the directory containing the YAML file, and enter the command docker-compose up -d – the images are on dockerhub and should be downloaded automatically.

Refer back to Kitematic and you should see running containers – the settings panel for the notebooks container shows the address you can find the notebook server at.

kitematic

The notebooks and OpenRefine containers should also be linked to shared folders in the directory you ran the Docker Compose script from.

Running the Containers in the Cloud – Docker-Machine and Digital Ocean

As well as running the linked containers on my own machine, my real intention was to see how easy it would be to get them running in the cloud and using just the browser on my own computer to access them.

And it turns out to be really easy. The following example uses cloud host Digital Ocean.

To start with, you’ll need a Digital Ocean account with some credit in it and a Digital Ocean API token:

DigitalOcean_Control_Panel

(You may be able to get some Digital Ocean credit for free as part of the Github Education Student Developer Pack.)

Then it’s just a case of a few command line instructions to get things running using Docker Machine:

docker-machine ls
#kitematic usess: default

#Create a droplet on Digital Ocean
docker-machine create -d digitalocean --digitalocean-access-token YOUR_ACCESS_TOKEN --digitalocean-region lon1 --digitalocean-size 4gb ou-tm351-test 

#Check the IP address of the machine
docker-machine ip ou-tm351-test

#Display information about the machine
docker-machine env ou-tm351-test
#This returns necessary config details
#For example:
##export DOCKER_TLS_VERIFY="1"
##export DOCKER_HOST="tcp://IP_ADDRESS:2376"
##export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"
##export DOCKER_MACHINE_NAME="ou-tm351-test"
# Run this command to configure your shell: 
# eval $(docker-machine env ou-tm351-test)

#Set the environment variables as recommended
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://IP_ADDRESS:2376"
export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"

#Run command to set current docker-machine
eval "$(docker-machine env ou-tm351-test)"

#If the docker-compose.yml file is in .
docker-compose up -d
#This will launch the linked containers on Digital Ocean

#The notebooks should now be viewable at:
#http://IP_ADDRESS:35180

#OpenRefine should now be viewable at:
#http://IP_ADDRESS:35181

#To stop the machine
docker-machine stop ou-tm351-test
#To remove the Digital Ocean droplet (so you stop paying for it...
docker-machine rm ou-tm351-test

#Reset the current docker machine to the Kitematic machine
eval "$(docker-machine env default)"

So that’s a start. Issues still arise in terms of persisting state, such as the database contents, notebook files* and OpenRefine projects: if you leave the containers running on Digital Ocean to persist the state, the meter will keep running.

(* I should probably also build a container that demos how to bake a few example notebooks into a container running the notebook server and TM351 python distribution.)

A Peek Inside the TM351 VM

So this is how I currently think of the TM351 VM:

OU-VM-June2015Review_pptx

What would be nice would be a drag’n’drop tool to let me draw pictures like that that would then generate the build scripts… (a docker compose script, or set of puppter scripts, for the architectural bits on the left, and a Vagrantfile to set up the port forwarding, for example).

For docker, I wouldn’t have thought that would be too hard – a docker compose file could describe most of that picture, right? Not sure how fiddly it would be for a more traditional VM, though, depending on how it was put together?

Exporting and Distributing Docker Images and Data Container Contents

Although it was a beautiful day today, and I should really have spent it in the garden, or tinkering with F1 data, I lost the day to the screen and keyboard pondering various ways in which we might be able to use Kitematic to support course activities.

One thing I’ve had on pause for some time is the possibility of distributing docker images to students via a USB stick, and then loading them into Kitematic. To do this we need to get tarballs of the appropriate images so we could then distribute them.

docker save psychemedia/openrefine_ou:tm351d2test | gzip -c > test_openrefine_ou.tgz
docker save psychemedia/tm351_scipystacknserver:tm351d3test | gzip -c > test_ipynb.tgz
docker save psychemedia/dockerui_patch:tm351d2test | gzip -c > test_dockerui.tgz
docker save busybox:latest | gzip -c > test_busybox.tgz
docker save mongo:latest | gzip -c > test_mongo.tgz
docker save postgres:latest | gzip -c > test_postgres.tgz

On the to do list is getting to these to with the portable Kitematic branch (I’m not sure if that branch will continue, or whether the interest is too niche?!), but in the meantime, I could load it into the Kitematic VM from the Kitematice CLI using:

docker load < test_mongo.tgz

assuming the test_mongo.tgz file is in the current working directory.

Another I need to explore is how to get the set up the data volume containers on the students’ machine.

The current virtual machine build scripts aim to seed the databases from raw data, but to set up the student machines it would seem more sensible to either rebuild a database from a backup, or just load in a copy of the seeded data volume container. (All the while we have to be mindful of providing a route for the students to recreate the original, as distributed, setup, just in case things go wrong. At the same time, we also need to start thing about backup strategies for the students so they can checkpoint their own work…)

The traditional backup and restore route for PostgreSQL seems to be something like the following:

#Use docker exec to run a postgres export
docker exec -t vagrant_devpostgres_1 pg_dumpall -Upostgres -c &gt; dump_`date +%d-%m-%Y"_"%H_%M_%S`.sql
#If it's a large file, maybe worth zipping: pg_dump dbname | gzip > filename.gz

#The restore route would presumably be something like:
cat postgres_dump.sql | docker exec -i vagrant_devpostgres_1 psql -Upostgres
#For the compressed backup: cat postgres_dump.gz | gunzip | psql -Upostgres

For mongo, things seem to be a little bit more complicated. Something like:

docker exec -t vagrant_mongo_1 mongodump

#Complementary restore command is: mongorestore

would generate a dump in the container, but then we’d have to tar it and get it out? Something like these mongodump containers may be easier? (mongo seems to have issues with mounting data containers on host, on a Mac at least?

By the by, if you need to get into a container within a Vagrant launched VM (I use vagrant with vagrant-docker-compose), the following shows how:

#If you need to get into a container:
vagrant ssh
#Then in the VM:
  docker exec -it CONTAINERNAME bash

Another way of getting to the data is to export the contents of the seeded data volume containers from the build machine. For example:

#  Export data from a data volume container that is linked to a database server

#postgres
docker run --volumes-from vagrant_devpostgres_1 -v $(pwd):/backup busybox tar cvf /backup/postgresbackup.tar /var/lib/postgresql/data 

#I wonder if these should be run with --rm to dispose of the temporary container once run?

#mongo - BUT SEE CAVEAT BELOW
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db

We can then take the tar file, distribute it to students, and use it to seed a data volume container.

Again, from the Kitematic command line, I can run something like the following to create a couple of data volume containers:

#Create a data volume container
docker create -v /var/lib/postgresql/data --name devpostgresdata busybox true
#Restore the contents
docker run --volumes-from devpostgresdata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/postgresbackup.tar"
#Note - the docker helpfiles don't show how to use sh -c - which appears to be required...
#Again, I wonder whether this should be run with --rm somewhere to minimise clutter?

Unfortunately, things don’t seem to run so smoothly with mongo?

#Unfortunately, when trying to run a mongo server against a data volume container
#the presence of a mongod.lock seems to break things
#We probably shouldn't do this, but if the database has settled down and completed
#  all its writes, it should be okay?!
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db --exclude=*mongod.lock
#This generates a copy of the distributable file without the lock...

#Here's an example of the reconstitution from the distributable file for mongo
docker create -v /data/db --name devmongodata busybox true
docker run --volumes-from devmongodata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/mongobackup.tar"

(If I’m doing something wrong wrt the getting the mongo data out of the container, please let me know… I wonder as well with the cavalier way I treat the lock file whether the mongo container should be started up in repair mode?!)

If have a docker-compose.yml file in the working directory like the following:

mongo:
  image: mongo
  ports:
    - "27017:27017"
  volumes_from:
    - devmongodata

##We DO NOT need to declare the data volume here
#We have already created it
#Also, if we leave it in, a "docker-compose rm" command
#will destroy the data volume container...
#...which means we wouldn't persist the data in it
#devmongodata:
#    command: echo created
#    image: busybox
#    volumes: 
#        - /data/db

We can the run docker-compose up and it should fire up a mongo container and link it to the seeded data volume container, making the data contains in that data volume container available to us.

I’ve popped some test files here. Download and unzip, from the Kitematic CLI cd into the unzipped dir, create and populate the data containers as above, then run: docker-compose up

You should be presented with some application containers including OpenRefine and an OU customised IPython notebook server. You’ll need to mount the IPython notebooks folder onto the unzipped folder. The example notebook (if everything works!) should show demonstrate calls to prepopulated mongo and postgres databases.

Hopefully!

Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?

A week or so ago I came across a couple of IPython notebooks produced by Catherine Devlin covering the maintenance and tuning of a PostgreSQL server: DB Introspection Notebook (example 1: introspection, example 2: tuning, example 3: performance checklist). One of the things we have been discussing in the TM351 course team meetings is the extent to which we “show our working” to students in terms how the virtual machine and the various databases used in the course were put together, even if we don’t actually teach that stuff.

Notebooks make an ideal way of documenting the steps taken to set up a particular system, blending commentary with command line as well as code executable cells.

The various approaches I’ve explored to build the VM have arguably been over-complex – vagrant, puppet, docker and docker-compose – but I’ve always seen the OU as a place where we explore the technologies we’re teaching – or might teach – in the context of both course production and course delivery (that is, we can often use a reflexive approach whereby the content of the teaching also informs the development and delivery of the teaching).

In contrast, in A DevOps Approach to Common Environment Educational Software Provisioning and Deployment I referred to a couple of examples of a far simpler approach, in which common research, or research and teaching, VM environments were put together using simple scripts. This approach is perhaps more straightforwardly composable, in that if someone has additional requirements of the machine, they can just add a simple configuration script to bring in those elements.

In our current course example, where the multiple authors have a range of skill and interest levels when it comes to installing software and exploring different approaches to machine building, I’m starting to wonder whether I should have started with a simple base machine running just an IPython notebook server and no additional libraries or packages, and then created series of notebooks, one for each part of the course (which broadly breaks down to one part per author), containing instructions for installing all the bits and pieces required for just that part of the course. If there’s duplication across parts, trying to install the same thing for each part, that’s fine – the various package managers should be able to cope with that. (The only issue would arise if different authors needed different versions of the same package, for some reason, and I’m not sure what we’d do in that case?)

The notebooks would then include explanatory documentation and code cells to install Linux packages and python packages. Authors could take over the control of setup notebooks, or just make basic requests. At some level, we might identify a core offering (for example, in our course, this might be the inclusion of the pandas package) that might be pushed up into a core configuration installation notebook executed prior to the installation notebook for each part.

Configuring the machine would then be a case of running the separate configuration notebooks for each part (perhaps preceded by a core configuration notebook), perhaps by automated means. For example, ipython nbconvert --to=html --ExecutePreprocessor.enabled=True configNotebook_1.ipynb will [via StackOverflow]. This generates an output HTML report from running the code cells in the notebook (which can include command line commands) in a headless IPython process (I think!).

The following switch may also be useful (it clears the output cells): ipython nbconvert --to=pdf --ExecutePreprocessor.enabled=True --ClearOutputPreprocessor.enabled=True RunMe.ipynb (note in this case we generate a PDF report).

To build the customised VM box, the following route should work:

– set up a simple Vagrant file to import a base box
– install IPython into the box
– copy the configuration notebooks into the box
– run the configuration notebooks
– export the customised box

This approach has the benefits of using simple, literate configuration scripts described within a notebook. This makes them perhaps a little less “hostile” than shell scripts, and perhaps makes it easier to build in tests inline, and report on them nicely. (If a running a cell results in an error, I think the execution of the notebook will stop at that point?) The downside is that to run the notebooks, we also need to have IPython installed first.

A DevOps Approach to Common Environment Educational Software Provisioning and Deployment

In Distributing Software to Students in a BYOD Environment, I briefly reviewed a paper that described a paper that reported on the use of Debian metapackages to support the configuration of Linux VMs for particular courses (each course has its own Debian metapackage that could install all the packages required for that course).

This idea of automating the build of machines comes under the wider banner of DevOps (development and operations). In a university setting, we might view this in several ways:

  • the development of course related software environments during course production, the operational distribution and deployment of software to students, updating and support of the software in use, and maintenance and updating of software between presentations of a course;
  • the development of software environments for use in research, the operation of those environments during the lifetime of a resarch project, and the archiving of those environments;
  • the development and operation of institutional IT services.

In an Educause review from 2014 (Cloud Strategy for Higher Education: Building a Common Solution, EDUCAUSE Center for Analysis and Research (ECAR) Research Bulletin, November, 2014 [link]), a pitch for universities making greater use of cloud services, the authors make the observation that:

In order to make effective use of IaaS [Infrastructure as a Service], an organization has to adopt an automate-first mindset. Instead of approaching servers as individual works of art with artisan application configurations, we must think in terms of service-level automation. From operating system through application deployment, organizations need to realize the ability to automatically instantiate services entirely from source code control repositories.

This is the approach I took from the start when thinking about the TM351 virtual machine, focussing more on trying to identify production, delivery, support and maintenance models that might make sense in a factory production model that should work in a scaleable way not only across presentations of the same course, as well as across different courses, but also across different platforms (students own devices, OU managed cloud hosts, student launched commercial hosts) rather than just building a bespoke, boutique VM for a single course. (I suspect the module team would have preferred my focussing on the latter – getting something that works reliably, has been rigorously tested, and can be delivered to students rather than pfaffing around with increasingly exotic and still-not-really-invented-yet tooling that I don’t really understand to automate production of machines from scratch that still might be a bit flaky!;-)

Anyway, it seems that the folk at Berkeley have been putting together a “Common Scientific Compute Environment for Research and Education” [Clark, D., Culich, A., Hamlin, B., & Lovett, R. (2014). BCE: Berkeley’s Common Scientific Compute Environment for Research and Education, Proceedings of the 13th Python in Science Conference (SciPy 2014).]


The BCE – Berkeley Common Environment – is “a standard reference end-user environment” consisting of a simply skinned Linux desktop running in virtual machine delivered as a Virtualbox appliance that “allows for targeted instructions that can assume all features of BCE are present. BCE also aims to be stable, reliable, and reduce complexity more than it increases it”. The development team adopted a DevOps style approach customised for the purposes of supporting end-user scientific computing, arising from the recognition that they “can’t control the base environment that users will have on their laptop or workstation, nor do we wish to! A useful environment should provide consistency and not depend on or interfere with users’ existing setup”, further “restrict[ing] ourselves to focusing on those tools that we’ve found useful to automate the steps that come before you start doing science. Three main frames of reference were identified:

  • instructional: students could come from all backgrounds and often unlikely to have sys admin skills over and above the ability to use a simple GUI approach to software installation: “The most accessible instructions will only require skills possessed by the broadest number of people. In particular, many potential students are not yet fluent with notions of package management, scripting, or even the basic idea of commandline interfaces. … [W]e wish to set up an isolated, uniform environment in its totality where instructions can provide essentially pixel-identical guides to what the student will see on their own screen.”
  • scientific collaboration: that is, the research context: “It is unreasonable to expect any researcher to develop code along with instructions on how to run that code on any potential environment.” In addition, “[i]t is common to encounter a researcher with three or more Python distributions installed on their machine, and this user will have no idea how to manage their command-line path, or which packages are installed where. … These nascent scientific coders will have at various points had a working system for a particular task, and often arrive at a state in which nothing seems to work.”
  • Central support: “The more broadly a standard environment is adopted across campus, the more familiar it will be to all students”, with obvious benefits when providing training or support based on the common environment.

Whilst it was recognised the personal laptop computers are perhaps the most widely used platform, the team argued that the “environment should not be restricted to personal computers”. Some scientific computing operations are likely to stretch the resources of a personal laptop, so the environment should also be capable of running on other platforms such as hefty workstations or on a scientific computing cloud.

The first consideration was to standardise on an O/S: Linux. Since the majority of users don’t run Linux machines, this required the use of a virtual machine (VM) to host the Linux system, whilst still recognising that “one should assume that any VM solution will not work for some individuals and provide a fallback solution (particularly for instructional environments) on a remote server”.

Another issue that can arise is dealing with mappings between host and guest OS, which vary from system to system – arguing for the utility of an abstraction layer for VM configuration like Vagrant or Packer … . This includes things like portmapping, shared files, enabling control of the display for a GUI vs. enabling network routing for remote operation. These settings may also interact with the way the guest OS is configured.

Reflecting on the “traditional” way of building a computing environment, the authors argued for a more automated approach:

Creating an image or environment is often called provisioning. The way this was done in traditional systems operation was interactively, perhaps using a hybrid of GUI, networked, and command-line tools. The DevOps philosophy encourages that we accomplish as much as possible with scripts (ideally checked into version control!).

The tools explored included Ansible, packer, vagrant and docker:

  • Ansible: to declare what gets put into the machine (alternatives include shell scripts, puppet etc. (For the TM351 monolithic VM, I used puppet.) End-users don’t need to know anything about Ansible, unless they want to develop a new, reproducible, custom environment.
  • packer: used to run the provisioners and construct and package up a base box. Again, end-users don’t need to know anything about this. (For the TM351 monolithic VM, I used vagrant to build a basebox in Virtualbox, and then package it; the power of Packer is that is lets you generate builds from a common source for a variety of platforms (AWS, Virtualbox, etc etc).)
  • vagrant: their description is quite a handy one: “a wrapper around virtualization software that automates the process of configuring and starting a VM from a special Vagrant box image … . It is an alternative to configuring the virtualization software using the GUI interface or the system-specific command line tools provided by systems like VirtualBox or Amazon. Instead, Vagrant looks for a Vagrantfile which defines the configuration, and also establishes the directory under which the vagrant command will connect to the relevant VM. This directory is, by default, synced to the guest VM, allowing the developer to edit the files with tools on their host OS. From the command-line (under this directory), the user can start, stop, or ssh into the Vagrant-managed VM. It should be noted that (again, like Packer) Vagrant does no work directly, but rather calls out to those other platform-specific command-line tools.” However, “while Vagrant is conceptually very elegant (and cool), we are not currently using it for BCE. In our evaluation, it introduced another piece of software, requiring command-line usage before students were comfortable with it”. This is one issue we are facing with the TM351 VM – current the requirement to use vagrant to manage the VM from the commandline (albeit this only really requires a couple of commands – we can probably get away with just: vagrant up && vagrant provision and vagrant suspend – but also has a couple of benefits, like being able to trivially vagrant ssh in to the VM if absolutely necessary…).
  • docker: was perceived as adding complexity, both computationally and conceptually: “Docker requires a Linux environment to host the Docker server. As such, it clearly adds additional complexity on top of the requirement to support a virtual machine. … the default method of deploying Docker (at the time of evaluation) on personal computers was with Vagrant. This approach would then also add the complexity of using Vagrant. However, recent advances with boot2docker provide something akin to a VirtualBox-only, Docker-specific replacement for Vagrant that eliminates some of this complexity, though one still needs to grapple with the cognitive load of nested virtual environments and tooling.” The recent development of Kitematic addresses some of the use-case complexity, and also provides GUI based tools for managing some of the issues described above associate with portmapping, file sharing etc. Support for linked container compositions (using Docker Compose) is still currently lacking though…

At the end of the day, Packer seems to rule them all – coping as it does with simple installation scripts and being able to then target the build for any required platform. The project homepage is here: Berkeley Common Environment and the github repo here: Berkeley Common Environment (Github).

The paper also reviewed another common environment – OSGeo. Once again built on top of a common Linux base, well documented shell scripts are used to define package installations: “[n]otably, the project uses none of the recent DevOps tools. OSGeo-Live is instead configured using simple and modular combinations of Python, Perl and shell scripts, along with clear install conventions and examples. Documentation is given high priority. … Scripts may call package managers, and generally have few constraints (apart from conventions like keeping recipes contained to a particular directory)”. In addition, “[s]ome concerns, like port number usage, have to be explicitly managed at a global level”. This approach contrasts with the approach reviewed in Distributing Software to Students in a BYOD Environment where Debian metapackages were used to create a common environment installation route.


The idea of a common environment is a good one, and that would work particularly well in a curriculum such as Computing, I think? One main difference between the BCE approach and the TM351 approach is that BCE is self-contained and runs a desktop environment within the VM, whereas the TM351 environment uses a headless VM and follows more of a microservice approach that publishes HTML based service UIs via http ports that can be viewed in a browser. One disadvantage of the latter approach is that you need to keep a more careful eye on port assignments (in the event of collisions) when running the VM locally.

Distributing Software to Students in a BYOD Environment

Reading around a variety of articles on the various ways of deploying software in education, it struck me that in traditional institutions a switch is may be taking place between students making use of centrally provided computing services – including physical access to desktop computers – to students bringing their own devices on which they may want to run the course software themselves. In addition, traditional universities are also starting to engage increasingly with their own distance education students; and the rise of the MOOCs are based around the idea of online course provision – that is, distance education.

The switch from centrally provided computers to a BYOD regime contrasts with the traditional approach in distance education in which students traditionally provided their own devices and onto which they installed software packaged and provided by their educational institution. That is, distance education students have traditionally been BYOD users.

However, in much the same way that the library in a distance education institution like the OU could not originally provide physical information (book lending) services to students, instead brokering access agreements with other HE libraries, but now can provide a traditional a traditional library service through access to digital collections, academic computing services are perhaps now more in a position where they can provide central computing services, at scale, to their students. (Contributory factors include: readily available network access for students, cheaper provider infrastructure costs (servers, storage, bandwidth, etc).)

With this in mind, it is perhaps instructive for those of us working in distance education to look at how the traditional providers are coping with an an influx of BYOD users, and how they are managing access to, and the distribution of, software to this newly emerging class of user (for them) whilst at the same time continuing to provide access to managed facilities such as computing labs and student accessed machines.


Notes from: Supporting CS Education via Virtualization and Packages – Tools for Successfully Accommodating “Bring-Your-Own-Device” at Scale, Andy Sayler, Dirk Grunwald, John Black, Elizabeth White, and Matthew Monaco SIGCSE’14, March 5–8, 2014, Atlanta, GA, USA [PDF]

The authors describe “a standardized development environment for all core CS courses across a range of both school-owned and student-owned computing devices”, leveraging “existing off-the-shelf virtualization and software management systems to create a common virtual machine that is used across all of our core computer science courses”. The goal was to “provide students with an easy to install and use development environment that they could use across all their CS courses. The development environment should be available both on department lab machines, and as a VM for use on student-owned machines (e.g. as a ‘lab in a box’).”

From the student perspective, our solution had to: a) Run on a range of host systems; b) Be easy to install; c) Be easy to use and maintain; d) Minimize side-effects on the host system; e) Provide a stable experience throughout the semester.

From the instructor perspective, our solution had to: a) Keep the students happy; b) Minimize instructor IT overhead; c) Provide consistent results across student, grader, and instructor machines; d) Provide all necessary software for the course; e) Provide the ability to update software as the course progresses.

Virtualbox was adopted on the grounds that it runs cross-platform, is free, open source software, and has good support for running Linux guest machines. The VM was based on Ubuntu 12.04 (presumably the long term edition available at the time) and distributed as an .ova image.

To support the distribution of software packages for a particular course, Debian metapackages (that simply list dependencies; in passing, I note that the Anaconda python distribution supports the notion of python (conda) metapackages, but pip does not, specifically?) were created on a per course basis that could be used via apt-get to install all the necessary packages required for a particular course (example package files).

In terms of student support, the team published “a central web-page that provides information about the VM, download links, installation instructions, common troubleshooting steps, and related self-help information” along with “YouTube videos describing the installation and usage of the VM”. Initial distribution is provided using BitTorrent. Where face-to-face help sessions are required, VM images are provided on USB memory sticks to avoid download time delays. Backups are handled by bundling Dropbox into the VM and encouraging students to place their files there. (Github is also used.)

The following observation is useful in respect of student experience of VM performance:

“Modern CPUs provide extensions that enable a fast, smooth and enjoyable VM experience (i.e. VT-x). Unfortunately, many non-Apple PC manufacturers ship their machines with these extension disabled in the BIOS. Getting students to enable these extensions can be a challenge, but makes a big difference in their overall impression of VM usability. One way to force students to enable these extensions is to use a 64-bit and/or multi-core VM, which VirtualBox will not start without virtualization extensions enabled.”

The open issues identified by the team are the issue of virtualisation support; corrupted downloads of the VM (mitigation includes publishing a checksum for the VM and verifying against this); and the lack of a computer capable of running the VM (ARM devices, low specification Intel Atom computers). [On this latter point, it may be worth highlighting the distinction between hardware that cannot cope with running computationally intensive applications, hardware that has storage limitations, and hardware that cannot run particular virtualisation services (for example, that cannot run x86 virtualisation). See also: What Happens When “Computers” Are Replaced by Tablets and Phones?]


The idea of using package management is attractive, and contrasts with the approach I took when hacking together the TM351 VM using vagrant and puppet scripts. It might make sense to further abstract the machine components into a Debian metapackage and a simple python/pip “meta” package (i.e. one that simply lists dependencies). The result would be an installation reduced to a couple of lines of the form:

apt-get install ou_tm351=15J.0
pip install ou_tm351==15J.0

where packages are versioned to a particular presentation of an OU course, with a minor version number to accommodate any updates/patches. One downside to this approach is that it splits co-dependency relationships between python and Debian packages relative to a particular application. In the current puppet build files for the monolithic VM build, each application has its own puppet file that installs the additional libraries over base libraries required for a particular application. (In addition, particular applications can specify dependencies on base libraries.) For the dockerised VM build, each container image has it’s own Dockerfile that identifies the dependencies for that image.

Tracing its history (and reflecting the accumulated clutter of my personal VM learning journey!) the draft TM351 VM is currently launched and provisioned using vagrant, partly because I can’t seem to start the IPython Notebook reliably from a startup script:-( Distributing the machine as a start/stoppable appliance (i.e. as an Open Virtualization Format/.ova package) might be more convenient, if we could guarantee that file sharing with host works as required (sharing against a specific folder on host) and any port collisions experienced by the provided services can be managed and worked around?

Port collisions are less of an issue for Sayler et al. because their model is that students will be working within the VM context – a “desktop as a local service” (or “platform as a local service” model); the TM351 VM model provides services that run within the VM, some of which are exposed via http to the host – more of a “software as a local service” model. In the cloud, software-as-a-service and desktop-as-a-service models are end-user delivery models, where users access services through a browser or lightweight desktop client, compared with “platform-as-a-service” offerings where applications can be developed and delivered within a managed development environment offering high level support services, or “infrastructure as a service” offerings, which provide access to base computing components (computational processing, storage, networking, etc.)

Note that what interests me particularly are delivery models that support all three of the following models: BYOD, campus lab, and cloud/remotely hosted offerings (as a crude shorthand, I use ‘cloud’ to mean environments that are responsive in terms of firing up servers to meet demand). The notions of personal computing environments, course computing environments and personal course computing environments might also be useful, (for example, a course computing environment might be a generic container populated with course software, a personal course computing container might then be a container linked to a student’s identity, with persisted state and linked storage, or a course container running on a students own device) alongside research computing environments and personal research computing environments.

Rethinking the TM351 Virtual Machine Again, Again…

It’s getting to that time when we need to freeze the virtual machine build we’re going to use for the new (postponed) data course, which should hopefully go live to students in February, 2016, and I’ve been having a rethink about how to put it together.

The story so far has been documented in several blog posts and charts my learning journey from knowing nothing about virtual machines (not sure why I was given the task of putting it together?!) to knowing how little I know about building Linux administration, PostgreSQL, MongoDB, Linux networking, virtual machines and virtualisation (which is to say, knowing I don’t know enough to do any of this stuff properly…;-)

The original plan was to put everything into a single VM and wire all the bits together. One of the activities needed to fire up several containers as part of a mongo replica set, and I opted to use containers to do that.

Over the last few months, I started to wonder whether we should containerise everything separately, then deploy compositions of containers. The rationale behind this approach is that it means we could make use of a single VM to host applications for several users if we get as far as cloud hosting services/applications for out students. It also means students can start, stop or “reinstall” particular applications in isolation from the other VM applications they’re running.

I think I’ve got this working in part now, though it’s still very much tied to the single user – I’m doing things with permissions that would never be allowed (and that would possibly break things..) if we were running multiple users in the same VM.

So what’s the solution? I posted the first hints in Kiteflying Around Containers – A Better Alternative to Course VMs? where I proved to myself I could fire up an IPthyon notebook server on top of scientific distribution stack, and get the notebooks talking to a DBMS running in another container. (This was point and click easy, once you know what to click and what numbers to put where.)

The next step was to see if I could automate this in some way. As Kitematic is still short of a Windows client, and doesn’t (yet?) support Docker Compose, I thought I’d stick with vagrant (which I was using to build the original VM using a Puppet provision and puppet scripts for each app) and see if I could get it provision a VM to run containerised apps using docker. There are still a few bits to do – most notably trying to get the original dockerised mongodb stuff working, checking the mongo link works, working out where to try to persist the DBMS data files (possibly in a shared folder on host?) in a way that doesn’t trash them each time a DBMS container is started, and probably a load of other stuff – but the initial baby steps seem promising…

In the original VM, I wanted to expose a terminal through the browser, which meant pfaffing around with tty.js and node.js. The latest Jupyter server includes the ability to launch a browser based shell client, which meant I could get rid of tty.js. However, moving the IPython notebook into a container means that the terminal presumably has scope only within that container, rather than having access to the base VM command line? For various reasons, I intend to run the IPython/Jupyter notebook server container as a privileged container, which means it can reach outside the container (I think? The reason? eg to fire up containers for the mongo replica set activity) but I’m not sure if this applies to the command line/terminal app too? Though offhand, I can’t think why we might want to provide students with access to the base VM command line?

Anyway, the local set-up looks like this…

A simple Vagrantfile, called using vagrant up or vagrant reload. I have extended vagrant using the vagrant-docker-compose plugin that supports Docker Compose (fig, as was) and lets me fired up wired-together container configurations from a single script:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.network(:forwarded_port, guest: 9000, host: 9000)
  config.vm.network(:forwarded_port, guest: 8888, host: 8351,auto_correct: true)

  config.vm.provision :docker
  config.vm.provision :docker_compose, yml: "/vagrant/docker-compose.yml", rebuild: true, run: "always"
end

The YAML file identifies the containers I want to run and the composition rules between them:

ui:
  image: dockerui/dockerui
  ports:
    - "9000:9000"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  privileged: true

ipynb:
  build: ./tm351_scipystacknserver
  ports:
    - "8888:8888"
  volumes:
    - ./notebooks/:/notebooks/
  links:
    - devpostgres:postgres
  privileged: true
    
devpostgresdata:
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data
 
devpostgres:
    environment:
        - POSTGRES_PASSWORD=whatever
    image: postgres
    ports:
        - "5432:5432"
    volumes_from:
        - devpostgresdata

At the moment, Mongo is still missing and I haven’t properly worked out what to do with the PostgreSQL datastore – the idea is that students will be given a pre-populated, pre-indexed database, in part at least.

One additional component that sort of replaces the command line/terminal app requirement from the original VM is the dockerui app. This runs in its own container with privileged access to the docker environment and that provides a simple control panel over all the containers:

DockerUI

What else? The notebook stuff has a shared notebooks directory with host, and is built locally (from a Dockerfile in the local tm351_scipystacknserver directory) on top of the ipython/scipystack image; extensions include some additional package installations (requiring both apt-get and pip installs) and copying across and running a custom IPython notebook template configuration.

FROM ipython/scipystack

MAINTAINER OU

ADD build_tm351_stack.sh /tmp/build_tm351_stack.sh
RUN bash /tmp/build_tm351_stack.sh


ADD ipynb_style /tmp/ipynb_style
ADD ipynb_custom.sh /tmp/ipynb_custom.sh
RUN bash /tmp/ipynb_custom.sh


## Extremely basic test of install
RUN python2 -c "import psycopg2, sqlalchemy"
RUN python3 -c "import psycopg2, sqlalchemy"

# Clean up from build
RUN rm -f /tmp/build_tm351_stack.sh
RUN rm -f /tmp/ipynb_custom.sh
RUN rm -f -r /tmp/ipynb_style

VOLUME /notebooks
WORKDIR /notebooks

EXPOSE 8888

ADD notebook.sh /
RUN chmod u+x /notebook.sh

CMD ["/notebook.sh"]

demo-tm351

If we need to extend the PostgreSQL build, that can be presumably done using a Dockerfile that pulls in the core image and then runs an additional configuration script over it?

So where am I at? No f****g idea. I thought that between the data course and the new web apps course we might be able to explore some interesting models of using virtual machines (originally) and containers (more recently) in a distance education setting, that could cope with single user home use, computer training room/lab use, cloud use, but, as ever, I have spectacularly failed to demonstrate any sort of “academic leadership” in developing these ideas within the OU, or even getting much of a conversation going in the first place. Not in my skill set, I guess!;-) Though perhaps not in the institution’s interests either. Recamp. Retrench. Lockdown. As per some of the sentiments in Reflections on the Closure of Yahoo Pipes, perhaps? Don’t Play Here.