Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?

A week or so ago I came across a couple of IPython notebooks produced by Catherine Devlin covering the maintenance and tuning of a PostgreSQL server: DB Introspection Notebook (example 1: introspection, example 2: tuning, example 3: performance checklist). One of the things we have been discussing in the TM351 course team meetings is the extent to which we “show our working” to students in terms of how the virtual machine and the various databases used in the course were put together, even if we don’t actually teach that stuff.

Notebooks make an ideal way of documenting the steps taken to set up a particular system, blending commentary with command line as well as code executable cells.

The various approaches I’ve explored to build the VM have arguably been over-complex – vagrant, puppet, docker and docker-compose – but I’ve always seen the OU as a place where we explore the technologies we’re teaching – or might teach – in the context of both course production and course delivery (that is, we can often use a reflexive approach whereby the content of the teaching also informs the development and delivery of the teaching).

In contrast, in A DevOps Approach to Common Environment Educational Software Provisioning and Deployment I referred to a couple of examples of a far simpler approach, in which common research, or research and teaching, VM environments were put together using simple scripts. This approach is perhaps more straightforwardly composable, in that if someone has additional requirements of the machine, they can just add a simple configuration script to bring in those elements.

In our current course example, where the multiple authors have a range of skill and interest levels when it comes to installing software and exploring different approaches to machine building, I’m starting to wonder whether I should have started with a simple base machine running just an IPython notebook server and no additional libraries or packages, and then created a series of notebooks, one for each part of the course (which broadly breaks down to one part per author), containing instructions for installing all the bits and pieces required for just that part of the course. If there’s duplication across parts, trying to install the same thing for each part, that’s fine – the various package managers should be able to cope with that. (The only issue would arise if different authors needed different versions of the same package, for some reason, and I’m not sure what we’d do in that case?)

The notebooks would then include explanatory documentation and code cells to install Linux packages and python packages. Authors could take over the control of setup notebooks, or just make basic requests. At some level, we might identify a core offering (for example, in our course, this might be the inclusion of the pandas package) that might be pushed up into a core configuration installation notebook executed prior to the installation notebook for each part.

Configuring the machine would then be a case of running the separate configuration notebooks for each part (perhaps preceded by a core configuration notebook), perhaps by automated means. For example: ipython nbconvert --to=html --ExecutePreprocessor.enabled=True configNotebook_1.ipynb [via StackOverflow]. This generates an output HTML report from running the code cells in the notebook (which can include command line commands) in a headless IPython process (I think!).

The following switch may also be useful (it clears the output cells): ipython nbconvert --to=pdf --ExecutePreprocessor.enabled=True --ClearOutputPreprocessor.enabled=True RunMe.ipynb (note in this case we generate a PDF report).
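As a rough sketch of how the "automated means" might look, the following Python helper builds the same nbconvert command lines quoted above and runs them one notebook at a time. The function names are my own invention, and I am assuming that nbconvert exits with a non-zero status when execution fails (which is what would make the loop stop early); I haven't verified that behaviour.

```python
import subprocess


def nbconvert_command(notebook, to="html", clear_output=False):
    """Build the command line used to execute one configuration notebook."""
    cmd = [
        "ipython", "nbconvert",
        "--to={}".format(to),
        "--ExecutePreprocessor.enabled=True",
    ]
    if clear_output:
        cmd.append("--ClearOutputPreprocessor.enabled=True")
    cmd.append(notebook)
    return cmd


def run_config_notebooks(notebooks, runner=subprocess.check_call):
    """Run each notebook in order and return the ones that completed.

    subprocess.check_call raises CalledProcessError on a non-zero exit
    status, so later notebooks are skipped if an earlier one fails.
    """
    completed = []
    for notebook in notebooks:
        runner(nbconvert_command(notebook))
        completed.append(notebook)
    return completed
```

A core configuration notebook followed by the per-part notebooks would then just be a list passed to run_config_notebooks in the required order.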

To build the customised VM box, the following route should work:

– set up a simple Vagrant file to import a base box
– install IPython into the box
– copy the configuration notebooks into the box
– run the configuration notebooks
– export the customised box
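
The route above can be sketched as an ordered list of shell commands. This is only a plan generator (nothing is executed), and several details are assumptions on my part: the ubuntu/trusty64 base box, the tm351.box output name, the ~/config folder and the pip install line are all placeholders for whatever we would actually use.

```python
NBCONVERT = "ipython nbconvert --to=html --ExecutePreprocessor.enabled=True"


def build_plan(notebooks, base_box="ubuntu/trusty64", box_file="tm351.box",
               install_cmd="sudo pip install ipython[notebook]"):
    """Return the ordered shell commands for the box-building route.

    base_box, box_file and install_cmd are illustrative assumptions.
    """
    plan = [
        # 1. create a simple Vagrantfile that imports a base box, then boot it
        "vagrant init {}".format(base_box),
        "vagrant up",
        # 2. install IPython into the box
        "vagrant ssh -c '{}'".format(install_cmd),
        # 3. copy the notebooks in (vagrant shares the project folder at /vagrant)
        "vagrant ssh -c 'mkdir -p ~/config && cp /vagrant/*.ipynb ~/config/'",
    ]
    # 4. run each configuration notebook in turn
    for notebook in notebooks:
        plan.append(
            "vagrant ssh -c 'cd ~/config && {} {}'".format(NBCONVERT, notebook)
        )
    # 5. export the customised box
    plan.append("vagrant package --output {}".format(box_file))
    return plan
```

Printing the plan for a couple of hypothetical notebooks, for example build_plan(["core.ipynb", "part1.ipynb"]), gives something that could be pasted into a shell script or reviewed before running by hand.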

This approach has the benefits of using simple, literate configuration scripts described within a notebook. This makes them perhaps a little less “hostile” than shell scripts, and perhaps makes it easier to build in tests inline, and report on them nicely. (If running a cell results in an error, I think the execution of the notebook will stop at that point?) The downside is that to run the notebooks, we also need to have IPython installed first.

Distributing Software to Students in a BYOD Environment

Reading around a variety of articles on the various ways of deploying software in education, it struck me that in traditional institutions a switch may be taking place from students making use of centrally provided computing services – including physical access to desktop computers – to students bringing their own devices on which they may want to run the course software themselves. In addition, traditional universities are also starting to engage increasingly with their own distance education students; and the rise of the MOOCs is based around the idea of online course provision – that is, distance education.

The switch from centrally provided computers to a BYOD regime contrasts with the traditional approach in distance education, in which students provided their own devices, onto which they installed software packaged and provided by their educational institution. That is, distance education students have traditionally been BYOD users.

However, in much the same way that the library in a distance education institution like the OU could not originally provide physical information (book lending) services to students, instead brokering access agreements with other HE libraries, but now can provide a traditional library service through access to digital collections, academic computing services are perhaps now more in a position where they can provide central computing services, at scale, to their students. (Contributory factors include: readily available network access for students, cheaper provider infrastructure costs (servers, storage, bandwidth, etc).)

With this in mind, it is perhaps instructive for those of us working in distance education to look at how the traditional providers are coping with an influx of BYOD users, and how they are managing access to, and the distribution of, software to this newly emerging class of user (for them) whilst at the same time continuing to provide access to managed facilities such as computing labs and student accessed machines.


Notes from: Supporting CS Education via Virtualization and Packages – Tools for Successfully Accommodating “Bring-Your-Own-Device” at Scale, Andy Sayler, Dirk Grunwald, John Black, Elizabeth White, and Matthew Monaco SIGCSE’14, March 5–8, 2014, Atlanta, GA, USA [PDF]

The authors describe “a standardized development environment for all core CS courses across a range of both school-owned and student-owned computing devices”, leveraging “existing off-the-shelf virtualization and software management systems to create a common virtual machine that is used across all of our core computer science courses”. The goal was to “provide students with an easy to install and use development environment that they could use across all their CS courses. The development environment should be available both on department lab machines, and as a VM for use on student-owned machines (e.g. as a ‘lab in a box’).”

From the student perspective, our solution had to: a) Run on a range of host systems; b) Be easy to install; c) Be easy to use and maintain; d) Minimize side-effects on the host system; e) Provide a stable experience throughout the semester.

From the instructor perspective, our solution had to: a) Keep the students happy; b) Minimize instructor IT overhead; c) Provide consistent results across student, grader, and instructor machines; d) Provide all necessary software for the course; e) Provide the ability to update software as the course progresses.

Virtualbox was adopted on the grounds that it runs cross-platform, is free, open source software, and has good support for running Linux guest machines. The VM was based on Ubuntu 12.04 (presumably the long term edition available at the time) and distributed as an .ova image.

To support the distribution of software packages for a particular course, Debian metapackages (that simply list dependencies; in passing, I note that the Anaconda python distribution supports the notion of python (conda) metapackages, but pip does not, specifically?) were created on a per course basis that could be used via apt-get to install all the necessary packages required for a particular course (example package files).
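
To make the metapackage idea concrete, here is a minimal sketch that renders the control stanza for a dependency-only Debian package. The package name, maintainer and dependency list are hypothetical, and I haven't checked this against the authors' own example package files; a tool such as equivs or dpkg-deb would still be needed to turn the stanza into an installable .deb.

```python
def metapackage_control(name, version, depends, maintainer, description):
    """Render a Debian control stanza for a package that only lists dependencies."""
    lines = [
        "Package: {}".format(name),
        "Version: {}".format(version),
        "Section: metapackages",
        "Priority: optional",
        # Nothing is compiled, so the package is architecture independent
        "Architecture: all",
        "Maintainer: {}".format(maintainer),
        # apt-get resolves and installs everything listed here
        "Depends: {}".format(", ".join(depends)),
        "Description: {}".format(description),
    ]
    return "\n".join(lines) + "\n"


# Hypothetical course package; every value below is a placeholder
control = metapackage_control(
    name="cs-course-example",
    version="1.0",
    depends=["gcc", "gdb", "make"],
    maintainer="Course Team <course-team@example.org>",
    description="Tools required for an example course",
)
```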

In terms of student support, the team published “a central web-page that provides information about the VM, download links, installation instructions, common troubleshooting steps, and related self-help information” along with “YouTube videos describing the installation and usage of the VM”. Initial distribution is provided using BitTorrent. Where face-to-face help sessions are required, VM images are provided on USB memory sticks to avoid download time delays. Backups are handled by bundling Dropbox into the VM and encouraging students to place their files there. (Github is also used.)

The following observation is useful in respect of student experience of VM performance:

“Modern CPUs provide extensions that enable a fast, smooth and enjoyable VM experience (i.e. VT-x). Unfortunately, many non-Apple PC manufacturers ship their machines with these extension disabled in the BIOS. Getting students to enable these extensions can be a challenge, but makes a big difference in their overall impression of VM usability. One way to force students to enable these extensions is to use a 64-bit and/or multi-core VM, which VirtualBox will not start without virtualization extensions enabled.”

The open issues identified by the team are the issue of virtualisation support; corrupted downloads of the VM (mitigation includes publishing a checksum for the VM and verifying against this); and the lack of a computer capable of running the VM (ARM devices, low specification Intel Atom computers). [On this latter point, it may be worth highlighting the distinction between hardware that cannot cope with running computationally intensive applications, hardware that has storage limitations, and hardware that cannot run particular virtualisation services (for example, that cannot run x86 virtualisation). See also: What Happens When “Computers” Are Replaced by Tablets and Phones?]
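
The checksum mitigation is easy to sketch. The paper (as summarised here) doesn't say which hash function is published, so SHA-256 is my assumption, as are the function names.

```python
import hashlib


def sha256_of(path, chunk_size=1024 * 1024):
    """Hash a (possibly very large) file in chunks to keep memory use flat."""
    digest = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, published_checksum):
    """Compare a downloaded VM image against the checksum published with it."""
    return sha256_of(path) == published_checksum.strip().lower()
```

A student (or an install script) would call download_is_intact with the path to the downloaded .ova file and the checksum copied from the course web page before trying to import the image.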


The idea of using package management is attractive, and contrasts with the approach I took when hacking together the TM351 VM using vagrant and puppet scripts. It might make sense to further abstract the machine components into a Debian metapackage and a simple python/pip “meta” package (i.e. one that simply lists dependencies). The result would be an installation reduced to a couple of lines of the form:

apt-get install ou_tm351=15J.0
pip install ou_tm351==15J.0

where packages are versioned to a particular presentation of an OU course, with a minor version number to accommodate any updates/patches. One downside to this approach is that it splits co-dependency relationships between python and Debian packages relative to a particular application. In the current puppet build files for the monolithic VM build, each application has its own puppet file that installs the additional libraries over base libraries required for a particular application. (In addition, particular applications can specify dependencies on base libraries.) For the dockerised VM build, each container image has its own Dockerfile that identifies the dependencies for that image.
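
For the python side, a "meta" package is really just a setup script with no modules and a list of requirements. The sketch below builds the keyword arguments that would be passed to setuptools.setup(); the dependency list is a guess based on packages mentioned elsewhere in these posts. One caveat, as far as I know: a label such as 15J.0 does not follow the PEP 440 version scheme that current packaging tools expect, so the sketch uses a hypothetical numeric stand-in.

```python
def meta_package_metadata(name, version, requirements):
    """Return setup() keyword arguments for a dependency-only package."""
    return {
        "name": name,
        "version": version,
        # No modules of our own: installing the package only pulls in dependencies
        "packages": [],
        "install_requires": sorted(requirements),
    }


# Hypothetical dependency list and version number, for illustration only
TM351_REQUIREMENTS = ["sqlalchemy", "pandas", "psycopg2"]
metadata = meta_package_metadata("ou_tm351", "15.0", TM351_REQUIREMENTS)

# In a real setup.py this would be passed on as: setuptools.setup(**metadata)
```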

Tracing its history (and reflecting the accumulated clutter of my personal VM learning journey!) the draft TM351 VM is currently launched and provisioned using vagrant, partly because I can’t seem to start the IPython Notebook reliably from a startup script:-( Distributing the machine as a start/stoppable appliance (i.e. as an Open Virtualization Format/.ova package) might be more convenient, if we could guarantee that file sharing with host works as required (sharing against a specific folder on host) and any port collisions experienced by the provided services can be managed and worked around?

Port collisions are less of an issue for Sayler et al. because their model is that students will be working within the VM context – a “desktop as a local service” (or “platform as a local service”) model; the TM351 VM model provides services that run within the VM, some of which are exposed via http to the host – more of a “software as a local service” model. In the cloud, software-as-a-service and desktop-as-a-service models are end-user delivery models, where users access services through a browser or lightweight desktop client, compared with “platform-as-a-service” offerings where applications can be developed and delivered within a managed development environment offering high level support services, or “infrastructure as a service” offerings, which provide access to base computing components (computational processing, storage, networking, etc.)
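
One way a launcher for an appliance-style VM might work around port collisions is to probe the host before deciding which port to forward a service to. This is a minimal sketch under my own assumptions (the function names, the localhost-only check and the search window are all made up), not something the current build does.

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if nothing on the host is already bound to the port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
        try:
            probe.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(preferred, attempts=50, host="127.0.0.1"):
    """Return the preferred port if it is free, else the next free one above it."""
    for port in range(preferred, min(preferred + attempts, 65536)):
        if port_is_free(port, host):
            return port
    raise RuntimeError("no free port found near {}".format(preferred))
```

The student would then need to be told which port was actually chosen, which is the awkward part: any links in the course materials that assume a fixed port would no longer work.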

Note that what interests me particularly are delivery models that support all three of the following models: BYOD, campus lab, and cloud/remotely hosted offerings (as a crude shorthand, I use ‘cloud’ to mean environments that are responsive in terms of firing up servers to meet demand). The notions of personal computing environments, course computing environments and personal course computing environments might also be useful (for example, a course computing environment might be a generic container populated with course software; a personal course computing container might then be a container linked to a student’s identity, with persisted state and linked storage, or a course container running on a student’s own device), alongside research computing environments and personal research computing environments.

What Happens When “Computers” Are Replaced by Tablets and Phones?

With personal email services managed online since what feels like forever (and probably is “forever”, for many users), personally accessed productivity apps delivered via online services (perhaps with some minimal support for in-browser, offline use) – things like Microsoft Office Online or Google Docs – video and music services provided via online streaming services, rather than large file downloads, image galleries stored in the cloud and social networking provided exclusively online, and in the absence of data about connecting devices (which is probably available from both OU and OU-owned FutureLearn server logs), I wonder if the OU strategists and curriculum planners are considering a future where a significant percentage of OUr distance education students do not have access to a “personal (general purpose) computer” onto which arbitrary software applications can be installed rather than from which they can simply be accessed, but do have access to a network connection via a tablet device, and perhaps a wireless keyboard?

And if the learners do have access to a desktop or laptop computer, what happens if that is likely to be a works machine, or perhaps a public access desktop computer (though I’m not sure how much longer they will remain around), probably with administrative access limits on it (if the OU IT department’s obsession with minimising general purpose and end-user defined computing is anything to go by…)

If we are to require students to make use of “installed software” rather than software that can be accessed via browser based clients/user interfaces, then we will need to ask the access question: is it fair to require students to buy a desktop computer onto which software can be installed purely for the purposes of their studies, given they presumably have access otherwise to all the (online) digital services they need?

I seem to recall that the OU’s student computing requirements are now supposed to be agnostic as to operating system (the same is not true internally, unfortunately, where legacy systems still require Windows and may even require obsolete versions of IE!;-) although the general guidance on the matter is somewhat vague and perhaps not a little out of date…?!

I wish I’d kept copies of OU computing (and network) requirements over the years. Today, network access is likely to come in the form of either wired, fibre, or wireless broadband access (the latter particularly in rural areas (example)) or, for the cord-cutters, a mobile/3G-4G connection; personal computing devices that connect to the network are likely to be smartphones, tablets, laptop computers, Chromebooks and their ilk, and gaming desktop machines. Time was when a household was lucky to have a single personal desktop computer, a requirement that became expected of OU students. I suspect that is still largely true… (the yoof’s gaming machine; the 10 year old “office” machine).

If we require students to run “desktop” applications, should we then require the students to have access to computers capable of installing those applications on their own computer, or should we be making those applications available in a way that allows them to be installed and run anywhere – either on local machines (for offline use), or on remote machines (either third party managed or managed by the OU) where a network connection is more or less always guaranteed?

One of the reasons I’m so taken by the idea of containerised computing is that it provides us with a mechanism for deploying applications to students that can be run in a variety of ways. Individuals can run the applications on their own computers, in the cloud, via service providers accessed and paid for directly by the students on a metered basis, or by the OU.

Container contents can be very strictly version controlled and archived, and are easily restored if something should go wrong (there are various ‘switch-it-off-and-switch-it-on-again’ possibilities with several degrees of severity!) Container image files can be distributed using physical media (USB memory sticks, memory cards) for local use, and for OU cloud servers, at least, those images could be pre-installed on student accessed container servers (meaning the containers can start up relatively quickly…)

If updates are required, these are likely to be lightweight – only those bits of the application that need updating will be updated.

At the moment, I’m not sure how easy it is to arbitrarily share a data container containing a student’s work with application containers that are arbitrarily launched on various local and remote hosts? (Linking containers to Dropbox containers is one possibility, but they would perhaps be slow to synch? Flocker is perhaps another route, with its increased emphasis on linked data container management?)

If any other educational institutions, particularly those involved in distance education, are looking at using containers, I’d be interested to hear what your take is…

And if any folk in the OU are looking at containers in any context (teaching, research, project work), please get in touch – I need folk to bounce ideas around with, sanity check with, and ask for technical help!;-)

Notebooks, knitr and the Language-Markdown View Source Option…

One of the foundational principles of the web, though I suspect ever fewer people know it, is that you can “View Source” on a web page to see what bits of HTML, Javascript and CSS are used to create it.

In the WordPress editor I’m currently writing in, I’m using a Text view that lets me write vanilla HTML; but there is also a WYSIWYG (what you see is what you get) view that shows how the interpreted HTML text will look when it is rendered in the browser as a web page.

[Image: viewtext]

Reflecting on IPython Markdown Opportunities in IPython Notebooks and Rstudio, it struck me that the Rmd (Rmarkdown) view used in RStudio, the HTML preview of “executed” Rmd documents generated from Rmd by knitr and the interactive Jupyter (IPython, as was) notebook view can be seen as standing in this sort of relation to each other:

[Image: rmd-wysiwyg]

From that, it’s not too hard to imagine RStudio offering the following sort of RStudio/IPython notebook hybrid interface – with an Rmd “text” view, and with a notebook “visual” view (eg via an R notebook kernel):

[Image: viewrmd]

And from both, we can generate the static HTML preview view.

In terms of underlying machinery, I guess we could have something like this:

[Image: rmdviewarch]

I’m looking forward to it:-)

Rethinking the TM351 Virtual Machine Again, Again…

It’s getting to that time when we need to freeze the virtual machine build we’re going to use for the new (postponed) data course, which should hopefully go live to students in February, 2016, and I’ve been having a rethink about how to put it together.

The story so far has been documented in several blog posts and charts my learning journey from knowing nothing about virtual machines (not sure why I was given the task of putting it together?!) to knowing how little I know about Linux administration, PostgreSQL, MongoDB, Linux networking, virtual machines and virtualisation (which is to say, knowing I don’t know enough to do any of this stuff properly…;-)

The original plan was to put everything into a single VM and wire all the bits together. One of the activities needed to fire up several containers as part of a mongo replica set, and I opted to use containers to do that.

Over the last few months, I started to wonder whether we should containerise everything separately, then deploy compositions of containers. The rationale behind this approach is that it means we could make use of a single VM to host applications for several users if we get as far as cloud hosting services/applications for our students. It also means students can start, stop or “reinstall” particular applications in isolation from the other VM applications they’re running.

I think I’ve got this working in part now, though it’s still very much tied to the single user – I’m doing things with permissions that would never be allowed (and that would possibly break things..) if we were running multiple users in the same VM.

So what’s the solution? I posted the first hints in Kiteflying Around Containers – A Better Alternative to Course VMs? where I proved to myself I could fire up an IPython notebook server on top of a scientific distribution stack, and get the notebooks talking to a DBMS running in another container. (This was point and click easy, once you know what to click and what numbers to put where.)

The next step was to see if I could automate this in some way. As Kitematic is still short of a Windows client, and doesn’t (yet?) support Docker Compose, I thought I’d stick with vagrant (which I was using to build the original VM using a Puppet provisioner and puppet scripts for each app) and see if I could get it to provision a VM to run containerised apps using docker. There are still a few bits to do – most notably trying to get the original dockerised mongodb stuff working, checking the mongo link works, working out where to try to persist the DBMS data files (possibly in a shared folder on host?) in a way that doesn’t trash them each time a DBMS container is started, and probably a load of other stuff – but the initial baby steps seem promising…

In the original VM, I wanted to expose a terminal through the browser, which meant pfaffing around with tty.js and node.js. The latest Jupyter server includes the ability to launch a browser based shell client, which meant I could get rid of tty.js. However, moving the IPython notebook into a container means that the terminal presumably has scope only within that container, rather than having access to the base VM command line? For various reasons, I intend to run the IPython/Jupyter notebook server container as a privileged container, which means it can reach outside the container (I think? The reason? eg to fire up containers for the mongo replica set activity) but I’m not sure if this applies to the command line/terminal app too? Though offhand, I can’t think why we might want to provide students with access to the base VM command line?

Anyway, the local set-up looks like this…

A simple Vagrantfile, called using vagrant up or vagrant reload. I have extended vagrant using the vagrant-docker-compose plugin that supports Docker Compose (fig, as was) and lets me fire up wired-together container configurations from a single script:

# -*- mode: ruby -*-
# vi: set ft=ruby :

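Vagrant.configure("2") do |config|
  # Base box: 64-bit Ubuntu 14.04
  config.vm.box = "ubuntu/trusty64"

  # DockerUI control panel
  config.vm.network(:forwarded_port, guest: 9000, host: 9000)
  # Notebook server; auto_correct lets vagrant pick another host port on a collision
  config.vm.network(:forwarded_port, guest: 8888, host: 8351, auto_correct: true)

  # Install docker, then bring up the containers described in docker-compose.yml
  config.vm.provision :docker
  config.vm.provision :docker_compose, yml: "/vagrant/docker-compose.yml", rebuild: true, run: "always"
end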

The YAML file identifies the containers I want to run and the composition rules between them:

ui:
  image: dockerui/dockerui
  ports:
    - "9000:9000"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  privileged: true

ipynb:
  build: ./tm351_scipystacknserver
  ports:
    - "8888:8888"
  volumes:
    - ./notebooks/:/notebooks/
  links:
    - devpostgres:postgres
  privileged: true

devpostgresdata:
  command: echo created
  image: busybox
  volumes:
    - /var/lib/postgresql/data

devpostgres:
  environment:
    - POSTGRES_PASSWORD=whatever
  image: postgres
  ports:
    - "5432:5432"
  volumes_from:
    - devpostgresdata
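
Inside the ipynb container, the devpostgres:postgres link should make the database reachable under the hostname postgres, and (if I understand the legacy link mechanism correctly) also inject a set of POSTGRES_* environment variables. The sketch below builds a connection URL from those, falling back to the alias and the default port. The function name is made up, and the default user and database name of postgres are assumptions about the stock postgres image.

```python
import os


def linked_postgres_url(env=None, alias="postgres", user="postgres",
                        database="postgres"):
    """Build a PostgreSQL connection URL from docker link settings."""
    env = os.environ if env is None else env
    prefix = alias.upper()
    # Link-injected variables, with fallbacks if they are not set
    host = env.get("{}_PORT_5432_TCP_ADDR".format(prefix), alias)
    port = env.get("{}_PORT_5432_TCP_PORT".format(prefix), "5432")
    password = env.get("{}_ENV_POSTGRES_PASSWORD".format(prefix), "")
    credentials = user if not password else "{}:{}".format(user, password)
    return "postgresql://{}@{}:{}/{}".format(credentials, host, port, database)
```

In a notebook, the resulting string could be handed to something like sqlalchemy.create_engine(), although I haven't tried that against this particular composition yet.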

At the moment, Mongo is still missing and I haven’t properly worked out what to do with the PostgreSQL datastore – the idea is that students will be given a pre-populated, pre-indexed database, in part at least.

One additional component that sort of replaces the command line/terminal app requirement from the original VM is the dockerui app. This runs in its own container with privileged access to the docker environment and that provides a simple control panel over all the containers:

[Image: DockerUI]

What else? The notebook stuff has a shared notebooks directory with host, and is built locally (from a Dockerfile in the local tm351_scipystacknserver directory) on top of the ipython/scipystack image; extensions include some additional package installations (requiring both apt-get and pip installs) and copying across and running a custom IPython notebook template configuration.

FROM ipython/scipystack

MAINTAINER OU

ADD build_tm351_stack.sh /tmp/build_tm351_stack.sh
RUN bash /tmp/build_tm351_stack.sh


ADD ipynb_style /tmp/ipynb_style
ADD ipynb_custom.sh /tmp/ipynb_custom.sh
RUN bash /tmp/ipynb_custom.sh


## Extremely basic test of install
RUN python2 -c "import psycopg2, sqlalchemy"
RUN python3 -c "import psycopg2, sqlalchemy"

# Clean up from build
RUN rm -f /tmp/build_tm351_stack.sh
RUN rm -f /tmp/ipynb_custom.sh
RUN rm -f -r /tmp/ipynb_style

VOLUME /notebooks
WORKDIR /notebooks

EXPOSE 8888

ADD notebook.sh /
RUN chmod u+x /notebook.sh

CMD ["/notebook.sh"]
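
The "extremely basic test" in the Dockerfile fails on the first missing import without saying much else. A slightly more informative check, sketched here with a made-up helper name, reports every module that cannot be imported:

```python
import importlib


def missing_modules(names):
    """Return the subset of module names that cannot be imported."""
    missing = []
    for name in names:
        try:
            importlib.import_module(name)
        except ImportError:
            missing.append(name)
    return missing
```

Saved as a script that exits with a non-zero status whenever missing_modules(["psycopg2", "sqlalchemy"]) is non-empty, this could replace the two RUN python -c lines, with the list extended as other authors add their requirements.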

[Image: demo-tm351]

If we need to extend the PostgreSQL build, that can be presumably done using a Dockerfile that pulls in the core image and then runs an additional configuration script over it?

So where am I at? No f****g idea. I thought that between the data course and the new web apps course we might be able to explore some interesting models of using virtual machines (originally) and containers (more recently) in a distance education setting, that could cope with single user home use, computer training room/lab use, cloud use, but, as ever, I have spectacularly failed to demonstrate any sort of “academic leadership” in developing these ideas within the OU, or even getting much of a conversation going in the first place. Not in my skill set, I guess!;-) Though perhaps not in the institution’s interests either. Recamp. Retrench. Lockdown. As per some of the sentiments in Reflections on the Closure of Yahoo Pipes, perhaps? Don’t Play Here.

Kiteflying Around Containers – A Better Alternative to Course VMs?

Eighteen months or so ago, I started looking at ways in which we might use a virtual machine to bundle up a variety of interoperating software applications for a distance education course on databases and data management. (This VM would run IPython notebooks as the programming surface, PostgreSQL and MongoDB as the databases. I was also keen that OpenRefine should be made available, and as everything in the VM was being accessed via a browser, I added a browser based terminal app (tty.js) to the mix as well). The approach I started to follow was to use vagrant as a provisioner and VM manager, and puppet scripts to build the various applications. One reason for this approach is that the OU is an industrial scale educator, and (to my mind) it made sense to explore a model that would support the factory line production model we have in a way that would scale vertically as a way of maintaining VMs for a course that runs over several years as well as horizontally across other courses with other software application requirements. You can see how my thinking evolved across the following posts: posts tagged “VM” on OUseful.info.

Since then, a lot has changed. IPython notebooks have forked into the Jupyter notebook server and IPython, and Jupyter has added a browser based terminal app to the base offerings of the notebook server. (It’s not as flexible as tty.js, which allowed for multiple terminals in the same browser window, but I guess there’s nothing to stop you loading multiple terminals into separate browser tabs.) docker has also become a thing…

To recap on some of my thinking about how we might provide software to students, I was pre-occupied at various times with the following (not necessarily exhaustive) list of considerations:

  • how could we manage the installation and configuration of different software applications on students’ self-managed, remote computers, running arbitrary versions of arbitrary operating systems on arbitrarily specced machines over networks with unknown and perhaps low bandwidth internet connections;
  • how could we make sure those applications interoperated correctly on the students’ own machines;
  • how could we make sure the students retained access to local copies of all the files they had created as part of their studies, and that those local copies would be the ones they actually worked on in the provided software applications; (so for example, IPython notebook files, and perhaps even database data directories);
  • how could we manage the build of each application in the OU production context, with OU course teams requiring access to a possibly evolving version of the machine 18 months in advance of student first use date and an anticipated ‘gold master’ freeze date on elements of the software build ~9 months prior to students’ first use;
  • how could we manage the maintenance of VMs within a single presentation of a 9 month long course and across several presentations of the course spanning 1 presentation a year over a 5 year period;
  • how could the process support the build and configuration of the same software application for several courses (for example, an OU-standard PostgreSQL build);
  • how could the same process/workflow support the development, packaging, release to students, maintenance workflow for other software applications for other courses;
  • could the same process be used to manage the deployment of application sets to students on a cloud served basis, either through a managed OU cloud, or on a self-served basis, perhaps using an arbitrary cloud service provider.

All this bearing in mind that I know nothing about managing software packaging, maintenance and deployment in any sort of environment, let alone a production one…;-) And all this bearing in mind that I don’t think anybody else really cares about any of the above…;-)

Having spent a few weeks away from the VM, I’m now thinking that we would be better served by using a more piecemeal approach based around docker containers. These still require the use of something like Virtualbox, but rather than using vagrant to provision the necessary environment, we could use more of an appstore approach to starting and stopping services. So for example, today I had a quick play with Kitematic, a recent docker acquisition, and an app that doesn’t run on Windows yet but for which Windows support is slated for June, 2015 in the Kitematic roadmap on github.

So what’s involved? Install Kitematic (if Virtualbox isn’t already installed, I think it’ll grab it down for you?) and fire it up…

Kitematic_1

It starts up a dockerised virtual machine into which you can install various containers. Next up, you’re presented with an “app dashboard”, as well as the ability to search dockerhub for additional “apps”:

Kitematic_2

Find a container you want, and select it – this will download the required components and fire up the container.

Kitematic_3

The port tells you where you can find any services exposed by the container. In this case, for scipyserver, it’s an IPython notebook (HTML app) running on top of a scipy stack.

Kitematic_4

By default the service runs over https with a default password; we can go into the Settings for the container, reset the Jupyter server password, force it to use http rather than https, and save to force the container to use the new settings:

Kitematic_5

So for example…

kitematic_ipynb

In the Kitematic container homepage, if I click on the notebooks folder icon in the Edit Files panel, I can share the notebook folder across to my host machine:

scipyserver_share

I can also choose what directory on host to use as the shared folder:

Kitematic_7

I can also discover and fire up some other containers – a PostgreSQL database, for example, as well as a MongoDB database server:

Kitematic_6

From within my notebook, I can install additional packages and libraries and then connect to the databases. So for example, I can connect to the PostgreSQL database:

kitematic_ipynb_postgres

or to mongo:

kitematic_ipynb_mongodb

Looking at the container Edit Files settings, it looks like I may also be able to share across the database datafiles – though I’m not sure how this would work if I had a default database configuration to begin with? (Working out how to pre-configure and then share database contents from containerised DBMSs is something that’s puzzled me for a bit and something I haven’t got my head round yet.)

So – how does this fit into the OU model (that doesn’t really exist yet?) for using VMs to make interoperating software collections available to students on their own machines?

First up, no Windows support at the moment, though that looks like it’s coming; secondly, the ability to mount shares with host seems to work, though I haven’t tested what happens if you shut down and start up containers, or delete a scipyserver container and then fire up a clean replacement, for example. Nor do I know (yet?!) how to manage shares and pre-seeding for the database containers. One original argument for the VM was that interoperability between the various software applications could be hardwired and tested. Kitematic doesn’t support fig/Docker compose (yet?) but it’s not too hard to look up the addresses and paste them into a notebook. I think it does mean we can’t provide hard coded notebooks with ‘guaranteed to work’ configurations (i.e. ones prewritten with service addresses and port numbers) baked in, but it’s not too hard to do this manually. In the docker container Dockerfiles, I’m not sure if we could fix the port number mappings to initial default values?
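
As a rough idea of what the “paste the addresses into a notebook” step might look like, here’s a minimal sketch of a settings cell that a student edits once at the top of a notebook. The SERVICES dictionary, the helper function names and the 192.168.99.100 address are all made up for illustration; the real host and port values would be whatever Kitematic reports for each running container.

```python
from urllib.parse import quote

# Hypothetical settings: replace the host and port values with whatever
# Kitematic reports for each running container.
SERVICES = {
    "postgres": {"host": "192.168.99.100", "port": 5432,
                 "user": "postgres", "password": "PGPass"},
    "mongodb": {"host": "192.168.99.100", "port": 27017},
}

def postgres_url(cfg):
    """Build a postgresql:// URL of the kind used with the sql magic."""
    # Quote the credentials so characters such as '@' or ':' do not break the URL
    user = quote(cfg["user"], safe="")
    password = quote(cfg["password"], safe="")
    return "postgresql://{}:{}@{}:{}".format(user, password, cfg["host"], cfg["port"])

def mongodb_url(cfg):
    """Build a mongodb:// URL for a server that needs no login."""
    return "mongodb://{}:{}".format(cfg["host"], cfg["port"])

print(postgres_url(SERVICES["postgres"]))
print(mongodb_url(SERVICES["mongodb"]))
```

The rest of the notebook would then only ever refer to SERVICES, so if a container comes back on a different port there is just one cell to change.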

One thing we’d originally envisioned for the VM was shipping it on a USB stick. It would be handy to be able to point Kitematic to a local dockerhub, for example, a set of prebuilt containers on a USB stick with the necessary JSON metadata file to announce what containers were available there, so that containers could be installed from the USB stick. (Kitematic currently grabs the container elements down from dockerhub and pops the layers into the VM (I assume?), so it could do the same to grab them from the USB stick?) In the longer term, I could imagine an OU branded version of Kitematic that allows containers to be installed from a USB stick or pulled down from an OU hosted dockerhub.
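
To make the USB stick idea a little more concrete, here’s a sketch of what the JSON metadata file might contain and how a launcher could read it. Everything here is hypothetical: the manifest layout, the field names (name, image, archive), the mount point and the list_containers function are my own inventions, and as far as I know Kitematic has no such feature. I’m also assuming the archives would be tar files of the sort that docker save produces.

```python
import json
from pathlib import PurePosixPath

# Hypothetical manifest contents; none of these field names come from
# Kitematic or dockerhub.
EXAMPLE_MANIFEST = """
{
  "containers": [
    {"name": "notebook", "image": "ipython/scipyserver", "archive": "images/scipyserver.tar"},
    {"name": "database", "image": "postgres", "archive": "images/postgres.tar"}
  ]
}
"""

def list_containers(manifest_text, stick_root):
    """Return (name, image, archive path) tuples for each container in the manifest."""
    manifest = json.loads(manifest_text)
    root = PurePosixPath(stick_root)
    # Archive paths in the manifest are taken to be relative to the stick's root folder
    return [(c["name"], c["image"], str(root / c["archive"]))
            for c in manifest["containers"]]

# "/Volumes/OUSTICK" is a made-up mount point for the USB stick
for name, image, archive in list_containers(EXAMPLE_MANIFEST, "/Volumes/OUSTICK"):
    print(name, image, archive)
```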

But then again, I also imagined an OU USB study stick and an OU desktop software updater 9 years or so ago and they never went anywhere either..;-)

Using Docker to Build Linked Container Course VMs

Having spent bits of last year tinkering with vagrant and puppet as part of a workflow for building and deploying course related VMs in a scaleable way for a distance education context (trying to be OUseful here…) I’ve more recently started pondering whether it makes more sense to create virtual machines from linked data containers.

Some advantages of the “all in one flat VM” approach seem to be that we can construct puppet files to build particular components and then compose the final machine configuration from a single Vagrant script pulling in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs. There is an overhead associated with running a VM which needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch it off and switch it on again opportunities: a machine can be shut down and restarted/reprovisioned, or if necessary can be deleted and reinstalled (though this latter loses any state that was saved internally in the VM by the student). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted, we can just destroy the container and fire up a replacement (though we’d probably need to find ways of preserving – or deleting – state associated with a particular application container too). If applications run as services, and for example I have a notebook server connected to a database server, if I destroy the database server container, I should be able to point the notebook server to the new database server – if I know the address of the new database server…

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

notebooks:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/notebooks:/notebooks"
data:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/data:/data"

devpostgresdata:
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data

devpostgres:
    environment:
        - POSTGRES_PASSWORD
    image: postgres
    ports:
        - "5433:5432"
    volumes_from:
        - devpostgresdata

notebook:
    environment:
        - PASSWORD
    image: calvingiles/data-science-environment
    links:
        - devpostgres:postgres
    ports:
        - "443:8888"
    volumes_from:
        - notebooks
        - data

(WordPress code plugin & editor sucking atm wrt the way it keeps trying to escape stuff…)

(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L https://gist.githubusercontent.com/calvingiles/b6123c301954fe68e29a/raw/data-science-environment-fig.yml > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers fired up with the desired password settings:

fig rm
PASSWORD=MyPass POSTGRES_PASSWORD=PGPass fig up -d

(The script also creates Google Drive folders into which copies of the notebooks will be located and shared between the VM containers and the host.)

The notebooks can then be accessed via a browser (you need to log in with the specified password – MyPass from the example above); the location of the notebooks is https://IP.ADDRESS:443 (note the https, which may require you saying “yes, really load the page” to Google Chrome – though it is possible to configure the server to use just http), where IP.ADDRESS can be found by running boot2docker ip.

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn’t see it on localhost). I found I needed to connect to the actual IP address within the VM of the database container.

I found this address (IPADDRESS) from the docker commandline using: fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS,port=5432,user='postgres',password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Via http://blog.michaelhamrah.com/2014/06/accessing-the-docker-host-server-within-a-container/
#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternatively, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon
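
The netstat/grep/awk one-liner above can also be done in plain Python, which might be easier to explain to students. This is only a sketch: the default_gateway function is my own naming, and the sample routing table (including the 172.17.42.1 address) is made up rather than copied from a real container.

```python
def default_gateway(netstat_output):
    """Return the gateway of the default route from `netstat -nr` output, or None."""
    for line in netstat_output.splitlines():
        fields = line.split()
        # The default route has destination 0.0.0.0; the gateway is the second column
        if len(fields) >= 2 and fields[0] == "0.0.0.0":
            return fields[1]
    return None

# Illustrative output only; the addresses are made up
SAMPLE = """Kernel IP routing table
Destination     Gateway         Genmask         Flags   MSS Window  irtt Iface
0.0.0.0         172.17.42.1     0.0.0.0         UG        0 0          0 eth0
172.17.0.0      0.0.0.0         255.255.0.0     U         0 0          0 eth0
"""

print(default_gateway(SAMPLE))
```

In a notebook, the lines captured from !netstat -nr could be joined with "\n".join(...) and passed straight to the function.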

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering – how would any particular student know how to connect to “their” personal database server(s)? [UPDATE: doh! Container linking passes name information into a container as an environment variable: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see other containers from their bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules and communication of where to find things be managed? Or am I overcomplicating?!
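
Following up on the container linking note, here’s a sketch of how a notebook might read the address of a linked container from the environment rather than having it typed in. It assumes the ALIAS_PORT_port_TCP_ADDR and ALIAS_PORT_port_TCP_PORT naming that Docker links use for a link alias such as postgres; the linked_service function and the example values are made up for illustration.

```python
import os

def linked_service(alias, port, env=None):
    """Look up the address and port that Docker link variables give for a linked container."""
    env = os.environ if env is None else env
    prefix = "{}_PORT_{}_TCP".format(alias.upper(), port)
    addr = env.get(prefix + "_ADDR")
    tcp_port = env.get(prefix + "_PORT")
    if addr is None or tcp_port is None:
        raise KeyError("no link variables found for {}".format(prefix))
    return addr, int(tcp_port)

# Made-up values standing in for what a linked notebook container might see
example_env = {
    "POSTGRES_PORT_5432_TCP_ADDR": "172.17.0.5",
    "POSTGRES_PORT_5432_TCP_PORT": "5432",
}
print(linked_service("postgres", 5432, example_env))
```

Called without the env argument, the function reads os.environ, so the same cell should work unchanged in each student’s own bundle.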

Anyway – this is a candidate way for constructing a VM out of containers in an automated way. So what are the relative merits, pros, cons etc of using the vagrant/puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (eg trad-uni computer lab with VMs running in student desktops (if that’s allowed!); distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud etc)?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES',port='5432',user='postgres', password="PGPass")

Check other environments with:

import os
os.environ
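
Since os.environ dumps everything, a small filter makes it easier to see just the variables that belong to one linked container. This is a sketch only; the link_variables name and the example values are my own, and it simply assumes the variables of interest start with the upper-cased link alias.

```python
import os

def link_variables(alias, env=None):
    """Return the environment variables whose names start with the given link alias."""
    env = os.environ if env is None else env
    prefix = alias.upper() + "_"
    return {key: value for key, value in sorted(env.items()) if key.startswith(prefix)}

# Made-up environment for illustration
example_env = {
    "HOME": "/root",
    "POSTGRES_NAME": "/notebook/postgres",
    "POSTGRES_PORT_5432_TCP_ADDR": "172.17.0.5",
}
for key, value in link_variables("postgres", example_env).items():
    print(key, value)
```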