With TM351 about to start, I thought I’d revisit the docker container approach to delivering the services required by the course to see if I could get something akin to the course software running in the cloud.
A couple of handy images that work straight out of the can are:
- the customised Jupyter notebooks and course python distribution [psychemedia/ou-tm351-pystack]
- the version of OpenRefine we’re using in the course [psychemedia/ou-tm351-openrefine]
That said, at the current time, the images are not intended for use as part of the course…
The following docker-compose.yml file will create a set of linked containers that resemble (ish!) the monolithic VM we distributed to students as a Virtualbox box.
dockerui:
    container_name: tm351-dockerui
    image: dockerui/dockerui
    ports:
        - "35183:9000"
    volumes:
        - /var/run/docker.sock:/var/run/docker.sock
    privileged: true

devmongodata:
    container_name: tm351-devmongodata
    command: echo mongodb_created
    #Share same layers as the container we want to link to?
    image: mongo:3.0.7
    volumes:
        - /data/db

mongodb:
    container_name: tm351-mongodb
    image: mongo:3.0.7
    ports:
        - "27017:27017"
    volumes_from:
        - devmongodata
    command: --smallfiles

mongodb-seed:
    container_name: tm351-mongodb-seed
    image: psychemedia/ou-tm351-mongodb-simple-seed
    links:
        - mongodb

devpostgresdata:
    container_name: tm351-devpostgresdata
    command: echo created
    image: busybox
    volumes:
        - /var/lib/postgresql/data

postgres:
    container_name: tm351-postgres
    environment:
        - POSTGRES_PASSWORD=PGPass
    image: psychemedia/ou-tm351-postgres
    ports:
        - "5432:5432"

openrefine:
    container_name: tm351-openrefine
    image: psychemedia/tm351-openrefine
    ports:
        - "35181:3333"
    privileged: true

notebook:
    container_name: tm351-notebook
    #build: ./tm351_scipystacknserver
    image: psychemedia/ou-tm351-pystack
    ports:
        - "35180:8888"
    links:
        - postgres:postgres
        - mongodb:mongodb
        - openrefine:openrefine
    privileged: true
Place a copy of the docker-compose.yml file somewhere convenient. Then, from Kitematic, open the command line, cd into the directory containing the YAML file, and enter the command docker-compose up -d. The images are on dockerhub and should be downloaded automatically.
Refer back to Kitematic and you should see running containers – the settings panel for the notebooks container shows the address you can find the notebook server at.
The notebooks and OpenRefine containers should also be linked to shared folders in the directory you ran the Docker Compose script from.
Running the Containers in the Cloud – Docker-Machine and Digital Ocean
As well as running the linked containers on my own machine, my real intention was to see how easy it would be to get them running in the cloud and then use just the browser on my own computer to access them.
And it turns out to be really easy. The following example uses cloud host Digital Ocean.
To start with, you’ll need a Digital Ocean account with some credit in it and a Digital Ocean API token:
(You may be able to get some Digital Ocean credit for free as part of the Github Education Student Developer Pack.)
Then it’s just a case of a few command line instructions to get things running using Docker Machine:
docker-machine ls
#kitematic uses: default

#Create a droplet on Digital Ocean
docker-machine create -d digitalocean --digitalocean-access-token YOUR_ACCESS_TOKEN --digitalocean-region lon1 --digitalocean-size 4gb ou-tm351-test

#Check the IP address of the machine
docker-machine ip ou-tm351-test

#Display information about the machine
docker-machine env ou-tm351-test
#This returns necessary config details
#For example:
##export DOCKER_TLS_VERIFY="1"
##export DOCKER_HOST="tcp://IP_ADDRESS:2376"
##export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"
##export DOCKER_MACHINE_NAME="ou-tm351-test"
# Run this command to configure your shell:
# eval $(docker-machine env ou-tm351-test)

#Set the environment variables as recommended
export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://IP_ADDRESS:2376"
export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"

#Run command to set current docker-machine
eval "$(docker-machine env ou-tm351-test)"

#If the docker-compose.yml file is in .
docker-compose up -d
#This will launch the linked containers on Digital Ocean

#The notebooks should now be viewable at:
#http://IP_ADDRESS:35180

#OpenRefine should now be viewable at:
#http://IP_ADDRESS:35181

#To stop the machine
docker-machine stop ou-tm351-test

#To remove the Digital Ocean droplet (so you stop paying for it...)
docker-machine rm ou-tm351-test

#Reset the current docker machine to the Kitematic machine
eval "$(docker-machine env default)"
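If I wanted to script the droplet creation rather than type it, something like the following Python sketch might do as a starting point. It only builds the docker-machine argument list (using the same flags as above) and the service URLs; the function names create_droplet_command and service_urls are my own inventions for illustration, and nothing is actually launched.

```python
import shlex


def create_droplet_command(machine_name, access_token, region="lon1", size="4gb"):
    """Build the docker-machine create command shown above as an argument list.

    The function name and defaults are illustrative assumptions; the flags are
    the ones used in the command line walkthrough.
    """
    return [
        "docker-machine", "create",
        "-d", "digitalocean",
        "--digitalocean-access-token", access_token,
        "--digitalocean-region", region,
        "--digitalocean-size", size,
        machine_name,
    ]


def service_urls(ip_address, ports=None):
    """Map service names to URLs on the droplet.

    The default host ports are the ones published in the docker-compose.yml file.
    """
    ports = ports or {"notebook": 35180, "openrefine": 35181, "dockerui": 35183}
    return {name: "http://{}:{}".format(ip_address, port) for name, port in ports.items()}


if __name__ == "__main__":
    command = create_droplet_command("ou-tm351-test", "YOUR_ACCESS_TOKEN")
    # Print rather than run, so no droplet is created (or billed) by accident.
    print(" ".join(shlex.quote(part) for part in command))
    print(service_urls("IP_ADDRESS"))
```

To actually run the command, the list could be handed to subprocess.run(command, check=True), with the IP address then coming from docker-machine ip.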
So that’s a start. Issues still arise in terms of persisting state, such as the database contents, notebook files* and OpenRefine projects: if you leave the containers running on Digital Ocean to persist the state, the meter will keep running.
(* I should probably also build a container that demos how to bake a few example notebooks into a container running the notebook server and TM351 python distribution.)
So this is how I currently think of the TM351 VM:
What would be nice would be a drag’n’drop tool that let me draw pictures like that and then generated the build scripts from them… (a docker compose script, or set of puppet scripts, for the architectural bits on the left, and a Vagrantfile to set up the port forwarding, for example).
For docker, I wouldn’t have thought that would be too hard – a docker compose file could describe most of that picture, right? Not sure how fiddly it would be for a more traditional VM, though, depending on how it was put together?
Although it was a beautiful day today, and I should really have spent it in the garden, or tinkering with F1 data, I lost the day to the screen and keyboard pondering various ways in which we might be able to use Kitematic to support course activities.
One thing I’ve had on pause for some time is the possibility of distributing docker images to students via a USB stick, and then loading them into Kitematic. To do this we need to get tarballs of the appropriate images so we could then distribute them.
docker save psychemedia/openrefine_ou:tm351d2test | gzip -c > test_openrefine_ou.tgz
docker save psychemedia/tm351_scipystacknserver:tm351d3test | gzip -c > test_ipynb.tgz
docker save psychemedia/dockerui_patch:tm351d2test | gzip -c > test_dockerui.tgz
docker save busybox:latest | gzip -c > test_busybox.tgz
docker save mongo:latest | gzip -c > test_mongo.tgz
docker save postgres:latest | gzip -c > test_postgres.tgz
On the to do list is getting these to work with the portable Kitematic branch (I’m not sure if that branch will continue, or whether the interest is too niche?!), but in the meantime, I could load an image into the Kitematic VM from the Kitematic CLI using:
docker load < test_mongo.tgz
assuming the test_mongo.tgz file is in the current working directory.
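Since the save and load commands all follow the same pattern, they could be generated from a single table of images. The following is a rough Python sketch of that idea; the helper names save_command and load_command are hypothetical, and the commands are only built as strings, not executed.

```python
import shlex

# Image name -> archive file name, as used in the save commands above.
IMAGES = {
    "psychemedia/openrefine_ou:tm351d2test": "test_openrefine_ou.tgz",
    "psychemedia/tm351_scipystacknserver:tm351d3test": "test_ipynb.tgz",
    "psychemedia/dockerui_patch:tm351d2test": "test_dockerui.tgz",
    "busybox:latest": "test_busybox.tgz",
    "mongo:latest": "test_mongo.tgz",
    "postgres:latest": "test_postgres.tgz",
}


def save_command(image, archive_name):
    """Return the shell pipeline that saves an image as a gzipped tarball."""
    return "docker save {} | gzip -c > {}".format(
        shlex.quote(image), shlex.quote(archive_name)
    )


def load_command(archive_name):
    """Return the shell command that loads a saved archive back into docker."""
    return "docker load < {}".format(shlex.quote(archive_name))


if __name__ == "__main__":
    # Emit one script for the build machine and one for the student machine.
    for image, archive in IMAGES.items():
        print(save_command(image, archive))
    for archive in IMAGES.values():
        print(load_command(archive))
```

The printed lines could be written to a pair of shell scripts, one to sit on the build machine and one to go on the USB stick alongside the tarballs.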
Another thing I need to explore is how to set up the data volume containers on the students’ machines.
The current virtual machine build scripts aim to seed the databases from raw data, but to set up the student machines it would seem more sensible to either rebuild a database from a backup, or just load in a copy of the seeded data volume container. (All the while we have to be mindful of providing a route for the students to recreate the original, as distributed, setup, just in case things go wrong. At the same time, we also need to start thinking about backup strategies for the students so they can checkpoint their own work…)
The traditional backup and restore route for PostgreSQL seems to be something like the following:
#Use docker exec to run a postgres export
docker exec -t vagrant_devpostgres_1 pg_dumpall -Upostgres -c > dump_`date +%d-%m-%Y"_"%H_%M_%S`.sql

#If it's a large file, maybe worth zipping:
pg_dump dbname | gzip > filename.gz

#The restore route would presumably be something like:
cat postgres_dump.sql | docker exec -i vagrant_devpostgres_1 psql -Upostgres

#For the compressed backup:
cat postgres_dump.gz | gunzip | psql -Upostgres
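If the backups were being driven from a script (or a notebook) rather than the shell, the timestamped file name and the compression step might look something like this in Python. This is just a sketch: the function names dump_filename and gzip_file are made up for the example, and it only handles the file naming and zipping, not the call to pg_dumpall itself.

```python
import gzip
import shutil
from datetime import datetime


def dump_filename(now=None):
    """Name a dump file in the same style as dump_`date +%d-%m-%Y"_"%H_%M_%S`.sql."""
    now = now or datetime.now()
    return now.strftime("dump_%d-%m-%Y_%H_%M_%S.sql")


def gzip_file(path):
    """Compress the file at path to path + '.gz' and return the new name."""
    gz_path = path + ".gz"
    with open(path, "rb") as source, gzip.open(gz_path, "wb") as target:
        shutil.copyfileobj(source, target)
    return gz_path


if __name__ == "__main__":
    # Shows the name a dump taken right now would be given.
    print(dump_filename())
```

One thing to note about the day-month-year naming is that the files won't sort chronologically in a directory listing; a year-month-day pattern would, if that matters.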
For mongo, things seem to be a little bit more complicated. Something like:
docker exec -t vagrant_mongo_1 mongodump

#Complementary restore command is:
mongorestore
would generate a dump in the container, but then we’d have to tar it and get it out? Something like these mongodump containers may be easier? (mongo seems to have issues with mounting data containers on host, on a Mac at least?)
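One possible way of getting the dump out, which I haven't tested against this setup, would be to have mongodump write to a known directory in the container and then copy that directory to the host with docker cp. Here's a Python sketch that just builds the two commands; the function name mongodump_commands and the /dump and ./mongodump paths are assumptions for illustration.

```python
def mongodump_commands(container, host_dir="./mongodump", container_dir="/dump"):
    """Build the commands to dump a mongo database and copy the dump to the host.

    The directory names are placeholders; nothing is executed here.
    """
    return [
        # Write the dump to a known directory inside the running container...
        ["docker", "exec", container, "mongodump", "--out", container_dir],
        # ...then copy that directory out of the container onto the host.
        ["docker", "cp", "{}:{}".format(container, container_dir), host_dir],
    ]


if __name__ == "__main__":
    for command in mongodump_commands("vagrant_mongo_1"):
        print(" ".join(command))
```

Each list could be passed to subprocess.run in turn; the copied directory could then be tarred up on the host for distribution.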
By the by, if you need to get into a container within a Vagrant launched VM (I use vagrant with vagrant-docker-compose), the following shows how:
#If you need to get into a container:
vagrant ssh
#Then in the VM:
docker exec -it CONTAINERNAME bash
Another way of getting to the data is to export the contents of the seeded data volume containers from the build machine. For example:
# Export data from a data volume container that is linked to a database server

#postgres
docker run --volumes-from vagrant_devpostgres_1 -v $(pwd):/backup busybox tar cvf /backup/postgresbackup.tar /var/lib/postgresql/data
#I wonder if these should be run with --rm to dispose of the temporary container once run?

#mongo - BUT SEE CAVEAT BELOW
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db
We can then take the tar file, distribute it to students, and use it to seed a data volume container.
Again, from the Kitematic command line, I can run something like the following to create a couple of data volume containers:
#Create a data volume container
docker create -v /var/lib/postgresql/data --name devpostgresdata busybox true

#Restore the contents
docker run --volumes-from devpostgresdata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/postgresbackup.tar"
#Note - the docker helpfiles don't show how to use sh -c - which appears to be required...
#Again, I wonder whether this should be run with --rm somewhere to minimise clutter?
Unfortunately, things don’t seem to run so smoothly with mongo?
#Unfortunately, when trying to run a mongo server against a data volume container
#the presence of a mongod.lock seems to break things
#We probably shouldn't do this, but if the database has settled down and completed
# all its writes, it should be okay?!
#(The exclude pattern is quoted so the shell doesn't try to expand it)
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar --exclude='*mongod.lock' /data/db
#This generates a copy of the distributable file without the lock...

#Here's an example of the reconstitution from the distributable file for mongo
docker create -v /data/db --name devmongodata busybox true
docker run --volumes-from devmongodata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/mongobackup.tar"
(If I’m doing something wrong wrt getting the mongo data out of the container, please let me know… I wonder as well, given the cavalier way I treat the lock file, whether the mongo container should be started up in repair mode?!)
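For what it's worth, the same "tar everything except the lock file" step can be sketched in Python using the standard tarfile module, which might be handy if the packaging were ever driven from a notebook. The function names here are hypothetical, and the demo works on a throwaway directory rather than a real /data/db.

```python
import os
import tarfile
import tempfile


def _skip_lock_files(tarinfo):
    """Leave mongod.lock out of the archive; keep every other entry."""
    if os.path.basename(tarinfo.name) == "mongod.lock":
        return None
    return tarinfo


def backup_directory(source_dir, archive_path):
    """Write source_dir to a tar archive, omitting any mongod.lock file."""
    top_level = os.path.basename(os.path.normpath(source_dir))
    with tarfile.open(archive_path, "w") as archive:
        archive.add(source_dir, arcname=top_level, filter=_skip_lock_files)
    return archive_path


def demo():
    """Archive a throwaway directory and return the archive member names."""
    with tempfile.TemporaryDirectory() as workdir:
        data_dir = os.path.join(workdir, "db")
        os.mkdir(data_dir)
        # Placeholder files standing in for real database files.
        for name in ("collection.bson", "mongod.lock"):
            with open(os.path.join(data_dir, name), "w") as handle:
                handle.write("placeholder")
        archive_path = backup_directory(data_dir, os.path.join(workdir, "mongobackup.tar"))
        with tarfile.open(archive_path) as archive:
            return archive.getnames()


if __name__ == "__main__":
    print(demo())
```

The same caveat applies as for the shell version: dropping the lock file is only reasonable if the database has been shut down cleanly or has finished all its writes.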
If we have a docker-compose.yml file in the working directory like the following:
mongo:
    image: mongo
    ports:
        - "27017:27017"
    volumes_from:
        - devmongodata

##We DO NOT need to declare the data volume here
#We have already created it
#Also, if we leave it in, a "docker-compose rm" command
#will destroy the data volume container...
#...which means we wouldn't persist the data in it
#devmongodata:
#    command: echo created
#    image: busybox
#    volumes:
#        - /data/db
We can then run docker-compose up and it should fire up a mongo container and link it to the seeded data volume container, making the data contained in that data volume container available to us.
I’ve popped some test files here. Download and unzip them; then, from the Kitematic CLI, cd into the unzipped dir, create and populate the data containers as above, and run: docker-compose up
You should be presented with some application containers including OpenRefine and an OU customised IPython notebook server. You’ll need to mount the IPython notebooks folder onto the unzipped folder. The example notebook (if everything works!) should demonstrate calls to prepopulated mongo and postgres databases.
A week or so ago I came across a couple of IPython notebooks produced by Catherine Devlin covering the maintenance and tuning of a PostgreSQL server: DB Introspection Notebook (example 1: introspection, example 2: tuning, example 3: performance checklist). One of the things we have been discussing in the TM351 course team meetings is the extent to which we “show our working” to students in terms of how the virtual machine and the various databases used in the course were put together, even if we don’t actually teach that stuff.
Notebooks make an ideal way of documenting the steps taken to set up a particular system, blending commentary with command line as well as code executable cells.
The various approaches I’ve explored to build the VM have arguably been over-complex – vagrant, puppet, docker and docker-compose – but I’ve always seen the OU as a place where we explore the technologies we’re teaching – or might teach – in the context of both course production and course delivery (that is, we can often use a reflexive approach whereby the content of the teaching also informs the development and delivery of the teaching).
In contrast, in A DevOps Approach to Common Environment Educational Software Provisioning and Deployment I referred to a couple of examples of a far simpler approach, in which common research, or research and teaching, VM environments were put together using simple scripts. This approach is perhaps more straightforwardly composable, in that if someone has additional requirements of the machine, they can just add a simple configuration script to bring in those elements.
In our current course example, where the multiple authors have a range of skill and interest levels when it comes to installing software and exploring different approaches to machine building, I’m starting to wonder whether I should have started with a simple base machine running just an IPython notebook server and no additional libraries or packages, and then created a series of notebooks, one for each part of the course (which broadly breaks down to one part per author), containing instructions for installing all the bits and pieces required for just that part of the course. If there’s duplication across parts, trying to install the same thing for each part, that’s fine – the various package managers should be able to cope with that. (The only issue would arise if different authors needed different versions of the same package, for some reason, and I’m not sure what we’d do in that case?)
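Spotting the awkward case, where two parts pin different versions of the same package, is at least easy to automate. Here's a minimal Python sketch, assuming each part's requirements are available as a list of pip-style strings; the function name, part names and version numbers are all invented for the example.

```python
from collections import defaultdict


def find_version_conflicts(requirements_by_part):
    """Return {package: {version: [parts]}} for packages pinned to more than one version.

    requirements_by_part maps a part name to a list of pip-style requirement
    strings such as "pandas==0.16.2" or just "numpy".
    """
    pins = defaultdict(lambda: defaultdict(list))
    for part, requirements in requirements_by_part.items():
        for requirement in requirements:
            name, _, version = requirement.partition("==")
            if version:  # an unpinned requirement can't conflict with anything
                pins[name.strip().lower()][version.strip()].append(part)
    return {
        name: dict(versions)
        for name, versions in pins.items()
        if len(versions) > 1
    }


if __name__ == "__main__":
    # Hypothetical requirements for three parts of a course.
    example = {
        "part1": ["pandas==0.16.2", "numpy"],
        "part2": ["pandas==0.17.0"],
        "part3": ["pandas==0.16.2", "numpy"],
    }
    print(find_version_conflicts(example))
```

Duplicated, unpinned requests (numpy in the example) are ignored, which matches the idea that the package managers can cope with repeats; only genuinely clashing pins are reported back for the authors to sort out.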
The notebooks would then include explanatory documentation and code cells to install Linux packages and python packages. Authors could take over the control of setup notebooks, or just make basic requests. At some level, we might identify a core offering (for example, in our course, this might be the inclusion of the pandas package) that might be pushed up into a core configuration installation notebook executed prior to the installation notebook for each part.
Configuring the machine would then be a case of running the separate configuration notebooks for each part (perhaps preceded by a core configuration notebook), perhaps by automated means. For example: ipython nbconvert --to=html --ExecutePreprocessor.enabled=True configNotebook_1.ipynb [via StackOverflow]. This generates an output HTML report from running the code cells in the notebook (which can include command line commands) in a headless IPython process (I think!).
The following switch may also be useful (it clears the output cells): ipython nbconvert --to=pdf --ExecutePreprocessor.enabled=True --ClearOutputPreprocessor.enabled=True RunMe.ipynb (note in this case we generate a PDF report).
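Running a whole set of configuration notebooks in order could then be scripted along the following lines. This Python sketch only builds the nbconvert commands shown above, core notebook first; the function names and the coreConfig.ipynb file name are assumptions, and since later Jupyter releases provide the same subcommand as jupyter nbconvert, the launcher is left as a parameter.

```python
def nbconvert_command(notebook, report_format="html", clear_output=False, launcher="ipython"):
    """Build the command that executes a notebook and writes a report."""
    command = [
        launcher, "nbconvert",
        "--to={}".format(report_format),
        "--ExecutePreprocessor.enabled=True",
    ]
    if clear_output:
        command.append("--ClearOutputPreprocessor.enabled=True")
    command.append(notebook)
    return command


def build_plan(part_notebooks, core_notebook=None):
    """Return the commands to run, with the core notebook (if any) first."""
    notebooks = ([core_notebook] if core_notebook else []) + list(part_notebooks)
    return [nbconvert_command(notebook) for notebook in notebooks]


if __name__ == "__main__":
    # Hypothetical notebook names: one core notebook, then one per course part.
    plan = build_plan(
        ["configNotebook_1.ipynb", "configNotebook_2.ipynb"],
        core_notebook="coreConfig.ipynb",
    )
    for command in plan:
        print(" ".join(command))
```

Passing each command to subprocess.run(command, check=True) would stop the build at the first notebook whose command fails, assuming nbconvert returns a non-zero exit code in that case.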
To build the customised VM box, the following route should work:
– set up a simple Vagrant file to import a base box
– install IPython into the box
– copy the configuration notebooks into the box
– run the configuration notebooks
– export the customised box
This approach has the benefits of using simple, literate configuration scripts described within a notebook. This makes them perhaps a little less “hostile” than shell scripts, and perhaps makes it easier to build in tests inline, and report on them nicely. (If running a cell results in an error, I think the execution of the notebook will stop at that point?) The downside is that to run the notebooks, we also need to have IPython installed first.
In Distributing Software to Students in a BYOD Environment, I briefly reviewed a paper that reported on the use of Debian metapackages to support the configuration of Linux VMs for particular courses (each course has its own Debian metapackage that could install all the packages required for that course).
This idea of automating the build of machines comes under the wider banner of DevOps (development and operations). In a university setting, we might view this in several ways:
- the development of course related software environments during course production, the operational distribution and deployment of software to students, updating and support of the software in use, and maintenance and updating of software between presentations of a course;
- the development of software environments for use in research, the operation of those environments during the lifetime of a research project, and the archiving of those environments;
- the development and operation of institutional IT services.
In an Educause review from 2014 (Cloud Strategy for Higher Education: Building a Common Solution, EDUCAUSE Center for Analysis and Research (ECAR) Research Bulletin, November, 2014 [link]), a pitch for universities making greater use of cloud services, the authors make the observation that:
In order to make effective use of IaaS [Infrastructure as a Service], an organization has to adopt an automate-first mindset. Instead of approaching servers as individual works of art with artisan application configurations, we must think in terms of service-level automation. From operating system through application deployment, organizations need to realize the ability to automatically instantiate services entirely from source code control repositories.
This is the approach I took from the start when thinking about the TM351 virtual machine, focussing more on trying to identify production, delivery, support and maintenance models that might make sense in a factory production model that should work in a scaleable way not only across presentations of the same course, and across different courses, but also across different platforms (students’ own devices, OU managed cloud hosts, student launched commercial hosts), rather than just building a bespoke, boutique VM for a single course. (I suspect the module team would have preferred my focussing on the latter – getting something that works reliably, has been rigorously tested, and can be delivered to students rather than pfaffing around with increasingly exotic and still-not-really-invented-yet tooling that I don’t really understand to automate production of machines from scratch that still might be a bit flaky!;-)
Anyway, it seems that the folk at Berkeley have been putting together a “Common Scientific Compute Environment for Research and Education” [Clark, D., Culich, A., Hamlin, B., & Lovett, R. (2014). BCE: Berkeley’s Common Scientific Compute Environment for Research and Education, Proceedings of the 13th Python in Science Conference (SciPy 2014).]
The BCE – Berkeley Common Environment – is “a standard reference end-user environment” consisting of a simply skinned Linux desktop running in a virtual machine delivered as a Virtualbox appliance that “allows for targeted instructions that can assume all features of BCE are present. BCE also aims to be stable, reliable, and reduce complexity more than it increases it”. The development team adopted a DevOps style approach customised for the purposes of supporting end-user scientific computing, arising from the recognition that they “can’t control the base environment that users will have on their laptop or workstation, nor do we wish to! A useful environment should provide consistency and not depend on or interfere with users’ existing setup”, further “restrict[ing] ourselves to focusing on those tools that we’ve found useful to automate the steps that come before you start doing science”. Three main frames of reference were identified:
- instructional: students could come from all backgrounds and are often unlikely to have sys admin skills over and above the ability to use a simple GUI approach to software installation: “The most accessible instructions will only require skills possessed by the broadest number of people. In particular, many potential students are not yet fluent with notions of package management, scripting, or even the basic idea of commandline interfaces. … [W]e wish to set up an isolated, uniform environment in its totality where instructions can provide essentially pixel-identical guides to what the student will see on their own screen.”
- scientific collaboration: that is, the research context: “It is unreasonable to expect any researcher to develop code along with instructions on how to run that code on any potential environment.” In addition, “[i]t is common to encounter a researcher with three or more Python distributions installed on their machine, and this user will have no idea how to manage their command-line path, or which packages are installed where. … These nascent scientific coders will have at various points had a working system for a particular task, and often arrive at a state in which nothing seems to work.”
- central support: “The more broadly a standard environment is adopted across campus, the more familiar it will be to all students”, with obvious benefits when providing training or support based on the common environment.
Whilst it was recognised that personal laptop computers are perhaps the most widely used platform, the team argued that the “environment should not be restricted to personal computers”. Some scientific computing operations are likely to stretch the resources of a personal laptop, so the environment should also be capable of running on other platforms such as hefty workstations or on a scientific computing cloud.
The first consideration was to standardise on an O/S: Linux. Since the majority of users don’t run Linux machines, this required the use of a virtual machine (VM) to host the Linux system, whilst still recognising that “one should assume that any VM solution will not work for some individuals and provide a fallback solution (particularly for instructional environments) on a remote server”.
Another issue that can arise is dealing with mappings between host and guest OS, which vary from system to system – arguing for the utility of an abstraction layer for VM configuration like Vagrant or Packer … . This includes things like portmapping, shared files, enabling control of the display for a GUI vs. enabling network routing for remote operation. These settings may also interact with the way the guest OS is configured.
Reflecting on the “traditional” way of building a computing environment, the authors argued for a more automated approach:
Creating an image or environment is often called provisioning. The way this was done in traditional systems operation was interactively, perhaps using a hybrid of GUI, networked, and command-line tools. The DevOps philosophy encourages that we accomplish as much as possible with scripts (ideally checked into version control!).
The tools explored included Ansible, packer, vagrant and docker:
- Ansible: to declare what gets put into the machine (alternatives include shell scripts, puppet etc.). (For the TM351 monolithic VM, I used puppet.) End-users don’t need to know anything about Ansible, unless they want to develop a new, reproducible, custom environment.
- packer: used to run the provisioners and construct and package up a base box. Again, end-users don’t need to know anything about this. (For the TM351 monolithic VM, I used vagrant to build a basebox in Virtualbox, and then package it; the power of Packer is that it lets you generate builds from a common source for a variety of platforms (AWS, Virtualbox, etc etc).)
- vagrant: their description is quite a handy one: “a wrapper around virtualization software that automates the process of configuring and starting a VM from a special Vagrant box image … . It is an alternative to configuring the virtualization software using the GUI interface or the system-specific command line tools provided by systems like VirtualBox or Amazon. Instead, Vagrant looks for a Vagrantfile which defines the configuration, and also establishes the directory under which the vagrant command will connect to the relevant VM. This directory is, by default, synced to the guest VM, allowing the developer to edit the files with tools on their host OS. From the command-line (under this directory), the user can start, stop, or ssh into the Vagrant-managed VM. It should be noted that (again, like Packer) Vagrant does no work directly, but rather calls out to those other platform-specific command-line tools.” However, “while Vagrant is conceptually very elegant (and cool), we are not currently using it for BCE. In our evaluation, it introduced another piece of software, requiring command-line usage before students were comfortable with it”. This is one issue we are facing with the TM351 VM: there is currently a requirement to use vagrant to manage the VM from the commandline. That said, this only really requires a couple of commands (we can probably get away with just vagrant up && vagrant provision and vagrant suspend), and it also has a couple of benefits, like being able to trivially vagrant ssh in to the VM if absolutely necessary…
- docker: was perceived as adding complexity, both computationally and conceptually: “Docker requires a Linux environment to host the Docker server. As such, it clearly adds additional complexity on top of the requirement to support a virtual machine. … the default method of deploying Docker (at the time of evaluation) on personal computers was with Vagrant. This approach would then also add the complexity of using Vagrant. However, recent advances with boot2docker provide something akin to a VirtualBox-only, Docker-specific replacement for Vagrant that eliminates some of this complexity, though one still needs to grapple with the cognitive load of nested virtual environments and tooling.” The recent development of Kitematic addresses some of the use-case complexity, and also provides GUI based tools for managing some of the issues described above associated with portmapping, file sharing etc. Support for linked container compositions (using Docker Compose) is still currently lacking though…
At the end of the day, Packer seems to rule them all – coping as it does with simple installation scripts and being able to then target the build for any required platform. The project homepage is here: Berkeley Common Environment and the github repo here: Berkeley Common Environment (Github).
The paper also reviewed another common environment – OSGeo. Once again built on top of a common Linux base, well documented shell scripts are used to define package installations: “[n]otably, the project uses none of the recent DevOps tools. OSGeo-Live is instead configured using simple and modular combinations of Python, Perl and shell scripts, along with clear install conventions and examples. Documentation is given high priority. … Scripts may call package managers, and generally have few constraints (apart from conventions like keeping recipes contained to a particular directory)”. In addition, “[s]ome concerns, like port number usage, have to be explicitly managed at a global level”. This approach contrasts with the approach reviewed in Distributing Software to Students in a BYOD Environment where Debian metapackages were used to create a common environment installation route.
The idea of a common environment is a good one, and that would work particularly well in a curriculum such as Computing, I think? One main difference between the BCE approach and the TM351 approach is that BCE is self-contained and runs a desktop environment within the VM, whereas the TM351 environment uses a headless VM and follows more of a microservice approach that publishes HTML based service UIs via http ports that can be viewed in a browser. One disadvantage of the latter approach is that you need to keep a more careful eye on port assignments (in the event of collisions) when running the VM locally.
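Keeping an eye on port assignments is the sort of thing that could be checked before the services are launched. The following Python sketch looks for host ports claimed by more than one service, and for ports that are already in use locally. It assumes simple "HOST:CONTAINER" mappings of the kind used in the docker-compose.yml file; the function names are made up for illustration.

```python
import socket


def host_port(mapping):
    """Return the host side of a simple "HOST:CONTAINER" port mapping as an int."""
    return int(mapping.split(":")[0])


def find_collisions(service_ports):
    """Return {host_port: [services]} for host ports published by several services."""
    seen = {}
    for service, mappings in service_ports.items():
        for mapping in mappings:
            seen.setdefault(host_port(mapping), []).append(service)
    return {port: services for port, services in seen.items() if len(services) > 1}


def port_is_free(port, host="127.0.0.1"):
    """Check whether a local TCP port can currently be bound."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


# Port mappings as published in the docker-compose.yml file.
TM351_PORTS = {
    "dockerui": ["35183:9000"],
    "mongodb": ["27017:27017"],
    "postgres": ["5432:5432"],
    "openrefine": ["35181:3333"],
    "notebook": ["35180:8888"],
}

if __name__ == "__main__":
    print("Collisions between services:", find_collisions(TM351_PORTS))
    for service, mappings in TM351_PORTS.items():
        for mapping in mappings:
            port = host_port(mapping)
            print(service, port, "free" if port_is_free(port) else "already in use")
```

A student who already runs a local PostgreSQL server on 5432, say, would get an early warning rather than a failed container launch.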
Reading around a variety of articles on the various ways of deploying software in education, it struck me that in traditional institutions a switch may be taking place from students making use of centrally provided computing services – including physical access to desktop computers – to students bringing their own devices on which they may want to run the course software themselves. In addition, traditional universities are also starting to engage increasingly with their own distance education students; and the rise of MOOCs is based around the idea of online course provision – that is, distance education.
The switch from centrally provided computers to a BYOD regime contrasts with the traditional approach in distance education in which students traditionally provided their own devices and onto which they installed software packaged and provided by their educational institution. That is, distance education students have traditionally been BYOD users.
However, in much the same way that the library in a distance education institution like the OU could not originally provide physical information (book lending) services to students, instead brokering access agreements with other HE libraries, but now can provide a traditional library service through access to digital collections, academic computing services are perhaps now more in a position where they can provide central computing services, at scale, to their students. (Contributory factors include: readily available network access for students, cheaper provider infrastructure costs (servers, storage, bandwidth, etc).)
With this in mind, it is perhaps instructive for those of us working in distance education to look at how the traditional providers are coping with an influx of BYOD users, and how they are managing access to, and the distribution of, software to this newly emerging class of user (for them) whilst at the same time continuing to provide access to managed facilities such as computing labs and student accessed machines.
Notes from: Supporting CS Education via Virtualization and Packages – Tools for Successfully Accommodating “Bring-Your-Own-Device” at Scale, Andy Sayler, Dirk Grunwald, John Black, Elizabeth White, and Matthew Monaco SIGCSE’14, March 5–8, 2014, Atlanta, GA, USA [PDF]
The authors describe “a standardized development environment for all core CS courses across a range of both school-owned and student-owned computing devices”, leveraging “existing off-the-shelf virtualization and software management systems to create a common virtual machine that is used across all of our core computer science courses”. The goal was to “provide students with an easy to install and use development environment that they could use across all their CS courses. The development environment should be available both on department lab machines, and as a VM for use on student-owned machines (e.g. as a ‘lab in a box’).”
From the student perspective, our solution had to: a) Run on a range of host systems; b) Be easy to install; c) Be easy to use and maintain; d) Minimize side-effects on the host system; e) Provide a stable experience throughout the semester.
From the instructor perspective, our solution had to: a) Keep the students happy; b) Minimize instructor IT overhead; c) Provide consistent results across student, grader, and instructor machines; d) Provide all necessary software for the course; e) Provide the ability to update software as the course progresses.
VirtualBox was adopted on the grounds that it runs cross-platform, is free, open source software, and has good support for running Linux guest machines. The VM was based on Ubuntu 12.04 (presumably the long term support release available at the time) and distributed as an .ova image.
To support the distribution of software packages for a particular course, Debian metapackages (which simply list dependencies) were created on a per course basis; these could then be used via apt-get to install all the packages required for a particular course (example package files). (In passing, I note that the Anaconda python distribution supports the notion of python (conda) metapackages, but pip does not, specifically?)
In terms of student support, the team published “a central web-page that provides information about the VM, download links, installation instructions, common troubleshooting steps, and related self-help information” along with “YouTube videos describing the installation and usage of the VM”. Initial distribution is provided using BitTorrent. Where face-to-face help sessions are required, VM images are provided on USB memory sticks to avoid download time delays. Backups are handled by bundling Dropbox into the VM and encouraging students to place their files there. (GitHub is also used.)
The following observation is useful in respect of student experience of VM performance:
“Modern CPUs provide extensions that enable a fast, smooth and enjoyable VM experience (i.e. VT-x). Unfortunately, many non-Apple PC manufacturers ship their machines with these extension disabled in the BIOS. Getting students to enable these extensions can be a challenge, but makes a big difference in their overall impression of VM usability. One way to force students to enable these extensions is to use a 64-bit and/or multi-core VM, which VirtualBox will not start without virtualization extensions enabled.”
The open issues identified by the team are: virtualisation support; corrupted downloads of the VM (mitigation includes publishing a checksum for the VM and verifying against this); and the lack of a computer capable of running the VM (ARM devices, low specification Intel Atom computers). [On this latter point, it may be worth highlighting the distinction between hardware that cannot cope with running computationally intensive applications, hardware that has storage limitations, and hardware that cannot run particular virtualisation services (for example, that cannot run x86 virtualisation). See also: What Happens When “Computers” Are Replaced by Tablets and Phones?]
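The checksum mitigation is easy enough to sketch. The following is a minimal Python example, assuming a SHA-256 digest is published alongside the image; the paper does not say which hash the team used, and the function names are my own invention rather than anything from their tooling:

```python
import hashlib


def sha256_of_file(path, chunk_size=1024 * 1024):
    """Return the hex SHA-256 digest of a file, read in chunks.

    Reading in chunks matters here: a VM image may be several GB,
    so we avoid loading the whole file into memory.
    """
    digest = hashlib.sha256()
    with open(path, "rb") as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def download_is_intact(path, published_checksum):
    """Compare a downloaded file against a published hex digest.

    Assumption: the digest is published as a hex string, possibly with
    surrounding whitespace or in upper case.
    """
    return sha256_of_file(path) == published_checksum.strip().lower()
```

A student (or an installer script) would call something like download_is_intact("course_vm.ova", digest_from_course_website) before importing the appliance, where both the filename and the digest source are placeholders.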
The idea of using package management is attractive, and contrasts with the approach I took when hacking together the TM351 VM using vagrant and puppet scripts. It might make sense to further abstract the machine components into a Debian metapackage and a simple python/pip “meta” package (i.e. one that simply lists dependencies). The result would be an installation reduced to a couple of lines of the form:
apt-get install ou-tm351=15J.0
pip install ou_tm351==15J.0
where packages are versioned to a particular presentation of an OU course, with a minor version number to accommodate any updates/patches. One downside to this approach is that it splits co-dependency relationships between python and Debian packages relative to a particular application. In the current puppet build files for the monolithic VM build, each application has its own puppet file that installs the additional libraries over base libraries required for a particular application. (In addition, particular applications can specify dependencies on base libraries.) For the dockerised VM build, each container image has its own Dockerfile that identifies the dependencies for that image.
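As a rough sketch of what the python side of such a "meta" package might look like, the snippet below builds the keyword arguments for a dependency-only setuptools package. Several things here are assumptions on my part: the dependency list is purely illustrative, the helper names are hypothetical, and the version mapping exists because, as far as I can tell, a string like 15J.0 is not a valid PEP 440 version, so current packaging tools would be likely to reject it. I have assumed the presentation letter can be read as a month by alphabet position (J as 10), giving 15.10.0.

```python
import re

# Illustrative dependency list only; the real course stack is not defined here.
COURSE_DEPENDENCIES = ["pandas", "psycopg2", "sqlalchemy", "pymongo"]


def presentation_to_version(presentation, patch=0):
    """Map a presentation code such as '15J' to a numeric version like '15.10.0'.

    Assumption: the trailing letter encodes a month by alphabet position.
    """
    match = re.fullmatch(r"(\d{2})([A-L])", presentation.upper())
    if match is None:
        raise ValueError("unrecognised presentation code: %r" % presentation)
    year, letter = match.groups()
    month = ord(letter) - ord("A") + 1
    return "{}.{}.{}".format(int(year), month, patch)


def metapackage_kwargs(presentation, patch=0):
    """Keyword arguments for setuptools.setup() for a dependency-only package."""
    return {
        "name": "ou_tm351",
        "version": presentation_to_version(presentation, patch),
        "description": "Dependency-only package for a course presentation",
        "packages": [],  # the package ships no code of its own
        "install_requires": list(COURSE_DEPENDENCIES),
    }
```

In a setup.py this would be consumed as setup(**metapackage_kwargs("15J")), with a patch release bumping the final number.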
Tracing its history (and reflecting the accumulated clutter of my personal VM learning journey!) the draft TM351 VM is currently launched and provisioned using vagrant, partly because I can’t seem to start the IPython Notebook reliably from a startup script:-( Distributing the machine as a start/stoppable appliance (i.e. as an Open Virtualization Format/.ova package) might be more convenient, if we could guarantee that file sharing with host works as required (sharing against a specific folder on host) and any port collisions experienced by the provided services can be managed and worked around?
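On the port collision question, one way a launcher script might work around a clash is to probe the host before mapping a service port. This is only a sketch using the Python standard library; the function names are hypothetical, and it assumes the services are published on the loopback interface:

```python
import socket


def port_is_free(port, host="127.0.0.1"):
    """Return True if we can bind to (host, port), i.e. nothing else holds it."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        try:
            sock.bind((host, port))
        except OSError:
            return False
    return True


def first_free_port(preferred, attempts=20):
    """Return the first free port at or above `preferred`, or None if none found."""
    for candidate in range(preferred, min(preferred + attempts, 65536)):
        if port_is_free(candidate):
            return candidate
    return None
```

There is an unavoidable race between checking a port and actually using it, so this is a convenience check rather than a guarantee; the chosen port would still need to be reported back to the student so they know which URL to open.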
Port collisions are less of an issue for Sayler et al. because their model is that students will be working within the VM context – a “desktop as a local service” (or “platform as a local service”) model; the TM351 VM model provides services that run within the VM, some of which are exposed via http to the host – more of a “software as a local service” model. In the cloud, software-as-a-service and desktop-as-a-service models are end-user delivery models, where users access services through a browser or lightweight desktop client, compared with “platform-as-a-service” offerings where applications can be developed and delivered within a managed development environment offering high level support services, or “infrastructure as a service” offerings, which provide access to base computing components (computational processing, storage, networking, etc.).
Note that what interests me particularly are delivery models that support all three of the following models: BYOD, campus lab, and cloud/remotely hosted offerings (as a crude shorthand, I use ‘cloud’ to mean environments that are responsive in terms of firing up servers to meet demand). The notions of personal computing environments, course computing environments and personal course computing environments might also be useful (for example, a course computing environment might be a generic container populated with course software; a personal course computing container might then be a container linked to a student’s identity, with persisted state and linked storage, or a course container running on a student’s own device), alongside research computing environments and personal research computing environments.
It’s getting to that time when we need to freeze the virtual machine build we’re going to use for the new (postponed) data course, which should hopefully go live to students in February, 2016, and I’ve been having a rethink about how to put it together.
The story so far has been documented in several blog posts and charts my learning journey from knowing nothing about virtual machines (not sure why I was given the task of putting it together?!) to knowing how little I know about Linux administration, PostgreSQL, MongoDB, Linux networking, virtual machines and virtualisation (which is to say, knowing I don’t know enough to do any of this stuff properly…;-)
The original plan was to put everything into a single VM and wire all the bits together. One of the activities needed to fire up several MongoDB instances as part of a mongo replica set, and I opted to use containers to do that.
Over the last few months, I started to wonder whether we should containerise everything separately, then deploy compositions of containers. The rationale behind this approach is that it means we could make use of a single VM to host applications for several users if we get as far as cloud hosting services/applications for our students. It also means students can start, stop or “reinstall” particular applications in isolation from the other VM applications they’re running.
I think I’ve got this working in part now, though it’s still very much tied to the single user – I’m doing things with permissions that would never be allowed (and that would possibly break things..) if we were running multiple users in the same VM.
So what’s the solution? I posted the first hints in Kiteflying Around Containers – A Better Alternative to Course VMs? where I proved to myself I could fire up an IPython notebook server on top of a scientific distribution stack, and get the notebooks talking to a DBMS running in another container. (This was point and click easy, once you know what to click and what numbers to put where.)
The next step was to see if I could automate this in some way. As Kitematic is still short of a Windows client, and doesn’t (yet?) support Docker Compose, I thought I’d stick with vagrant (which I was using to build the original VM using a Puppet provision and puppet scripts for each app) and see if I could get it to provision a VM to run containerised apps using docker. There are still a few bits to do – most notably trying to get the original dockerised mongodb stuff working, checking the mongo link works, working out where to try to persist the DBMS data files (possibly in a shared folder on host?) in a way that doesn’t trash them each time a DBMS container is started, and probably a load of other stuff – but the initial baby steps seem promising…
In the original VM, I wanted to expose a terminal through the browser, which meant pfaffing around with tty.js and node.js. The latest Jupyter server includes the ability to launch a browser based shell client, which meant I could get rid of tty.js. However, moving the IPython notebook into a container means that the terminal presumably has scope only within that container, rather than having access to the base VM command line? For various reasons, I intend to run the IPython/Jupyter notebook server container as a privileged container, which means it can reach outside the container (I think? The reason? eg to fire up containers for the mongo replica set activity) but I’m not sure if this applies to the command line/terminal app too? Though offhand, I can’t think why we might want to provide students with access to the base VM command line?
Anyway, the local set-up looks like this…
A simple Vagrantfile, called using vagrant up or vagrant reload. I have extended vagrant using the vagrant-docker-compose plugin that supports Docker Compose (fig, as was) and lets me fire up wired-together container configurations from a single script:
# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"
  config.vm.network(:forwarded_port, guest: 9000, host: 9000)
  config.vm.network(:forwarded_port, guest: 8888, host: 8351, auto_correct: true)
  config.vm.provision :docker
  config.vm.provision :docker_compose, yml: "/vagrant/docker-compose.yml", rebuild: true, run: "always"
end
The YAML file identifies the containers I want to run and the composition rules between them:
ui:
  image: dockerui/dockerui
  ports:
    - "9000:9000"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  privileged: true

ipynb:
  build: ./tm351_scipystacknserver
  ports:
    - "8888:8888"
  volumes:
    - ./notebooks/:/notebooks/
  links:
    - devpostgres:postgres
  privileged: true

devpostgresdata:
  command: echo created
  image: busybox
  volumes:
    - /var/lib/postgresql/data

devpostgres:
  environment:
    - POSTGRES_PASSWORD=whatever
  image: postgres
  ports:
    - "5432:5432"
  volumes_from:
    - devpostgresdata
At the moment, Mongo is still missing and I haven’t properly worked out what to do with the PostgreSQL datastore – the idea is that students will be given a pre-populated, pre-indexed database, in part at least.
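From inside the notebook container, the devpostgres:postgres link should make the database reachable under the alias postgres, and (with this style of container linking) Docker also injects environment variables named after the alias. The sketch below builds a connection URL from those variables, falling back to the alias as hostname. The function name is hypothetical, and the default user and database name of postgres are assumptions about how the stock postgres image has been left configured:

```python
import os
from urllib.parse import quote_plus


def postgres_url(env=None, alias="postgres", user="postgres", dbname="postgres"):
    """Build a PostgreSQL connection URL for a linked container.

    Looks for the link-injected variables <ALIAS>_PORT_5432_TCP_ADDR,
    <ALIAS>_PORT_5432_TCP_PORT and <ALIAS>_ENV_POSTGRES_PASSWORD, and
    falls back to the link alias and the default port if they are absent.
    """
    env = os.environ if env is None else env
    prefix = alias.upper()
    host = env.get(prefix + "_PORT_5432_TCP_ADDR", alias)
    port = env.get(prefix + "_PORT_5432_TCP_PORT", "5432")
    password = env.get(prefix + "_ENV_POSTGRES_PASSWORD", "")
    # Escape the password so characters such as '@' do not break the URL.
    auth = user + ":" + quote_plus(password) if password else user
    return "postgresql://{}@{}:{}/{}".format(auth, host, port, dbname)
```

In a notebook, the resulting string could then be handed to something like sqlalchemy.create_engine(postgres_url()), which keeps hostnames and passwords out of the notebooks distributed to students.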
One additional component that sort of replaces the command line/terminal app requirement from the original VM is the dockerui app. This runs in its own container with privileged access to the docker environment and provides a simple control panel over all the containers.
What else? The notebook stuff has a shared notebooks directory with host, and is built locally (from a Dockerfile in the local tm351_scipystacknserver directory) on top of the ipython/scipystack image; extensions include some additional package installations (requiring both apt-get and pip installs) and copying across and running a custom IPython notebook template configuration.
FROM ipython/scipystack
MAINTAINER OU

ADD build_tm351_stack.sh /tmp/build_tm351_stack.sh
RUN bash /tmp/build_tm351_stack.sh

ADD ipynb_style /tmp/ipynb_style
ADD ipynb_custom.sh /tmp/ipynb_custom.sh
RUN bash /tmp/ipynb_custom.sh

## Extremely basic test of install
RUN python2 -c "import psycopg2, sqlalchemy"
RUN python3 -c "import psycopg2, sqlalchemy"

# Clean up from build
RUN rm -f /tmp/build_tm351_stack.sh
RUN rm -f /tmp/ipynb_custom.sh
RUN rm -f -r /tmp/ipynb_style

VOLUME /notebooks
WORKDIR /notebooks

EXPOSE 8888

ADD notebook.sh /
RUN chmod u+x /notebook.sh

CMD ["/notebook.sh"]
If we need to extend the PostgreSQL build, that can presumably be done using a Dockerfile that pulls in the core image and then runs an additional configuration script over it?
So where am I at? No f****g idea. I thought that between the data course and the new web apps course we might be able to explore some interesting models of using virtual machines (originally) and containers (more recently) in a distance education setting, that could cope with single user home use, computer training room/lab use, cloud use, but, as ever, I have spectacularly failed to demonstrate any sort of “academic leadership” in developing these ideas within the OU, or even getting much of a conversation going in the first place. Not in my skill set, I guess!;-) Though perhaps not in the institution’s interests either. Recamp. Retrench. Lockdown. As per some of the sentiments in Reflections on the Closure of Yahoo Pipes, perhaps? Don’t Play Here.