Confused Again About VM Ecology… I Blame Not Blogging

Via a cc’d tweet from Martin Hawksey, this lovely post from Tom Smith/@everythingabili (who has the best ever twitter bio strapline) on How I Learn ( And What I’m Learning ).

I like to think that I used to write blog posts that had the same sort of sense as that post…

…but for the last few months at least, I don’t think I have.

“Working” for once – starting production on an OU course (TM351, due out October 2015 (sic); I’m gonna be late on the first draft of the 7 weeks of the course I’m responsible for: it’s due in a little over a fortnight…), and also helping out on an investigative project the School of Data is partnering on – has meant that most of the learnings and failings that I used to blog about have been turned inward to emails (which are even more poorly structured in terms of note-taking than this blog is), if at all.

Which is a shame and makes me not happy.

Reading through completed academic papers, making useful (I hope) use of them in the course draft, has been quite fun in part – and made me regret at times not writing up work of my own in a self-contained, peer reviewed way over the last decade or so; getting stuff “into the record”, properly citable, and coherent enough to actually be cited. But as you pick away at the papers, you know they’re a revisionist telling, a straightforward narrative of how the pieces fit together and in which nothing went wrong along the way; (you also know that the closer you get to trying to replicate a paper, the harder it is to find the missing pieces (process, trick, equation or insight) that actually make it work; remember school maths, or physics, and the textbook that goes from one equation to the next with a “hence”, but there’s no way in hell you can figure out how to make that step and you know you’ll be stuck when that bit comes up in the exam…?! That. Or worse. When you bang your head against a wall trying to get something to work, contort your mental models to make it work, sort of, then see the errata list correcting that thing. That, too.)

On the other hand, this blog is not coherent, shapes no whole, but is full of hence steps. Disjointed ones, admittedly. But they keep track of all the bits I tried at and failed at and worked around, and they keep on getting me out of holes… Like the emails won’t. Lost. Wasted effort because the learning fumblings that are OUseful learning fumblings are lost and locked up in email hell.

It makes me very not happy.

So that, by way of intro, to this: a quick catchup follow-up to Cursory Thoughts on Virtual Machines in Distance Education Courses and Doodling With IPython Notebooks for Education, a partial remembering of the various shades of hell associated with them and trying to share them.

Here’s what I think I now want to do (whether or not it’s the right thing I’m not sure).

  • generate a script that will build a VM. We’ve opted for Virtualbox as the VM container. The VM will need to contain: pandas; IPython notebook (the course team, CT, want it to run Python 3.3. I’ve lost track of how many hours I’ve spent trying and failing to get Python libraries I think we need to run under Python 3.3; wasted effort; I should have settled with Python 2.7 and then revisited 3.3 several months hence; the 2.7 to 3.3 tweaks to any code we write for the course should be manageable in migration terms. Pratting around on libraries that I’m guessing will get patched as native distributions move to 3.3 by default but don’t work yet is wasted effort. October. 2015. First presentation.); PostgreSQL (perhaps with some extensions); mongodb; ipythonblocks; blockdiag; I came across shellinabox today and wonder if we should run that; OpenRefine (CT against this – I think it’s good for developing mental models); python-nvd3; folium; a ggplot port to python (CT take – too much new stuff; my take, we should go as high up the stack as we can in terms of the grammar of the calling functions); I think we should run R and RStudio too to make for a really useful VM, making the point that the language doesn’t matter and we use whatever’s appropriate to get things done, but I don’t think anyone else does. if. Which computer language is that from then? for. Or that one? How about in? print? Cars are for driving. Mine’s blue. I have no idea how it works. Can I drive your car? The red one. With the left-hand drive.
  • access the services running on the headless VM via a browser on host. I think we should talk to the databases using Python, but I may be in the minority. We may need something more graphical to talk to postgresql. If we do, I would argue it should be a browser based client – if it’s an app, we’re moving functionality provision outside of the VM.
  • use the script to build machines with the same package config; CT seem to prefer a single build on a 32 bit machine. I think we should support 64 bit as well. And deployment on at least one cloud service – I’d go for Amazon, but that’s mainly because it’s the only one I’ve tried. If we could support more, even better.
  • as far as maintenance goes, I wrote the vagrant script to update libraries whenever the provisioner is run (which is quite a lot at the mo as I keep finding new things to add to the box!;-) This may or may not be sensible for student use. If there is a bug in a library, an update could help. If there is a security patch to the kernel, we should be updating as good citizens. The current plan is to ship a built box (which I think would have to go on to a USB stick – we can’t rely on folk having computers with a DVD any more, and a 1.5GB download seems to be proving unreliable without a proper download manager). As it is, students will have to download virtualbox and vagrant, and install those themselves. (Unless we can put installers for them on a USB stick too.) If we do ship a built box, we need to think of some way of kickstarting the services and perhaps rebooting the machine (and then kickstarting the services). There is a separate question of whether we should also be able to update config scripts during presentation. This would presumably have to be done on the host. One way might be to put config scripts on a git server then use git to keep the config scripts on the students’ host machine up to date, but that would probably also require them to install a git commandline tool, even if we automated the rest. Did I say this all has to run cross platform? Students may be on Windows (likely?), Mac or Linux. I think the course should be doable, if painfully, via a tablet, which means the VM needs the cloud hosted option. If we could also find a way to help students configure their host, whatever platform it’s on, so that they could access services from the VM running on it via their tablet, so much the better.
  • files need to be shared between VM and host. This raises an interesting issue for a cloud hosted VM. We either need to find a way to synch files between desktop and cloud VM, persist state on the cloud host so that the VM can synch to it, or pop Dropbox into the cloud VM (though there would then be a synch delay, as there would with a desktop synch). I favour persisting on the cloud, though there is then the question of the student who is working on a home machine one day and a cloud machine the next.
  • Starting and stopping services: students need to be able to start and stop services running on the VM without having to ssh in. One-click easy. A dashboard with buttons that show if a service is running or not, click a button to toggle the run state of the service. No idea how to do this.
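
One possible starting point for the start/stop buttons, as a rough sketch only: a host-side Python helper that wraps `vagrant ssh -c` so that whatever sits behind a dashboard button never needs the student to open an SSH session themselves. The service names (`postgresql`, `mongodb`), the use of Ubuntu's `service` command inside the VM and the helper names are all assumptions for illustration, not what the current build does.

```python
import subprocess

# Hypothetical service names; the real ones depend on how the VM is provisioned.
SERVICES = ("postgresql", "mongodb")
ACTIONS = ("start", "stop", "restart", "status")


def service_command(service, action):
    """Build the host-side command that runs `service <name> <action>` in the VM."""
    if service not in SERVICES:
        raise ValueError("unknown service: {}".format(service))
    if action not in ACTIONS:
        raise ValueError("unknown action: {}".format(action))
    # `vagrant ssh -c` runs a single command in the guest and then returns.
    return ["vagrant", "ssh", "-c", "sudo service {} {}".format(service, action)]


def run_service_action(service, action, vagrant_dir="."):
    """Run the command from the directory holding the Vagrantfile.

    Returns the exit code reported by vagrant.
    """
    return subprocess.call(service_command(service, action), cwd=vagrant_dir)
```

A button handler could then call something like `run_service_action("mongodb", "restart", vagrant_dir="path/to/vm")`, where the path is a placeholder, and colour the button according to the exit code of a follow-up `status` call. It doesn't solve the dashboard itself, and it only covers the desktop VM, not a cloud hosted one.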

Here’s the approach I’ve taken:

  • reusing DataminerUK’s infinite-interns model as a starting point, I’m using vagrant to build and provision a VM using puppet. At the moment I have duplicate setups for two different Linux base boxes (precise64 and raring64. The plan is to move to the latest Ubuntu LTS.) I really should have a single setup with the different machine setups called by name from a single Vagrantfile. I think.
  • The puppet provisioner builds the box from a minimal base and starts the services running. It’s aggressive on updates. The precise64 box is running python 2.7 and the raring64 box 3.3. Getting pip3 running in the raring box was a pain, and I don’t know how to tell puppet to use the pip3 thing I eventually installed to update. At the moment I fudge with:
    exec { "pip3-update":
      command => "/usr/local/bin/pip3 install --upgrade pip",
    }

    but it breaks because I’m not convinced that is always the right path (I’d like to hedge on /usr/bin:/usr/local/bin), or that pip3 is actually installed when I try to exec it… I think what I really want to do is something like
    package {
    [
    JUST UPGRADE YOURSELF, PLEASE
    ]: ensure => latest,
    provider => 'pip3';
    }

    with an additional dependency check (require =>) that pip3 has been installed first and, for all the other pip3 installs, that pip3 has been upgraded first.
  • The IPython notebook is started by a config shell script called from puppet. I think I’m also using a config script to set up a user and test tables in Postgres (though I plan to move to the puppet extension as soon as I can get round to reading the docs/finding out how to do it).
  • There are times when a server needs restarting. At the moment I have to run vagrant provision to do that – or even vagrant halt;vagrant up, which means it also runs the updates. It’d be nice to just be able to run the bits that restart the services (the DBMSs, IPython notebook etc) without doing any of the package updates, installs, checks etc.
  • We need a tool to check whether services are running on the necessary ports to help debugging without requiring a user to SSH into the VM. For now I’ve fixed on default ports, but if a default port is already in use we really need to switch to a free port and then somehow tell the IPython notebook scripts which port each service is running on. With vagrant letting you run a VM from within a particular directory, being able to know what VMs are being run and from where, wherever vagrant on host started them, would be useful.
  • I don’t use a configurator for the postgres db (it needs seeding with some example tables) but should do – on my to do list is to look at https://github.com/puppetlabs/puppetlabs-postgresql . Similarly for mongo db – and perhaps https://github.com/puppetlabs/puppetlabs-mongodb
  • To make use of python-nvd3, the suggested route is to use bower. I got the npm package manager to work but have failed to find a way of installing any actual packages [issue].
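
For the port checking tool mentioned above, here is a minimal sketch in Python of the sort of thing I have in mind, run on the host. It assumes the VM forwards each service to the same port number on localhost, and that the services sit on their usual defaults (8888 for the IPython notebook, 5432 for PostgreSQL, 27017 for MongoDB); the function names are made up for illustration.

```python
import socket

# Assumed default ports, forwarded unchanged from the VM to the host.
DEFAULT_PORTS = {"ipython-notebook": 8888, "postgresql": 5432, "mongodb": 27017}


def is_listening(port, host="127.0.0.1", timeout=1.0):
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an error code otherwise.
        return sock.connect_ex((host, port)) == 0


def find_free_port(host="127.0.0.1"):
    """Ask the operating system for a TCP port that is unused right now."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.bind((host, 0))  # port 0 means "pick any free port"
        return sock.getsockname()[1]


def report(ports=DEFAULT_PORTS):
    """Map each service name to whether its port is answering."""
    return {name: is_listening(port) for name, port in ports.items()}


if __name__ == "__main__":
    for name, up in sorted(report().items()):
        print("{:<18} {}".format(name, "up" if up else "not responding"))
```

Note that an open port only says that something is listening, not that it is the service we expect, and a port returned by `find_free_port` could in principle be grabbed by another process before we use it. Passing the chosen ports on to the notebook scripts is still the unsolved bit.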

Dealing with issues to date – aside from things like port clashes and all manner of f**k ups because I distributed a README with an error in it and folk tried to follow it rather than patches posted elsewhere – has been impeded by not having a good way of logging and sharing errors. OU network issues have also contributed to the fun. I always avoid the OU staff network myself, but nothing seems to work on it. I suspect this is a proxy issue, but I’m unwilling to invest any time exploring it or how to config the VM to cope (no-one else has offered to go down this rat hole). Poxy proxies could well be an issue for some students, but I’m not sure what the best way forward is. Cloud hosted VMs?!

I also had a problem on OU eduroam – the mongodb install wants to get something signed from a keyserver before it will go ahead, but OU eduroam (the network I always use) seems to block the keyserver. Tracking that down wasted a good hour.

Here are some other things I’ve heard about:

https://github.com/psychemedia/notebookcloud This is cloned from https://github.com/carlsmith who appears to have taken his repo – and the app – down? It provided a dashboard for firing up notebook servers on Amazon cloud. If I hadn’t been ‘working’ I’d have blogged screenshots and the workflow. As it is, all I have are vague memories of how it worked and what it did and the ideas that sprung off of having an artefact to talk around. [Hmm… app seems to have come back up – maybe I caught it at a bad time… https://notebookcloud.appspot.com/login ]

– provisioning things: chef, vagrant, puppet, docker.

What should I be using for what?

I thought about different VMs for different services, but that adds too much VM weight and requires networking between the VMs, which could lead to “support issues”. Would docker help here? A base VM built from vagrant and puppet, then docker to put additional machines on top.

What I want is students to be able to:

– install minimum number of third party apps on their desktop (currently virtualbox and vagrant)
– click one button to get their VM. My feeling is we should have a largely prebuilt box on a USB stick they can use as a base box, then do a top up build and provision. I suspect CT would like one click somewhere to fire up a machine, get services running, and open a tab to the IPython notebook in their browser, maybe a status button somewhere, a single button to restart any services that have stopped and options to suspend or shutdown machines. In terms of student workflow, I think suspending and resuming machines (if services can resume appropriately) would be the neatest workflow. Note: the course runs over 9 months…
– be able to access files on host that are used in the VM. If they are using multiple VMs (eg on cloud and desktop), we need to find a sensible way of synching notebooks and data/database state across those machines, which could be tricky at least as far as database state goes.
– if a student is not using postgresql or mongo – and they won’t for the first 8 weeks of the course – it could be useful to run the VM without those services running (perhaps aside from a quick self-test in week 1 so we can check out any issues as early as possible and give ourselves a week or two to come up with any patches before those apps are used in anger). So maybe a control panel to fire up the VM and the services you want to run. Yes mongo, no postgresql. No DB at all. And so on. Would docker help here? Decide on host somehow which services to run, fire up the VM, then load in and start up the required services. Next session, change which services are running in the VM?

All in all, I reckon I’m between 20 and 40% there (further along in terms of time?) but not sure how much further I can push it to the desired level of robustness without learning how to do this stuff a bit more properly… I’m also not really sure what’s practically and reliably possible and what’s best for what. We have to maximise the robustness of stuff ‘just working’ and minimise support issues because students are at a distance and we don’t know what platform they’re on. I think if I started from scratch and rewrote the scripts now they’d be a lot clearer, but I know that’d take half a day and the functional return – for now – I think would be minimal.

That said, I’ve done a fair amount of learning, though large chunks of it have been lost to email and not blogging. Which is a shame. Must do better. Must take public notes.

Author: Tony Hirst