New Ed Tech Toys for TM351…

I did a presentation earlier this week at the internal OU CALRG conference about some of my current thinking around new edtech toys for “the data course”, TM351.

Annotated slides here: Imagining TM351: from virtual machines to notebooks.

Having presented it, the slides need reordering, a bit more emphasis needs to be placed on the role human readable text can play in notebooks (h/t Alistair Willis for that observation), and I need to do quite a bit more thinking about the spreadsheet-notebook comparison.

Also to do are more thoughts on the “(non)linearity”/“serialisation” aspects of authoring, reading and executing/working through notebooks that I touched on in another talk from last week: From storymaps to notebooks: do your computing one step at a time.

Rethinking the TM351 Virtual Machine Again, Again…

It’s getting to that time when we need to freeze the virtual machine build we’re going to use for the new (postponed) data course, which should hopefully go live to students in February, 2016, and I’ve been having a rethink about how to put it together.

The story so far has been documented in several blog posts, and charts my learning journey from knowing nothing about virtual machines (not sure why I was given the task of putting it together?!) to knowing how little I know about Linux administration, PostgreSQL, MongoDB, Linux networking, virtual machines and virtualisation (which is to say, knowing I don’t know enough to do any of this stuff properly…;-)

The original plan was to put everything into a single VM and wire all the bits together. One of the activities needed several MongoDB instances running as a replica set, and I opted to use docker containers to fire those up.

Over the last few months, I started to wonder whether we should containerise everything separately, then deploy compositions of containers. The rationale behind this approach is that it means we could make use of a single VM to host applications for several users if we get as far as cloud hosting services/applications for our students. It also means students can start, stop or “reinstall” particular applications in isolation from the other VM applications they’re running.

I think I’ve got this working in part now, though it’s still very much tied to the single user – I’m doing things with permissions that would never be allowed (and that would possibly break things..) if we were running multiple users in the same VM.

So what’s the solution? I posted the first hints in Kiteflying Around Containers – A Better Alternative to Course VMs? where I proved to myself I could fire up an IPython notebook server on top of a scientific python distribution stack, and get the notebooks talking to a DBMS running in another container. (This was point and click easy, once you know what to click and what numbers to put where.)

The next step was to see if I could automate this in some way. As Kitematic is still short of a Windows client, and doesn’t (yet?) support Docker Compose, I thought I’d stick with vagrant (which I was using to build the original VM, with a Puppet provisioner and puppet scripts for each app) and see if I could get it to provision a VM to run containerised apps using docker. There are still a few bits to do – most notably trying to get the original dockerised mongodb stuff working, checking the mongo link works, working out where to try to persist the DBMS data files (possibly in a shared folder on host?) in a way that doesn’t trash them each time a DBMS container is started, and probably a load of other stuff – but the initial baby steps seem promising…

In the original VM, I wanted to expose a terminal through the browser, which meant pfaffing around with tty.js and node.js. The latest Jupyter server includes the ability to launch a browser based shell client, which meant I could get rid of tty.js. However, moving the IPython notebook into a container means that the terminal presumably has scope only within that container, rather than having access to the base VM command line? For various reasons (eg to fire up containers for the mongo replica set activity), I intend to run the IPython/Jupyter notebook server container as a privileged container, which I think means it can reach outside its own container, but I’m not sure if this applies to the command line/terminal app too? Though offhand, I can’t think why we might want to provide students with access to the base VM command line anyway?

Anyway, the local set-up looks like this…

A simple Vagrantfile, called using vagrant up or vagrant reload. I have extended vagrant using the vagrant-docker-compose plugin that supports Docker Compose (fig, as was) and lets me fire up wired-together container configurations from a single script:

# -*- mode: ruby -*-
# vi: set ft=ruby :

Vagrant.configure("2") do |config|
  config.vm.box = "ubuntu/trusty64"

  config.vm.network(:forwarded_port, guest: 9000, host: 9000)
  config.vm.network(:forwarded_port, guest: 8888, host: 8351, auto_correct: true)

  config.vm.provision :docker
  config.vm.provision :docker_compose, yml: "/vagrant/docker-compose.yml", rebuild: true, run: "always"
end
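
(For the record, the plugin is installed on the host using the standard vagrant plugin mechanism – vagrant plugin install vagrant-docker-compose – if memory serves.)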

The YAML file identifies the containers I want to run and the composition rules between them:

ui:
  image: dockerui/dockerui
  ports:
    - "9000:9000"
  volumes:
    - /var/run/docker.sock:/var/run/docker.sock
  privileged: true

ipynb:
  build: ./tm351_scipystacknserver
  ports:
    - "8888:8888"
  volumes:
    - ./notebooks/:/notebooks/
  links:
    - devpostgres:postgres
  privileged: true

devpostgresdata:
  command: echo created
  image: busybox
  volumes:
    - /var/lib/postgresql/data

devpostgres:
  environment:
    - POSTGRES_PASSWORD=whatever
  image: postgres
  ports:
    - "5432:5432"
  volumes_from:
    - devpostgresdata

At the moment, Mongo is still missing and I haven’t properly worked out what to do with the PostgreSQL datastore – the idea is that students will be given a pre-populated, pre-indexed database, in part at least.
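
Note that because the ipynb container is linked to devpostgres under the alias postgres, notebook code should be able to reach the DBMS by hostname rather than by IP address – a minimal sketch (the password being the POSTGRES_PASSWORD value set in the YAML file):

#Connect from notebook code to the linked PostgreSQL container;
#the 'postgres' hostname comes from the links: entry in the compose file
import psycopg2

con = psycopg2.connect(host='postgres', port=5432,
                       user='postgres', password='whatever')
cur = con.cursor()
cur.execute('SELECT version()')
print(cur.fetchone())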

One additional component that sort of replaces the command line/terminal app requirement from the original VM is the dockerui app. This runs in its own container with privileged access to the docker environment, and provides a simple control panel over all the containers:

DockerUI

What else? The notebook container shares a notebooks directory with the host, and is built locally (from a Dockerfile in the local tm351_scipystacknserver directory) on top of the ipython/scipystack image; extensions include some additional package installations (requiring both apt-get and pip installs) and copying across and running a custom IPython notebook template configuration.

FROM ipython/scipystack

MAINTAINER OU

ADD build_tm351_stack.sh /tmp/build_tm351_stack.sh
RUN bash /tmp/build_tm351_stack.sh


ADD ipynb_style /tmp/ipynb_style
ADD ipynb_custom.sh /tmp/ipynb_custom.sh
RUN bash /tmp/ipynb_custom.sh


## Extremely basic test of install
RUN python2 -c "import psycopg2, sqlalchemy"
RUN python3 -c "import psycopg2, sqlalchemy"

# Clean up from build
RUN rm -f /tmp/build_tm351_stack.sh
RUN rm -f /tmp/ipynb_custom.sh
RUN rm -f -r /tmp/ipynb_style

VOLUME /notebooks
WORKDIR /notebooks

EXPOSE 8888

ADD notebook.sh /
RUN chmod u+x /notebook.sh

CMD ["/notebook.sh"]

demo-tm351

If we need to extend the PostgreSQL build, that can presumably be done using a Dockerfile that pulls in the core image and then runs an additional configuration script over it?
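
Something along these lines, presumably – a sketch only (seed_tm351_tables.sql is a made up name for a script that would create and populate the pre-indexed course tables):

FROM postgres

# Sketch: the official postgres image runs scripts added to this
# directory when the database is first initialised
ADD seed_tm351_tables.sql /docker-entrypoint-initdb.d/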

So where am I at? No f****g idea. I thought that between the data course and the new web apps course we might be able to explore some interesting models of using virtual machines (originally) and containers (more recently) in a distance education setting, that could cope with single user home use, computer training room/lab use, cloud use, but, as ever, I have spectacularly failed to demonstrate any sort of “academic leadership” in developing these ideas within the OU, or even getting much of a conversation going in the first place. Not in my skill set, I guess!;-) Though perhaps not in the institution’s interests either. Recamp. Retrench. Lockdown. As per some of the sentiments in Reflections on the Closure of Yahoo Pipes, perhaps? Don’t Play Here.

A Peek Inside the TM351 VM

So this is how I currently think of the TM351 VM:

OU-VM-June2015Review_pptx

What would be nice would be a drag’n’drop tool to let me draw pictures like that which would then generate the build scripts… (a docker compose script, or set of puppet scripts, for the architectural bits on the left, and a Vagrantfile to set up the port forwarding, for example).

For docker, I wouldn’t have thought that would be too hard – a docker compose file could describe most of that picture, right? Not sure how fiddly it would be for a more traditional VM, though, depending on how it was put together?

First Attempt at Running the TM351 VM as an AMI on Amazon Web Services

One of the things that’s been on my to do list for ages is trying to get a version of the TM351 virtual machine (VM) up and running on Amazon Web Services (AWS) as an Amazon Machine Instance (AMI). This would allow students who are having trouble running the VM on their own computer to access the services running in the cloud.

(Obviously, it would be preferable if we could offer such a service via OU operated servers, but I can’t do politics well enough, and don’t have the mentality to attend enough of the necessary say-the-same-thing-again-again meetings, to make that sort of thing happen.)

So… a first attempt is up on the eu-west-1 region in all its insecure glory: TM351 AMI v1. The security model is by obscurity as much as anything – there’s no model for setting separate passwords for separate students, for example, or checking back against an OU auth layer. And I suspect everything runs as root…

(One of the things we have noticed in (brief) testing is that the Getting Started instructions don’t work inside the OU, at least if you try to limit access to your (supposed) IP address. Reminds me of when we gave up trying to build the OU VM from machines on the OU network, because solving proxy and blocked port issues was an irrelevant problem to have to worry about when working from the outside…)

Open Refine doesn’t seem to want to run with the other services in the free tier micro (1GB) machine instance, but at 2GB everything seems okay. (I don’t know if possible race conditions in starting services mean that Open Refine could start and then block the Jupyter service’s request for resource. I need to do an Apollo 13 style startup sequence exploration to see if all services can run in 1GB, I guess!) One thing I’ve added to the to do list is to split things out into separate AMIs that will work on the 1GB free tier machines. I also want to check that I can provision the AMI from Vagrant, so students could then launch a local VM or an Amazon instance that way, just by changing the vagrant provider. (Shared folders/volumes might get a bit messed up in that case, though?)
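
For the record, the vagrant-aws plugin looks like it should support that sort of provider switch. A sketch only – untested, with the AMI id, instance type and credential environment variable names all placeholders:

# Sketch of an AWS provider block using the vagrant-aws plugin;
# the AMI id, region, instance type and credentials are placeholders
Vagrant.configure("2") do |config|
  config.vm.box = "dummy"

  config.vm.provider :aws do |aws, override|
    aws.access_key_id = ENV['AWS_ACCESS_KEY']
    aws.secret_access_key = ENV['AWS_SECRET_KEY']
    aws.ami = "ami-xxxxxxxx"
    aws.region = "eu-west-1"
    aws.instance_type = "t2.small"

    override.ssh.username = "ubuntu"
    override.ssh.private_key_path = ENV['AWS_KEYPAIR_PATH']
  end
end

Something like vagrant up --provider=aws should then launch the instance (with shared folders rsynced up to the instance rather than properly shared, if I understand the plugin docs correctly, which is presumably where the messed up volumes would come in).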

If services can run one at a time in the 1GB machines, it’d be nice to provide a simple dashboard to start and stop the services to make that easier to manage. Something that looks a bit like this, for example, exposed via an authenticated web page:

This needn’t be too complex – I had in mind a simple Python web app that could run under nginx (which currently provides a simple authentication layer for Open Refine to sit behind) and then just runs simple systemctl start, stop and restart commands on the appropriate service.

#fragment...
#restart a named service; assumes the web app runs with
#sufficient privilege to call systemctl
import os
os.system('systemctl restart jupyter.service')
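
To make that concrete, here’s a minimal sketch of the sort of app I have in mind – assuming Flask is available, that the app runs with enough privilege to call systemctl, and that it sits behind the nginx authentication layer; the service names are illustrative:

#Sketch: a minimal service dashboard - assumes Flask is installed,
#sufficient privilege to call systemctl, and nginx handling auth upstream;
#service names are illustrative
from subprocess import call, check_output, CalledProcessError
from flask import Flask, redirect

app = Flask(__name__)

SERVICES = ['jupyter', 'postgresql', 'mongod', 'openrefine']

def status(service):
    #systemctl is-active exits non-zero if the service isn't running
    try:
        return check_output(['systemctl', 'is-active',
                             service + '.service']).decode().strip()
    except CalledProcessError:
        return 'inactive'

@app.route('/')
def dashboard():
    #One row per service: current status plus a restart link
    rows = ['<li>{0}: {1} <a href="/restart/{0}">restart</a></li>'.format(
        s, status(s)) for s in SERVICES]
    return '<ul>' + ''.join(rows) + '</ul>'

@app.route('/restart/<service>')
def restart(service):
    if service in SERVICES:  #only act on known services
        call(['systemctl', 'restart', service + '.service'])
    return redirect('/')

if __name__ == '__main__':
    app.run(host='0.0.0.0', port=8080)

Polling systemctl is-active on each page load is crude, but it sidesteps the heartbeat question for a first pass.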

I’m not sure how the status should be updated (based on whether a service is running or not), or on what heartbeat it should update. There may be better ways, of course, in which case please let me know via the comments:-)

I did have a quick look round for examples, but the dashboards/monitoring tools that do exist, such as pydash, are far more elaborate than what I had in mind. (If you know of a simple example to do the above, or can knock one up for me, please let me know via the comments. And the simpler the better ;-)

If we are to start exploring the use of browser accessed applications running inside user-managed VMs, this sort of simple application could be really handy… (Another approach would be to use a VM running docker, and then have a container manager running, such as portainer.)

Progress Tracking Google Docs as Tasks?

As part of a new course I’m working on, the course team has been making use of shared Google docs for working up the course proposal and “D0” (zero’th draft; key topics to be covered in each of the weekly sessions). Although the course production hasn’t been approved yet, we’ve started drafting the actual course materials, with an agreement to share them for comment via Google docs.

The approach I’ve taken is to create a folder shared with the rest of the course team, and set up documents for each of the weekly sessions I’ve taken the lead on.

TM351 files

The documents in this folder are all available to other members of the course team – for reference and/or comment – at any time, and represent the “live”/most current version of each document I’m working on. I suspect that others in the course team may take a more cautious approach, only sharing a doc when it’s in a suitable state for handover – or at least, comment – but that’s fine too. My docs can of course be used that way as well – no-one has to look at them until I do “hand them over” for full comment at the end of the first draft stage.

But what if others, such as the course team chair or course manager, do want to keep check on progress over the coming weeks?

The file listing shown above doesn’t give a lot away about the state of each document – not even a file size, only when it was last worked on. So it struck me that it might be useful to have a visual indicator (such as a horizontal progress bar) about the progress on each document, so that someone looking at the listing would know whether there was any point opening a document to have a look inside at all…

..because at the current time, a lot of the docs are just stubs, identifying tasks to be done.

Progress could be measured by proxy indicators, such as file size, “page count” equivalent, or line count. In these cases, the progress meter could be updated automatically. Additional insight could be provided by associating a target line count or page length metadata element, providing additional feedback to the author about progress with respect to that target. If a document exceeds the planned length, the progress meter should carry on going, possibly with a different colour denoting the overrun.

There are a couple of problems at least with this approach – documents that are being worked on may blend scruffy working notes along with actual “finished” text; several versions of the same paragraph may exist as authors try out different approaches, all adding to the line count. Long copied chunks from other sources may be in the text as working references, and so on.

So how about an extra piece of metadata for docs tagged as “task” type, in which a user can set a quick progress percentage estimate (a slider widget would make this easy to update) that is displayed in a bar on the file listing. Anyone checking the folder could then – at a glance – see which docs were worth looking at based on progress within the document-as-task. (Of course, having metadata available also opens up the possibility of additional mission-creep features – rulesets for generating alerts when a doc hits a particular percentage completion, for example.)
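
(For what it’s worth, if the docs had a container bound Apps Script attached, a progress estimate could be stashed as a document property – a sketch only, along the lines of the metadata tinkering described elsewhere in these posts:)

//Sketch: store and retrieve a progress estimate as a document property
//from a container bound Apps Script
function setProgress(percent) {
  PropertiesService.getDocumentProperties().setProperty('progress', String(percent));
}

function getProgress() {
  return PropertiesService.getDocumentProperties().getProperty('progress') || '0';
}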

I’m not looking for more project management tools to take time away from a task, but in this case I think the simple addition of a “progress” metadata element could weave an element of project management support into this sort of workflow? (Changing the title of the doc would be another way – eg adding “(20% done)” to the title…)

Thinks: hmm, I’m procrastinating, aren’t I? I should really be working on one of those docs…;-)

Confused Again About VM Ecology… I Blame Not Blogging

Via a cc’d tweet from Martin Hawksey, this lovely post from Tom Smith/@everythingabili (who has the best ever twitter bio strapline) on How I Learn (And What I’m Learning).

I like to think that I used to write blog posts that had the same sort of sense as that post…

…but for the last few months at least, I don’t think I have.

“Working” for once – starting production on an OU course (TM351, due out October 2015; sic: I’m gonna be late on the first draft of the 7 weeks of the course I’m responsible for – it’s due in a little over a fortnight…), and also helping out on an investigative project the School of Data is partnering on – has meant that most of the learnings and failings that I used to blog about have been turned inward to emails (which are even more poorly structured in terms of note-taking than this blog is), if at all.

Which is a shame and makes me not happy.

Reading through completed academic papers, making useful (I hope) use of them in the course draft, has been quite fun in part – and made me regret at times not writing up work of my own in a self-contained, peer reviewed way over the last decade or so; getting stuff “into the record”, properly citable, and coherent enough to actually be cited. But as you pick away at the papers, you know they’re a revisionist telling, a straightforward narrative of how the pieces fit together and in which nothing went wrong along the way; (you also know that the closer you get to trying to replicate a paper, the harder it is to find the missing pieces (process, trick, equation or insight) that actually make it work; remember school maths, or physics, and the textbook that goes from one equation to the next with a “hence”, but there’s no way in hell you can figure out how to make that step and you know you’ll be stuck when that bit comes up in the exam…?! That. Or worse. When you bang your head against a wall trying to get something to work, contort your mental models to make it work, sort of, then see the errata list correcting that thing. That, too.)

On the other hand, this blog is not coherent, shapes no whole, but is full of hence steps. Disjointed ones, admittedly. But they keep a track of all the bits I tried and failed at and worked around, and they keep on getting me out of holes… Like the emails won’t. Lost. Wasted effort, because the learning fumblings that are OUseful learning fumblings are lost and locked up in email hell.

It makes me very not happy.

So that, by way of intro, to this: a quick catchup follow-up to Cursory Thoughts on Virtual Machines in Distance Education Courses and Doodling With IPython Notebooks for Education, a partial remembering of the various shades of hell associated with them and trying to share them.

Here’s what I think I now want to do (whether or not it’s the right thing I’m not sure).

  • generate a script that will build a VM. We’ve opted for Virtualbox as the VM container. The VM will need to contain: pandas; IPython notebook (the course team want it to run Python 3.3. I’ve lost track of how many hours I’ve spent trying and failing to get Python libraries I think we need to run under Python 3.3; wasted effort; I should have settled with Python 2.7 and then revisited 3.3 several months hence; the 2.7→3.3 tweaks to any code we write for the course should be manageable in migration terms. Pratting around on libraries that I’m guessing will get patched as native distributions move to 3.3 by default but don’t work yet is wasted effort. October. 2015. First presentation.); PostgreSQL (perhaps with some extensions); mongodb; ipythonblocks; blockdiag; I came across shellinabox today and wonder if we should run that; OpenRefine (CT against this – I think it’s good for developing mental models); python-nvd3; folium; a ggplot port to python; (CT take – too much new stuff; my take, we should go as high up the stack as we can in terms of the grammar of the calling functions); I think we should run R and RStudio too to make for a really useful VM, making the point that the language doesn’t matter and we use whatever’s appropriate to get things done, but I don’t think anyone else does. if. Which computer language is that from then? for. Or that one? How about in? print? Cars are for driving. Mine’s blue. I have no idea how it works. Can I drive your car? The red one. With the left-hand drive.
  • access the services running on the headless VM via a browser on host. I think we should talk to the databases using Python, but I may be in the minority. We may need something more graphical to talk to postgresql. If we do, I would argue it should be a browser based client – if it’s an app, we’re moving functionality provision outside of the VM.
  • use the script to build machines with the same package config; CT seem to prefer a single build on a 32 bit machine. I think we should support 64 bit as well. And deployment on at least one cloud service – I’d go for Amazon, but that’s mainly because it’s the only one I’ve tried. If we could support more, even better.
  • as far as maintenance goes, I wrote the vagrant script to update libraries whenever the provisioner is run (which is quite a lot at the mo as I keep finding new things to add to the box!;-) This may or may not be sensible for student use. If there is a bug in a library, an update could help. If there is a security patch to the kernel, we should be updating as good citizens. The current plan is to ship a built box (which I think would have to go on to a USB stick – we can’t rely on folk having computers with a DVD drive any more, and a 1.5GB download seems to be proving unreliable without a proper download manager). As it is, students will have to download virtualbox and vagrant, and install those themselves. (Unless we can put installers for them on a USB stick too.) If we do ship a built box, we need to think of some way of kickstarting the services and perhaps rebooting the machine (and then kickstarting the services). There is a separate question of whether we should also be able to update config scripts during presentation. This would presumably have to be done on the host. One way might be to put config scripts on a git server then use git to keep the config scripts on the students’ host machine up to date, but that would probably also require them to install a git commandline tool, even if we automated the rest. Did I say this all has to run cross platform? Students may be on Windows (likely?), Mac or Linux. I think the course should be doable, if painfully, via a tablet, which means the VM needs the cloud hosted option. If we could also find a way to help students configure their whatever-platform host so that they could access services from the VM running on it via their tablet, so much the better.
  • files need to be shared between VM and host. This raises an interesting issue for a cloud hosted VM. Either we need to find a way to synch files between desktop and cloud VM, persist state on the cloud host so that the VM can synch to it, or pop Dropbox into the cloud VM (though there would then be a synch delay, as there would with a desktop synch). I favour persisting on the cloud, though there is then the question of the student who is working on a home machine one day and a cloud machine the next.
  • Starting and stopping services: students need to be able to start and stop services running on the VM without having to ssh in. One click easy. A dashboard with buttons that show if a service is running or not; click a button to toggle the run state of the service. No idea how to do this, though a simple port check along the lines of the sketch below might be a start.
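
By way of illustration, the kind of check such a dashboard might sit on – just testing whether anything is listening on the expected ports (a sketch; the service/port pairings are illustrative):

#Sketch: check whether anything is listening on the expected service ports;
#service/port pairings are illustrative
import socket

SERVICES = {'ipython notebook': 8888, 'postgresql': 5432, 'mongodb': 27017}

def is_up(host, port, timeout=2):
    #Return True if something accepts TCP connections on host:port
    try:
        socket.create_connection((host, port), timeout=timeout).close()
        return True
    except OSError:
        return False

for name, port in SERVICES.items():
    print(name, 'running' if is_up('127.0.0.1', port) else 'not reachable')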

Here’s the approach I’ve taken:

  • reusing DataminerUK’s infinite-interns model as a starting point, I’m using vagrant to build and provision a VM using puppet. At the moment I have duplicate setups for two different Linux base boxes (precise64 and raring64. The plan is to move to the latest Ubuntu LTS.) I really should have a single setup with the different machine setups called by name from a single Vagrantfile. I think.
  • The puppet provisioner builds the box from a minimal base and starts the services running. It’s aggressive on updates. The precise64 box is running python 2.7 and the raring64 box 3.3. Getting pip3 running in the raring box was a pain, and I don’t know how to tell puppet to use the pip3 thing I eventually installed to update. At the moment I fudge with:
    exec { "pip3-update":
    command => "/usr/local/bin/pip3 install --upgrade pip"
    }

    but it breaks because I’m not convinced that is always the right path (I’d like to hedge on /usr/bin:/usr/local/bin), or that pip3 is actually installed when I try to exec it… I think what I really want to do is something like
    package {
      [
        JUST UPGRADE YOURSELF, PLEASE
      ]: ensure => latest,
      provider => 'pip3';
    }

    with an additional dependency check (require =>) that pip3 has been installed first and, for all the other pip3 installs, that pip3 has been upgraded first (something like the sketch after this list).
  • The IPython notebook is started by a config shell script called from puppet. I think I’m also using a config script to set up a user and test tables in Postgres (though I plan to move to the puppet extension as soon as I can get round to reading the docs/finding out how to do it).
  • There are times when a server needs restarting. At the moment I have to run vagrant provision to do that – or even vagrant halt;vagrant up, which means it also runs the updates. It’d be nice to just be able to run the bits that restart the services (the DBMS’, IPython notebook etc) without doing any of the package updates, installs, checks etc.
  • We need a tool to check whether services are running on the necessary ports, to help debugging without requiring a user to SSH into the VM. I’ve also fixed on default ports; we really need to change to a free port if a default port is already in use, and then somehow tell the IPython notebook scripts which port each service is running on. With vagrant letting you run a VM from within a particular directory, being able to know what VMs are being run, and from where, wherever vagrant on host started them, would be useful.
  • I don’t use a configurator for the postgres db (it needs seeding with some example tables) but should do – on my to do list is to look at https://github.com/puppetlabs/puppetlabs-postgresql . Similarly for mongo db – and perhaps https://github.com/puppetlabs/puppetlabs-mongodb
  • To make use of python-nvd3, suggested route is to use bower. I got the npm package manager to work but have failed to find a way of installing any actual packages [issue].
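
As far as the pip3 fudge mentioned above goes, the sort of thing I think I’m after looks something like the following sketch – untested, assuming a pip3 package provider is actually available, and with illustrative package names:

#Sketch: hedge on the pip3 path, and order the pip3 package installs
#so they depend on pip3 being installed and upgraded first
exec { "pip3-update":
  command => "pip3 install --upgrade pip",
  path    => ["/usr/bin", "/usr/local/bin"],
  require => Package["python3-pip"],
}

package { ["pandas", "ipython"]:
  ensure   => latest,
  provider => "pip3",
  require  => Exec["pip3-update"],
}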

Progress to date – aside from being hit by things like port clashes, and all manner of f**k ups because I distributed a README with an error in it and folk tried to follow it rather than patches posted elsewhere – has been impeded by not having a good way of logging and sharing errors. OU network issues have also contributed to the fun. I always avoid the OU staff network, but nothing seems to work on that. I suspect this is a proxy issue, but I’m unwilling to invest any time exploring it or how to config the VM to cope (no-one else has offered to go down this rat hole). Poxy proxies could well be an issue for some students, but I’m not sure what the best way forward is. Cloud hosted VMs?!

I also had a problem on OU eduroam – the mongodb install wants to fetch a signing key from a keyserver before it will install, but OU eduroam (the network I always use) seems to block the keyserver. Tracking that down wasted a good hour.

Here are some other things I’ve heard about:

https://github.com/psychemedia/notebookcloud This is cloned from https://github.com/carlsmith who appears to have taken his repo – and the app – down? It provided a dashboard for firing up notebook servers on Amazon cloud. If I hadn’t been ‘working’ I’d have blogged screenshots and the workflow. As it is, all I have are vague memories of how it worked and what it did and the ideas that sprung off of having an artefact to talk around. [Hmm… app seems to have come back up – maybe I caught it at a bad time… https://notebookcloud.appspot.com/login ]

– provisioning things: chef, vagrant, puppet, docker.

What should I be using for what?

I thought about different VMs for different services, but that adds too much VM weight and requires networking between the VMs, which could lead to “support issues”. Would docker help here? A base VM built from vagrant and puppet, then docker to put additional machines on top.

What I want is students to be able to:

– install a minimal number of third party apps on their desktop (currently virtualbox and vagrant)
– click one button to get their VM. My feeling is we should have a largely prebuilt box on a USB stick they can use as a base box, then do a top up build and provision. I suspect CT would like one click somewhere to fire up a machine, get services running, and open a tab to the IPython notebook in their browser, maybe a status button somewhere, a single button to restart any services that have stopped, and options to suspend or shutdown machines. In terms of student workflow, I think suspending and resuming machines (if services can resume appropriately) would be the neatest workflow. Note: the course runs over 9 months…
– be able to access files on host that are used in the VM. If they are using multiple VMs (eg on cloud and desktop) to find a sensible way of synching notebooks and data/database state across those machines; which could be tricky at least as far as database state goes.
– if a student is not using postgresql or mongo – and they won’t for the first 8 weeks of the course – it could be useful to run the VM without those services running (perhaps aside from a quick self-test in week 1 so we can check out any issues as early as possible and give ourselves a week or two to come up with any patches before those apps are used in anger). So maybe a control panel to fire up the VM and the services you want to run. Yes mongo, no postgresql. No DB at all. And so on. Would docker help here? Decide on host somehow which services to run, fire up the VM, then load in and start up the required services. Next session, change which services are running in the VM?

All in all, I reckon I’m between 20 and 40% there (further along in terms of time?) but not sure how much further I can push it to the desired level of robustness without learning how to do this stuff a bit more properly… I’m also not really sure what’s practically and reliably possible and what’s best for what. We have to maximise the robustness of stuff ‘just working’ and minimise support issues because students are at a distance and we don’t know what platform they’re on. I think if I started from scratch and rewrote the scripts now they’d be a lot clearer, but I know that’d take half a day and the functional return – for now – I think would be minimal.

That said, I’ve done a fair amount of learning, though large chunks of it have been lost to email and not blogging. Which is a shame. Must do better. Must take public notes.

Anscombe’s Quartet – IPython Notebook

Anyone who’s seen one of my talks that even touches on data and visualisation will probably know how I like to use Anscombe’s Quartet as a demonstration of why it makes sense to look at data, as well as to illustrate the notion of a macroscope, albeit one applied to a case of N=all where all is small…

Some time ago I posted a small R demo – The Visual Difference – R and Anscombe’s Quartet. For the new OU course I’m working on (TM351 – “The Data Course”), our focus is on using IPython Notebooks. And as there’s a chunk in the course about dataviz, I feel more or less obliged to bring Anscombe’s Quartet in:-)

As we’re still finding our way about how to make use of IPython Notebooks as part of an online distance education course, I’m keen to collect feedback on some of the ways we’re considering using the notebooks.

The Anscombe’s Quartet notebook has quite a simple design – we’re essentially just using the cells as computed reveals – but I’m still keen to hear any comments about how well folk think it might work as a piece of standalone teaching material, particularly in a distance education setting.

The notebook itself is on github (ou-tm351), along with sample data, and a preview of the unexecuted notebook can be viewed on nbviewer: Anscombe’s Quartet – IPython Notebook.

Just by the by, the notebook also demonstrates the use of pandas for reshaping the dataset (as well as linking out to a demonstration of how to reshape the data using OpenRefine) and the ŷhat ggplot python library (docs, code) for visualising the dataset.

Please feel free to post comments here or as issues on the github repo.

PS see also this technique for generating Same Stats, Different Graphs: Generating Datasets with Varied Appearance and Identical Statistics through Simulated Annealing.

Seven Ways of Running IPython / Jupyter Notebooks

We’re looking at using IPython notebooks for a MOOC on something or other, so here’s a quick review of the different ways I think we can provide access to them. Please let me know via the comments if there are other models…

[NOTE: I try to keep this post updated as more services/examples become available…]

User Desktop – Native App

Download a version of a scientific python distribution such as Anaconda (python 2.7 & 3) or Enthought Canopy (python 2.7) and run the notebook from within that. Runs cross platform, requires user admin privileges to install application. For Mac users, there’s a handy looking standalone app, Pineapple. (I’m not sure if there’s a Windows equivalent? For now, here’s a recipe for rolling your own standalone Jupyter notebook on Windows.)

There does look to be a cross-platform, standalone electron app in development – nteract – but the implication is that you currently need to do some compilation yourself to get it to work….

In the context of a MOOC, this approach would require participants to download and install the python distribution on a desktop or laptop computer. This approach will not work on a tablet.

In terms of supporting the ability to open a notebook directly by double clicking on a notebook file, this nbopen looks like it may do the trick?

Delivery/use costs: none.
Publisher demands: notebook development.
Maintenance: dependence on distribution provider.
Support issues: installation of 3rd party software.
Custom branding/styling/extensions: can be installed after running a config script.

Browser Extension

The CoLaboratory (about) Chrome extension allows you to run IPython notebooks within the Chrome browser without the need to install any other software. Notebooks are saved to/opened from Google Drive. Python 2.7(?). Requires Google Chrome (cross-platform), Google Account (for Google Drive integration). This approach does not support the installation of arbitrary third party python libraries – only libraries compiled into the extension will work.

In the context of a MOOC, this approach would require participants to download and install Chrome and the CoLaboratory extension. This approach will not work on a tablet.

Delivery/use costs: none.
Publisher demands: notebook development.
Maintenance: dependence on extension publisher; (code available to fork).
Support issues: installation of 3rd party software.
Custom branding/styling/extensions: not supported(?)

iOS App – NO LONGER AVAILABLE

As of March 2015 – no longer available from appstore
An iOS app, Computable (blog), makes notebooks available on an iPad. The app is free to preview notebooks bundled with the app, but requires a $10 in-app purchase to run your own notebooks. Integrated with Dropbox. Source code not available. Other reports of demos – but again, no code – available, such as: IPython notebook on iPhone. I don’t know if this approach supports the installation of arbitrary third party python libraries; that is, I don’t know if only libraries compiled into the application will work.

In the context of a MOOC, this approach would require participants to download and install the app and then pay the $10 fee. A Dropbox account is also required(?). This approach will only work on an iOS device.

Delivery/use costs: $10 to student.
Publisher demands: notebook development.
Maintenance: dependence on app publisher.
Support issues: installation of 3rd party software.
Custom branding/styling/extensions: not supported(?)

User Desktop – Virtual Machine

Run an IPython server within a virtual machine on the user’s desktop, exposing the notebook via a browser. Requires a virtual machine runner (eg VirtualBox, VMWare), port forwarding; some mechanism for starting up the VM and auto-running the notebook server. Devops tools may be used to deploy VMs either locally or in the cloud (eg JiffyLab (about Jiffylab); the OU course TM351 (in production) is currently exploring the use of VMs managed using vagrant to deliver IPython notebooks along with a range of other services).

Pre-defined IPython notebook VM examples: docker: ipython/notebook; Chef: IPython notebook cookbook.

In the context of a MOOC, this approach would require participants to download and install a virtual machine runner and the virtual machine image. This approach will not work on a tablet.

Delivery/use costs: none.
Publisher demands: development of VM image.
Maintenance: tracking VM runner compatibility.
Support issues: installation of 3rd party software; installation of VM; port forwarding conflicts, locating the notebook server address.
Custom branding/styling/extensions: yes.

In the Cloud – Managed Services

Open a notebook running on a managed IPython notebook server such as Wakari, Authorea or CoCalc (which was originally known as SageMathCloud). Several of the bigger players also make notebooks available as part of a wider “studio” offering, such as the IBM DataScientist Workbench (review) or Microsoft Azure Machine Learning (review). Kaggle also hosts Python and R notebooks for working with Kaggle datasets. Free plans are typically available, often requiring a personal account on the managed service and an internet connection.

In the context of a MOOC, this approach would require participants to have access to a network connection and create an account on a provider service. This approach will work on any device.

Delivery/use costs: Notebooks created under free plans will be public.
Custom branding/styling/extensions: no – branding may be associated with provider; provider identified extensions may be offered as part of hosted service.

In the Cloud – Ad Hoc tmp Notebooks

Nature recently did a splash on IPython notebooks (Interactive notebooks: Sharing the code) that included a demo notebook for readers to experiment with: IPython interactive demo.

The system uses a ‘temporary notebook’ server described here: Instant Temporary IPython Notebooks (code: Jupyter tmpnb). The server spawns temporary notebooks that can be used to run particular activities. Other example use cases include the provision of notebooks at conferences/workshops for demo purposes, hour of code demos etc. (tmpnb also provides the basis for the try.jupyter.org demo service.)

In the context of a MOOC, this approach would require participants to have access to a network connection. Notebooks will be temporary and cannot be persisted. This approach would be ideal for one off activities. This approach will work on any device. My assumption is that we would not be able to direct participants to the current 3rd party provider of this service without prior agreement.

Delivery/use costs: server provision for running notebooks.
Publisher demands: management of cloud services if self-hosted.
Maintenance: tracking VM runner compatibility.
Support issues:
Custom branding/styling/extensions: no.

In the Cloud – Github Spawned Notebooks

Project Binder is “[a] system for deploying a collection of Jupyter notebooks and their dependencies as live, interactive notebooks, straight from GitHub repositories, across one or more nodes of a Kubernetes cluster”. Users provide Github repository details to the binder service and a notebook server is launched in a docker container that then references the notebooks on Github. Additional services, such as PostgreSQL, can also be linked in. Note that the user does not need to be the owner of the Github repository – it is simply used as the source for the notebooks. It’s also possible to run the MyBinder notebooks using an R kernel.

Delivery/use costs: server provision for running notebooks.
Publisher demands: management of cloud services if self-hosted.
Maintenance:
Support issues:
Custom branding/styling/extensions: no (unless you host the notebook runner)

In the Cloud – Hosted VMs

Another virtual machine approach, but participants fire up a prebuilt virtual machine in the cloud. Examples include: Yhat Sciencebox. Another example is the notebookcloud, a semi-managed service built around Google authentication and user’s personal AWS credentials; ((forked) legacy code). (From the docs: NotebookCloud is service that allows you to launch and control IPython Notebook servers on Amazon EC2 from your browser. This enables you to host your own Python programming environment, on your own Amazon virtual machine, and access it from any modern web browser.)

In the context of a MOOC, this approach would require participants to have access to a network connection and create an account on a provider service. This approach will work on any device.

Delivery/use costs: participant pays machine, storage and bandwidth costs according to usage.
Publisher demands: management of cloud services if self-hosted. Alternatively, we could host a version of NotebookCloud.
Maintenance: potential reliance on 3rd party VM configuration.
Support issues: account creation, VM management.
Custom branding/styling/extensions: no – branding may be associated with provider; provider identified extensions may be offered as part of hosted service.

Hosted MultiUser Systems

IPython and IPython Notebooks are under active development and it looks as if future releases will support multi-user services, for example JupyterHub: A multi-user server for Jupyter notebooks.

Docker Containers

Another take on virtualisation: IPython notebooks can be launched within docker containers within a lightweight virtual machine, either on the desktop or in the cloud. For an example of launching a notebook on the desktop (currently (June, 2015) Mac/Linux only), see Kiteflying Around Containers – A Better Alternative to Course VMs?; for a related example in the cloud, see Getting Started With Personal App Containers in the Cloud.

Delivery/use costs: participant pays machine, storage and bandwidth costs on the cloud host according to usage; desktop use is free.
Publisher demands: management of cloud services if self-hosted. Application code is self-contained and can represent a managed distribution.
Maintenance: potential reliance on 3rd party defined containers.
Support issues: account creation, docker management.
Custom branding/styling/extensions: yes, within a custom container.

Summary

In order to make interactive use of IPython notebooks within a course, we need to make notebooks available, and provide a way of running them. A wide variety of models that support the running of notebooks exists. We can distinguish: where does the notebook server run; where does it load pre-existing notebooks from; where does it save notebooks to.

Running a notebook server incurs a resource cost in terms of installation, maintenance, (remote) access (eg managing multiple instances, port availability etc when offering a hosted service), and any financial costs associated with running the service. Who covers the costs and meets any load issues depends on the solution adopted.

As far as usage goes, notebooks are accessed via a web browser and as such are accessible on any device. However, if a server runs locally rather than on a remote host, there are device dependencies to contend with that rule out some solutions on some platforms (VM can’t run on iOS; iOS app won’t run on Windows machine etc).

In an online course environment, we may be able to separate concerns and suggest a variety of ways of running notebooks to participants, leaving it up to each participant to find a way of running the notebooks that works for them. Duty of care might then extend only as far as making notebooks available that will run on all the platforms we have recommended.

In terms of pedagogy, we might distinguish between notebooks as used:

  1. to run essentially standalone exercises, and which we might treat as disposable (a model which the tmpnb solution fits beautifully because no persistence of state is required); these exercises might be pre-prepared notebooks made available to participants, or notebooks that users create as a scratch pad to run through a particular set of activities described elsewhere;
  2. for activities where the participant may want to save a work in progress and return to it at a later date (which requires persistence of state on a per user basis); this might be in the context of a notebook we provide to the students, or one they have created and are working on themselves;
  3. for activities where participants create their own notebooks and wish to preserve them.

From the OU perspective, we should probably take a view on the extent to which we develop solutions that work across one or more contexts, such as MOOCs delivered outside of OU provisioned services (eg FutureLearn); MOOCs delivered via an OU context (eg OpenLearn); and courses delivered to OU fee paying students via OU systems.

There may be opportunities to develop solutions that work to support the delivery of OU courses, as well as OU MOOCs, and that are further licensable to other institutions, eg to support their own course delivery or MOOC delivery, or appropriate for release as open software in support of the wider community.

Using Docker to Build Linked Container Course VMs

Having spent bits of last year tinkering with vagrant and puppet as part of a workflow for building and deploying course related VMs in a scaleable way for a distance education context (trying to be OUseful here…) I’ve more recently started pondering whether it makes more sense to create virtual machines from linked data containers.

Some advantages of the “all in one flat VM” approach seem to be that we can construct puppet files to build particular components and then compose the final machine configuration from a single Vagrant script pulling in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs. There is an overhead associated with running a VM which needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch-it-off-and-switch-it-on-again opportunities: a machine can be shut down and restarted/reprovisioned, or if necessary can be deleted and reinstalled (though this latter loses any state that was saved internally in the VM by the student). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted, we can just destroy the container and fire up a replacement (though we’d probably need to find ways of preserving – or deleting – state associated with a particular application container too). If applications run as services, and for example I have a notebook server connected to a database server, if I destroy the database server container, I should be able to point the notebook server to the new database server – if I know the address of the new database server…

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

notebooks:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/notebooks:/notebooks"
data:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/data:/data"

devpostgresdata:
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data

devpostgres:
    environment:
        - POSTGRES_PASSWORD
    image: postgres
    ports:
        - "5433:5432"
    volumes_from:
        - devpostgresdata

notebook:
    environment:
        - PASSWORD
    image: calvingiles/data-science-environment
    links:
        - devpostgres:postgres
    ports:
        - "443:8888"
    volumes_from:
        - notebooks
        - data

(WordPress code plugin & editor sucking atm wrt the way it keeps trying to escape stuff…)

(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L https://gist.githubusercontent.com/calvingiles/b6123c301954fe68e29a/raw/data-science-environment-fig.yml > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers fired up with the desired password settings:

fig rm
PASSWORD=MyPass POSTGRES_PASSWORD=PGPass fig up -d

(The script also creates Google Drive folders into which copies of the notebooks will be located and shared between the VM containers and the host.)

The notebooks can then be accessed via browser (you need to log in with the specified password – MyPass from the example above); the location of the notebooks is https://IP.ADDRESS:443 (note the https, which may require you saying “yes, really load the page” to Google Chrome – though it is possible to configure the server to use just http), where IP.ADDRESS can be found by running boot2docker ip.

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn’t see it on localhost). I found I needed to connect to the actual IP address within the VM of the database container.

I found this address (IPADDRESS) from the docker commandline using: fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS,port=5432,user='postgres',password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Via http://blog.michaelhamrah.com/2014/06/accessing-the-docker-host-server-within-a-container/
#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternatively, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering – how would any particular student know how to connect to “their” personal database server(s)? [UPDATE: doh! Container linking passes name information into a container as an environment variable: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see other containers from their bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules and communication of where to find things be managed? Or am I overcomplicating?!
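
Following up that update: a sketch based on the legacy links behaviour – with a links: alias of postgres, docker should inject connection details into the linked container as environment variables:

#Sketch: read the connection details docker links inject as env vars
import os
import psycopg2

con = psycopg2.connect(host=os.environ['POSTGRES_PORT_5432_TCP_ADDR'],
                       port=os.environ['POSTGRES_PORT_5432_TCP_PORT'],
                       user='postgres', password='PGPass')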

Anyway – this is a candidate way for constructing a VM out of containers in an automated way. So what are the relative merits, pros, cons etc of using the vagrant/puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (eg a trad-uni computer lab with VMs running on student desktops (if that’s allowed!); a distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud etc)?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES',port='5432',user='postgres', password="PGPass")

Check other environment variables with:

import os
os.environ

Adding Metadata to Google Docs

A couple of months ago I started working on an export tool that would export a Google doc in the OU-XML format. The rationale? The first couple of drafts of the teaching material that will be delivered through the VLE in the forthcoming (October, 2015) OU Data management and analysis course (TM351) have been prepared in Google docs, and the production process will soon have to move to the Open University’s XML workflow. This workflow is built around an OU defined schema, often referred to as OU-XML (or OUXML), and is supported by a couple of oXygen XML editor extensions that make it easy to preview rendered versions of the documents in a test VLE site.

The schema itself includes several elements that are more akin to metadata elements than actual content – things like the course code, course title, for example, or the byline (or lead author) of a particular unit.

Support for a small amount of metadata is provided by Google Drive, but the only easily customisable element is a free text description element.

gdocsMetadata

So whilst patching a couple of “issues” today with the Google Docs to OU-XML generator, and adding a menu option that allows users to create a zip file in Google Drive that contains the OU-XML and any associated image files for a particular Google doc, I thought it might also be handy to add some support for additional metadata elements. Google Drive apps support a Properties class that allows metadata properties represented as key-value pairs to be associated with a particular document, user or script. Google Apps Script can be used to set and retrieve these properties. In addition, Google Apps Script can be used to generate templated HTML user interface forms that can be used to extend Google docs or spreadsheets functionality.

In particular, I created a handful of Google Apps Script functions to pop up a templated panel, save metadata descriptions entered into the metadata form as document properties, and retrieve the value of a particular metadata element.

//Pop up the metadata edit/display panel
//The document is created as a templated HTML document
function metadataView() {
  // Generate the HTML
  var html = HtmlService
      .createTemplateFromFile('metadata')
      .evaluate()
      .setSandboxMode(HtmlService.SandboxMode.IFRAME);
  //Pop up a panel and render the HTML describing the metadata form inside it
  DocumentApp.getUi().showModalDialog(html, 'Metadata');
}

//This function sets the document properties from the metadata form elements
function processMetadataForm(theForm) {
  var props=PropertiesService.getDocumentProperties()
  //Process each form element (atm, they are just input text elements)
  for (var item in theForm) {
    props.setProperty(item,theForm[item])
    Logger.log(item+':::'+theForm[item]);
  }
}

The templated HTML form is configured using a set of desired metadata elements. Each element is described using a label that is displayed in the form, an attribute name (which should be a single word) and an optional default value. The template also demonstrates how we can call a server side Apps Script function from the dialogue using the google.script.run.FUNCTION_NAME construction.

<script src="//ajax.googleapis.com/ajax/libs/jquery/1.9.1/jquery.min.js"></script>
  
<? 
//Add metadata fields here in the following format:
//[Label, a unique identifier (unique word, no spaces or punctuation), an optional default value]
var metadataItems =[
    ["Lead Author","leadAuthor"],
    ["Course Code","courseCode"],
    ["Course Title","courseTitle"],
    ["Unit Title","unitTitle"],
    ["Rendering","rendering","VLE2 staff (learn3)"]
]
?>
  
<? var metadata = PropertiesService.getDocumentProperties() ?>
<script>
//When the metadata has been successfully saved as document properties
//  close the metadata form panel
function onSave() {google.script.host.close()}
</script>
  
<form id='metadataForm'>
<!-- Construct a set of form elements, one for each metadata item -->
<? for (var i = 0; i < metadataItems.length; i++) { ?>
  <div><?= metadataItems[i][0] ?>: 
    <input type="text"
      name = "<?= metadataItems[i][1] ?>"
      <? val=''
        if (metadataItems[i].length>2) val= metadataItems[i][2]  ?>
      value= "<?= metadata.getProperty(metadataItems[i][1]) ? metadata.getProperty(metadataItems[i][1])  : val  ?>"
    /> 
  </div>
<? } ?>
    
</form>
  
<div>
  <input
    type="button"
    value="Save & Close"
    onclick="google.script.run.withSuccessHandler(onSave).processMetadataForm(document.getElementById('metadataForm'))"
  />
  
  <input
    type="button"
    value="Cancel"
    onclick="google.script.host.close()"
  />
</div>

When the metadataView() function is called from the Add-Ons menu, it pops up a dialogue that looks (in unstyled form) something like this:

googleDocMetadata

Metadata elements are loaded into the form if they exist; otherwise the default value, if one is specified, is used.

When generating the export OU-XML, a helper function grabs the value of the relevant metadata element from the document properties. This value can then be inserted into the OU-XML at the appropriate point.

//A helper function to grab the value of a particular metadata element
//from the document properties when generating the OU-XML export
function getProp(key) {
  var props= PropertiesService.getDocumentProperties()
  return props.getProperty(key) ? props.getProperty(key) : '';
}

var COURSECODE= getProp('courseCode');

One issue with this approach is that if we have lots of documents relating to different units for the same course, we may need to enter the same values for several metadata elements across each document (for example, the course code and course title). Unfortunately, Google Drive does not support arbitrary properties for folders. One solution, suggested by Tom Smith/@everythingabili, was to use the description element for a folder to store JSON represented metadata. I think we could actually simplify that, using a line based representation or a simple delimited representation that we can easily split on, something like:

courseCode :: TM351;;
courseTitle:: Data Management and Analysis

for example. We could then split on ;; to get each pair, strip whitespace, split on :: and strip whitespace again to get the key:value elements for each metadata item.
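
A sketch of a parser along those lines, in Apps Script:

//Sketch: parse ';;' separated 'key :: value' pairs
//from a folder description string
function parseFolderMetadata(description) {
  var metadata = {};
  description.split(';;').forEach(function(pair) {
    var parts = pair.split('::');
    if (parts.length == 2) {
      metadata[parts[0].trim()] = parts[1].trim();
    }
  });
  return metadata;
}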

gdocsfoldermetadata

I guess one way of getting the folder description, given a particular document as a starting point, is to find the parent folder (using file#getParents(), perhaps?) and then call folder#getDescription()?

Another approach might be to have a dummy, canonically named file in each folder (metadata for example), that we add metadata to, and then whenever we open a new file in the folder we look for the metadata file, get its metadata property values, and use those to seed the metadata values for our content document.

Finally, it’s maybe worth also pondering the issue of generating the OU-XML export for all the documents within a given folder? One way to do this might be to create a function off each document that will find the parent folder, find all the files (except, perhaps, a metadata file?!) in that folder, and then run the OU-XML generator over all of them, bundling them up into a single zip file, perhaps with a directory structure that puts the OU-XML for each document, along with any image files associated with it, into separate folders?

Only it probably isn’t worth it… I suspect that the migration to the OU-XML format, if it hasn’t already happened, will involve copying and pasting…

PS for completeness, the menu option can be installed as follows:

function onOpen(e) {
  DocumentApp.getUi().createAddonMenu()
      .addItem('Metadata','metadataView')
      .addToUi();
}