Tagged: TM351

Recreating a Node.js Installation – Package Versions

Rebuilding a fresh version of the TM351 VM from scratch yesterday, I got an error trying to install tty.js, a node.js app that provides a “terminal desktop in the browser”.

[Image: vagrant_tm351lts_ttyjs]

Looking into a copy of the VM where tty.js does work, I could discover the version of node I’d previously successfully used, as well as check all the installed package versions:

### Show nodejs version and packages
> node -v
v0.10.35

> npm list -g
/usr/local/lib
├─┬ npm@1.4.28
│ ├── abbrev@1.0.5
│ ├── ansi@0.3.0
│ ├── ansicolors@0.3.2
│ ├── ansistyles@0.1.3
│ ├── archy@0.0.2
│ ├── block-stream@0.0.7
...
│ └── which@1.0.5
└─┬ tty.js@0.2.13
  ├─┬ express@3.1.0
  │ ├── buffer-crc32@0.1.1
...

Using this information, I could then use nvm, a node.js version manager, installed via:

curl https://raw.githubusercontent.com/creationix/nvm/v0.23.3/install.sh | NVM_DIR=/usr/local/lib/ bash

to install, from a new shell, the version I knew worked:
nvm install 0.10.35
npm install tty.js

(I should probably add the tty.js version in there too? npm install tty.js@0.2.13 perhaps?)

The terminal can then be run as a daemon from:

/usr/local/lib/node_modules/tty.js/bin/tty.js --port 3000 --daemonize

What this got me wondering was: are there any utilities that let you capture a nodejs configuration, for example, and then recreate it on a new machine? That is, export the node version number and versions of the installed packages, then create an installation script that will recreate that setup?
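As a first pass, something along these lines might work (a sketch only: the filenames are arbitrary, and the npm list tree output would need massaging into name@version form by hand):

### Capture the node version and top-level global packages on the working VM
node -v > node-version.txt
npm list -g --depth=0 > npm-packages.txt

### On the new machine (assumes nvm is already installed)
nvm install $(cat node-version.txt)
npm install -g tty.js@0.2.13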

It would be handy if this approach could be extended further. For example, we can also look at the packages – and their version numbers – installed on the Linux box using:

### Show packages
dpkg -l

And we can get a list of Python packages – and their version numbers – using:

### Show Python packages
pip3 list

Surely there must be some simple tools/utilities out there that support this sort of thing? Or even just cheatsheets that show you what commands to run to export the packages and versions into a file in a format that allows you to use that file as part of an installation script in a new machine to help rebuild the original one?
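A couple of standard recipes do get part of the way there (the filenames here are arbitrary, and the dpkg route assumes a Debian/Ubuntu box):

### Export and re-import apt package selections
dpkg --get-selections > packages.list
sudo dpkg --set-selections < packages.list
sudo apt-get -y dselect-upgrade

### Export and re-import Python packages, with pinned versions
pip3 freeze > requirements.txt
pip3 install -r requirements.txt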

Notes on Data Quality…

You know how it goes – you start trying to track down a forward “forthcoming” reference and you end up wending your way through all manner of things until you get back to where you started none the wiser… So here’s a snapshot of several docs I found trying to source the original forward reference for the following table, found in Improving information to support decision making: standards for better quality data (November 2007, first published October 2007), with the crib that several references to it mentioned the Audit Commission…

[Image: datainfoKnowledge]

The first thing I came across was The Use of Information in Decision Making – Literature Review for the Audit Commission (2008), prepared by Dr Mike Kennerley and Dr Steve Mason, Centre for Business Performance Cranfield School of Management, but that wasn’t it… This document does mention a set of activities associated with the data-to-decision process: Which Data, Data Collection, Data Analysis, Data Interpretation, Communication, Decision making/planning.

The data and information definitions from the table do appear in a footnote – without reference – in Nothing but the truth? A discussion paper from the Audit Commission in Nov 2009, but that’s even later… The document does, however, identify several characteristics (cited from an earlier 2007 report (Improving Information, mentioned below…), and endorsed at the time by Audit Scotland, Northern Ireland Audit Office, Wales Audit Office and CIPFA, with the strong support of the National Audit Office), that contribute to a notion of “good quality” data:

Good quality data is accurate, valid, reliable, timely, relevant and complete. Based on existing guidance and good practice, these are the dimensions reflected in the voluntary data standards produced by the Audit Commission and the other UK audit agencies
* Accuracy – data should be sufficiently accurate for the intended purposes.
* Validity – data should be recorded and used in compliance with relevant requirements.
* Reliability – data should reflect stable and consistent data collection processes across collection points and over time.
* Timeliness – data should be captured as quickly as possible after the event or activity and must be available for the intended use within a reasonable time period.
* Relevance – data captured should be relevant to the purposes for which it is used.
* Completeness – data requirements should be clearly specified based on the information needs of the body and data collection processes matched to these requirements.

The document also has some pretty pictures, such as this one of the data chain:

[Image: datachain]

In the context of the data/information/knowledge definitions, the Audit Commission discussion document also references the 2008 HMG strategy document Information matters: building government’s capability in managing knowledge and information, which includes the table in full; a citation link is provided, but it 404s, and a source is given as the November 2008 version of Improving information, the one we originally started with. So the original report forward-refers the table to an unspecified source, while later reports in the area refer back to that “original” without making a claim to the actual table itself?

Just in passing, whilst searching for the Improving information report, I actually found another version of it… Improving information to support decision making: standards for better quality data, Audit Commission, first published March 2007.

[Image: twireports]

The table and the definitions as cited in Information Matters do not seem to appear in this earlier version of the document?

PS Other tables do appear in both versions of the report. For example, both the March 2007 and November 2007 versions of the doc contain this table (here, taken from the 2008 doc) of stakeholders:

[Image: stakeholders]

Anyway, aside from all that, several more documents for my reading list pile…

PS see also Audit Commission – “In the Know” from February 2008.

Using Docker to Build Linked Container Course VMs

Having spent bits of last year tinkering with vagrant and puppet as part of a workflow for building and deploying course related VMs in a scalable way for a distance education context (trying to be OUseful here…) I’ve more recently started pondering whether it makes more sense to create virtual machines from linked data containers.

Some advantages of the “all in one flat VM” approach seem to be that we can construct puppet files to build particular components and then compose the final machine configuration from a single Vagrant script pulling in those separate components. Whilst this works when developing a VM for use by students on their own machines, it perhaps makes less sense if we were to provide remote hosted access to student VMs. There is an overhead associated with running a VM which needs to be taken into account if you need to scale. In terms of help desk support, the all-in-one VM approach offers a couple of switch-it-off-and-switch-it-on-again opportunities: a machine can be shut down and restarted/reprovisioned, or if necessary can be deleted and reinstalled (though this latter loses any state that was saved internally in the VM by the student). If a particular application in the VM needs shutting down and restarting, then a specific stop/start instruction is required for each application.

On the other hand, a docker route in which each virtual application is launched inside its own container, and those containers are then linked together to provide the desired student VM configuration, means that if an application needs to be restarted, we can just destroy the container and fire up a replacement (though we’d probably need to find ways of preserving – or deleting – state associated with a particular application container too). If applications run as services, and for example I have a notebook server connected to a database server, if I destroy the database server container, I should be able to point the notebook server to the new database server – if I know the address of the new database server…
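(For reference, the linking mechanic looks something like this with the raw docker CLI – the container names here are illustrative, and the notebook image is the one used in the fig example below:

### Start a database container, then link a notebook container to it
docker run -d --name devpostgres -e POSTGRES_PASSWORD=PGPass postgres
docker run -d --name notebook --link devpostgres:postgres -p 443:8888 calvingiles/data-science-environment

The --link flag passes the database container’s address into the notebook container, under the alias postgres, as environment variables.)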

After a bit of searching around, I came across an example of creating a configuration not too dissimilar from the TM351 virtual machine configuration, but built from linked containers: Using Docker for data science, part 2 [Calvin Giles]. The machine is constructed from several containers, wired together using this fig script:

notebooks:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/notebooks:/notebooks"
data:
    command: echo created
    image: busybox
    volumes:
        - "~/Google Drive/data:/data"

devpostgresdata:
    command: echo created
    image: busybox
    volumes: 
        - /var/lib/postgresql/data

devpostgres:
    environment:
        - POSTGRES_PASSWORD
    image: postgres
    ports:
        - "5433:5432"
    volumes_from:
        - devpostgresdata

notebook:
    environment:
        - PASSWORD
    image: calvingiles/data-science-environment
    links:
        - devpostgres:postgres
    ports:
        - "443:8888"
    volumes_from:
        - notebooks
        - data

(WordPress code plugin & editor sucking atm wrt the way it keeps trying to escape stuff…)

(Fig is a tool for building multiple docker containers and wiring them together, a scripted version of something like Panamax. The main analysis application – calvingiles/data-science-environment – is a slight extension of ipython/scipyserver.)

With fig and boot2docker installed, and the fig script downloaded into the current working directory:

curl -L https://gist.githubusercontent.com/calvingiles/b6123c301954fe68e29a/raw/data-science-environment-fig.yml > fig.yml

the following two lines of code make sure that any previous copies of the containers are purged, and a new set of containers fired up with the desired password settings:

fig rm
PASSWORD=MyPass POSTGRES_PASSWORD=PGPass fig up -d

(The script also creates Google Drive folders in which the notebooks will be stored and shared between the VM containers and the host.)

The notebooks can then be accessed via a browser (you need to log in with the specified password – MyPass from the example above); the location of the notebooks is https://IP.ADDRESS:443 (note the https, which may require you saying “yes, really load the page” to Google Chrome – though it is possible to configure the server to use just http), where IP.ADDRESS can be found by running boot2docker ip.

One thing I had trouble with at first was connecting the IPython notebook to the PostgreSQL database server (I couldn’t see it on localhost). I found I needed to connect to the actual IP address within the VM of the database container.

I found this address (IPADDRESS) from the docker commandline using: fig run devpostgres env (where devpostgres is the name of the database server container). The port is the actual server port number rather than the forwarded port number:

import psycopg2
con = psycopg2.connect(host=IPADDRESS, port=5432, user='postgres', password='PGPass')

I also came up with a workaround (as described in this issue I raised) but this seems messy to me – there must be a better way? Note how we connect to the forwarded port:

#Via http://blog.michaelhamrah.com/2014/06/accessing-the-docker-host-server-within-a-container/
#Get the IP address of the docker host server inside the VM
# I assume this is like a sort of 'localhost' for the space in which the containers float around?
IPADDRESS=!netstat -nr | grep '^0\.0\.0\.0' | awk '{print $2}'

#Let's see if we can connect to the db using the forwarded port address
import psycopg2
con = psycopg2.connect(host=IPADDRESS[0],port='5433',user='postgres', password='PGPass')

#Alternatively, connect via SQL magic
!pip3 install ipython-sql
%load_ext sql
postgrescon = 'postgresql://postgres:PGPass@'+IPADDRESS[0]+':5433'

#Then cell magic via:
%%sql $postgrescon

This addressing fiddliness also raises an issue about how we would run container bundles for several students in the same VM under a hosted offering – how would any particular student know how to connect to “their” personal database server(s)? [UPDATE: doh! Container linking passes name information into a container as an environment variable: Linking Containers Together.] Would we also need to put firewall rules in place to partition the internal VM network so that a student could only see other containers from their bundle? And in the event of switch-it-off/destroy-it/start-it-up-again actions, how would any new firewall rules and communication of where to find things be managed? Or am I overcomplicating?!

Anyway – this is a candidate way for constructing a VM out of containers in an automated way. So what are the relative merits, pros, cons etc of using the vagrant/puppet/all-in-one-VM approach as opposed to the containerised approach in an educational context? Or indeed, for different education contexts (eg trad-uni computer lab with VMs running on student desktops (if that’s allowed!); distance education student working on their home machine with a locally hosted VM; multiple students connecting to VM configurations hosted on a local cluster, or on AWS/Google Cloud etc)?

Any comments – please add them below… I am sooooooo out of my depth in all this!

PS Seems I can connect with con = psycopg2.connect(host='POSTGRES', port='5432', user='postgres', password='PGPass')

Check the other environment variables with:

import os
os.environ
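So a quick way of seeing what a link passes in is to dump the environment of a one-off container for the notebook service (a sketch, using the fig service names from the example above – fig run should apply the configured links):

### List the link-injected variables inside the notebook container
fig run notebook env | grep '^POSTGRES'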

Edtech and IPython Notebooks – Activities and Answer Reveals

A few months ago I posted about an interaction style that I’d been using – and that stopped working – in IPython notebooks: answer reveals.

An issue I raised on the relevant github account turned up a solution that I’ve finally gotten round to trying out – and extending with a little bit of styling. I’ve also reused chunks from another extension (read only code cells) to help style other sorts of cell.

Before showing where I’m at with the notebooks, here’s where OU online course materials are at the moment.

Teaching text is delivered via the VLE (have a look on OpenLearn for lots of examples). Activities are distinguished from “reading” text by use of a coloured background.

[Image: m269_closedAns]

The activity requires a student to do something, and then a hidden discussion or answer can be revealed so that the student can check their answer.

[Image: m269_ansReveal]

(It’s easy to be lazy, of course, and just click the button without really participating in the activity. In print materials, a frictional overhead was added by having answers in the back of the study guide that you would have to turn to. I often wonder whether we need a bit more friction in the browser based material, perhaps a time based one where the button can’t be clicked for x seconds after the button is first seen in the viewport (eg triggered using a test like this jQuery isOnScreen plugin)?!)

I drew on this design style to support the following UI in an IPython notebook:

Here’s a markdown cell in activity style and activity answer style.

M269_-_Python_-_Blue1a

The lighter blue background set into the activity is an invitation for students to type something into those cells. The code cell is identified as such by the In [ ] prompt label. Whatever students type into those cells can be executed.

The heading is styled from a div element:

[Image: M269_-_Python_-_Blue]

If we wanted a slightly different header background style, as in the browser materials, we could perhaps select a notebook heading style and then colour the background differently right across the width of the page. (Bah.. should have thought of that earlier!;-)

Markdown cells can also be styled to prompt students to make a text response (i.e. a response written in markdown, though I guess we could also use raw text cells). I don’t yet have a feeling for how much ownership students will take of notebooks and start to treat them as workbooks?

Answer reveal buttons can also be added in:

[Image: M269_-_Python_-_Blue2]

Clicking on the Answer button displays the answer.

[Image: M269_-_Python_-_Blue3]

At the moment, the answer background is the same colour as the invitation for students to type something, although I guess we could also interpret it as part of the tutor-alongside dialogue, with the lighter colour signifying expected dialogic responses, whether from the student or the “tutor” (i.e. the notebook author).

We might also want to make use of answer buttons after a code completion activity. I haven’t figured out the best way to do this as of yet.

[Image: M269_-_Python_-_Blue4]

At the moment, the answer button only reveals text – and even then the text needs to be styled as HTML (the markdown parsing doesn’t work:-(

[Image: M269_-_Python_-_Blue5]

I guess one approach might be to spawn a new code cell containing some code written into the answer button div. Another might be to populate a code cell following the answer button with the answer code, hiding the cell and disabling it (so it can’t be executed/run), then revealing it when the answer button is clicked? I’m also not sure whether answer code should be runnable or not?

The mechanic for setting the cell state is currently a little clunky. There are two extensions, one for the answer button, one for setting the state of other cells, that use different techniques for addressing the cells (and that really need to be rationalised). The extensions run styling automatically when the notebook is started, or when triggered. At the moment, I’m using icons from the original code I “borrowed” – which aren’t ideal!

[Image: M269_-_Python_-_Blue0]

The cell state setter button toggles selected code cells from activity to not-activity states, and markdown cells from activity-description to activity-student-answer to not-activity. The answer button extension adds an answer button at every answer div (even if there’s already an answer button rendered). Both extensions style/annotate restarted notebooks correctly.

The current, hacky, user model I have in mind is that authors have an extended notebook with buttons to set the cell styles, and students have an extended notebook without buttons that just styles the notebook when it’s opened.

FWIW, here’s the gist containing extensions code.

Comments/thoughts appreciated…

Assessing Data Wrangling Skills – Conversations With Data

By chance, a couple of days ago I stumbled across a spreadsheet summarising awarded PFI contracts as of 2013 (private finance initiative projects, 2013 summary data).

The spreadsheet has 800 or so rows, and a multitude of columns, some of which are essentially grouped (although multi-level, hierarchical headings are not explicitly used) – columns relating to (estimated) spend in different tax years, for example, or the equity partners involved in a particular project.

As part of the OU course we’re currently developing on “data”, we’re exploring an assessment framework based on students running small data investigations, documented using IPython notebooks, in which we will expect students to demonstrate a range of technical and analytical skills, as well as some understanding of data structures and shapes, data management technology and policy choices, and so on. In support of this, I’ve been trying to put together one or two notebooks a week over the course of a few hours to see what sorts of thing might be possible, tractable and appropriate for inclusion in such a data investigation report within the time constraints allowed.

To this end, the Quick Look at UK PFI Contracts Data notebook demonstrates some of the ways in which we might learn something about the data contained within the PFI summary data spreadsheet. The first thing to note is that it’s not really a data investigation: there is no specific question I set out to ask, and no specific theme I set out to explore. It’s more exploratory (that is, rambling!) than that. It has more of the form of what I’ve started referring to as a conversation with data.

In a conversation with data, we develop an understanding of what the data may have to say, along with a set of tools (that is, recipes, or even functions) that allow us to talk to it more easily. These tools and recipes are reusable within the context of the dataset or datasets selected for the conversation, and may be transferable to other data conversations. For example, one recipe might be how to filter and summarise a dataset in a particular way, or generate a particular view or reshaping of it that makes it easy to ask a particular sort of question, or generate a particular sort of a chart (creating a chart is like asking a question where the surface answer is returned in a graphical form).

If we are going to use IPython notebook documented conversations with data as part of an assessment process, we need to tighten up a little more how we might expect to see them structured and what we might expect to see contained within them.

Quickly skimming through the PFI conversation, elements of the following skills are demonstrated:

– downloading a dataset from a remote location;
– loading it into an appropriate data representation;
– cleaning (and type casting) the data;
– spotting possibly erroneous data;
– subsetting/filtering the data by row and column, including the use of partial string matching;
– generating multiple views over the data;
– reshaping the data;
– using grouping as the basis for the generation of summary data;
– sorting the data;
– joining different data views;
– generating graphical charts from derived data using “native” matplotlib charts and a separate charting library (ggplot);
– generating text based interpretations of the data (“data2text” or “data textualisation”).

The notebook also demonstrates some elements of reflection about how the data might be used. For example, it might be used in association with other datasets to help people keep track of the performance of facilities funded under PFI contracts.

Note: the notebook I link to above does not include any database management system elements. As such, it represents elements that might be appropriate for inclusion in a report from the first third of our data course – which covers basic principles of data wrangling using pandas as the dominant toolkit. Conversations held later in the course should also demonstrate how to get data into and out of an appropriately chosen database management system, for example.

Anscombe’s Quartet – IPython Notebook

Anyone who’s seen one of my talks that even touches on data and visualisation will probably know how I like to use Anscombe’s Quartet as a demonstration of why it makes sense to look at data, as well as to illustrate the notion of a macroscope, albeit one applied to a case of N=all where all is small…

Some time ago I posted a small R demo – The Visual Difference – R and Anscombe’s Quartet. For the new OU course I’m working on (TM351 – “The Data Course”), our focus is on using IPython Notebooks. And as there’s a chunk in the course about dataviz, I feel more or less obliged to bring Anscombe’s Quartet in:-)

As we’re still finding our way about how to make use of IPython Notebooks as part of an online distance education course, I’m keen to collect feedback on some of the ways we’re considering using the notebooks.

The Anscombe’s Quartet notebook has quite a simple design – we’re essentially just using the cells as computed reveals – but I’m still keen to hear any comments about how well folk think it might work as a piece of standalone teaching material, particularly in a distance education setting.

The notebook itself is on github (ou-tm351), along with sample data, and a preview of the unexecuted notebook can be viewed on nbviewer: Anscombe’s Quartet – IPython Notebook.

Just by the by, the notebook also demonstrates the use of pandas for reshaping the dataset (as well as linking out to a demonstration of how to reshape the data using OpenRefine) and the ŷhat ggplot python library (docs, code) for visualising the dataset.

Please feel free to post comments here or as issues on the github repo.

Confused Again About VM Ecology… I Blame Not Blogging

Via a cc’d tweet from Martin Hawksey, this lovely post from Tom Smith/@everythingabili (who has the best ever twitter bio strapline) on How I Learn ( And What I’m Learning ).

I like to think that I used to write blog posts that had the same sort of sense as that post…

…but for the last few months at least, I don’t think I have.

“Working” for once – starting production on an OU course (TM351, due out October 2015 (sic; I’m gonna be late on the first draft of the 7 weeks of the course I’m responsible for: it’s due in a little over a fortnight…), and also helping out on an investigative project the School of Data is partnering on – has meant that most of the learnings and failings that I used to blog about have been turned inward to emails (which are even more poorly structured in terms of note-taking than this blog is) if at all.

Which is a shame and makes me not happy.

Reading through completed academic papers, making useful (I hope) use of them in the course draft, has been quite fun in part – and made me regret at times not writing up work of my own in a self-contained, peer reviewed way over the last decade or so; getting stuff “into the record”, properly citable, and coherent enough to actually be cited. But as you pick away at the papers, you know they’re a revisionist telling, a straightforward narrative of how the pieces fit together and in which nothing went wrong along the way; (you also know that the closer you get to trying to replicate a paper, the harder it is to find the missing pieces (process, trick, equation or insight) that actually make it work; remember school maths, or physics, and the textbook that goes from one equation to the next with a “hence”, but there’s no way in hell you can figure out how to make that step and you know you’ll be stuck when that bit comes up in the exam…?! That. Or worse. When you bang your head against a wall trying to get something to work, contort your mental models to make it work, sort of, then see the errata list correcting that thing. That, too.)

On the other hand, this blog is not coherent, shapes no whole, but is full of hence steps. Disjointed ones, admittedly. But they keep a track of all the bits I tried at and failed at and worked around, and they keep on getting me out of holes… Like the emails won’t. Lost. Wasted effort because the learning fumblings that are OUseful learning fumblings are lost and locked up in email hell.

It makes me very not happy.

So that, by way of intro, to this: a quick catchup follow-up to Cursory Thoughts on Virtual Machines in Distance Education Courses and Doodling With IPython Notebooks for Education, a partial remembering of the various shades of hell associated with them and trying to share them.

Here’s what I think I now want to do (whether or not it’s the right thing I’m not sure).

  • generate a script that will build a VM. We’ve opted for Virtualbox as the VM container. The VM will need to contain: pandas; IPython notebook (course team want it to run Python 3.3. I’ve lost track of how many hours I’ve spent trying and failing to get Python libraries I think we need to run under Python 3.3; wasted effort; I should have settled with Python 2.7 and then revisited 3.3 several months hence; the 2.7-to-3.3 tweaks to any code we write for the course should be manageable in migration terms. Pratting around on libraries that I’m guessing will get patched as native distributions move to 3.3 by default but don’t work yet is wasted effort. October. 2015. First presentation.); PostgreSQL (perhaps with some extensions); mongodb; ipythonblocks; blockdiag; I came across shellinabox today and wonder if we should run that; OpenRefine (CT against this – I think it’s good for developing mental models); python-nvd3; folium; a ggplot port to python; (CT take – too much new stuff; my take, we should go as high up the stack as we can in terms of the grammar of the calling functions); I think we should run R and RStudio too to make for a really useful VM, making the point that the language doesn’t matter and we use whatever’s appropriate to get things done, but I don’t think anyone else does. if. Which computer language is that from then? for. Or that one? How about in? print? Cars are for driving. Mine’s blue. I have no idea how it works. Can I drive your car? The red one. With the left-hand drive.
  • access the services running on the headless VM via a browser on host. I think we should talk to the databases using Python, but I may be in the minority. We may need something more graphical to talk to postgresql. If we do, I would argue it should be a browser based client – if it’s an app, we’re moving functionality provision outside of the VM.
  • use the script to build two machines with the same package config; CT seem to prefer a single build on a 32 bit machine. I think we should support 64 bit as well. And deployment on at least one cloud service – I’d go for Amazon, but that’s mainly because it’s the only one I’ve tried. If we could support more, even better.
  • as far as maintenance goes, I wrote the vagrant script to update libraries whenever the provisioner is run (which is quite a lot at the mo as I keep finding new things to add to the box!;-) This may or may not be sensible for student use. If there is a bug in a library, an update could help. If there is a security patch to the kernel, we should be updating as good citizens. The current plan is to ship a built box (which I think would have to go on to a USB stick – we can’t rely on folk having computers with a DVD any more, and a 1.5GB download seems to be proving unreliable without a proper download manager). As it is, students will have to download virtualbox and vagrant, and install those themselves. (Unless we can put installers for them on a USB stick too.) If we do ship a built box, we need to think of some way of kickstarting the services and perhaps rebooting the machine (and then kickstarting the services). There is a separate question of whether we should also be able to update config scripts during presentation. This would presumably have to be done on the host. One way might be to put config scripts on a git server then use git to keep the config scripts on the students’ host machine up to date, but that would probably also require them to install a git commandline tool, even if we automated the rest (see the sketch after this list). Did I say this all has to run cross platform? Students may be on Windows (likely?), Mac or Linux. I think the course should be doable, if painfully, via a tablet, which means the VM needs the cloud hosted option. If we could also find a way to help students configure their whatever-platform host so that they could access services from the VM running on it via their tablet, so much the better.
  • files need to be shared between VM and host. This raises an interesting issue for a cloud hosted VM. Either we need to find a way to synch files between desktop and cloud VM, persist state on the cloud host so that the VM can synch to it, or pop dropbox into the cloud VM (though there would then be a synch delay, as there would with a desktop synch). I favour persisting on the cloud, though there is then the question of the student who is working on a home machine one day and a cloud machine the next.
  • Starting and stopping services: students need to be able to start and stop services running on the VM without having to ssh in. One click easy. A dashboard with buttons that show if a service is running or not, click a button to toggle the run state of the service. No idea how to do this.
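For the git-based config update idea mentioned in the list above, the host-side flow might look something like this (a sketch only: the repository URL is hypothetical, and it assumes students have a git commandline client installed and that the Vagrantfile lives in the cloned directory):

### First run – fetch the config scripts (hypothetical repo)
git clone https://github.com/example/tm351-config.git
cd tm351-config

### Subsequent updates – pull the latest scripts and reapply them
git pull
vagrant provision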

Here’s the approach I’ve taken:

  • reusing DataminerUK’s infinite-interns model as a starting point, I’m using vagrant to build and provision a VM using puppet. At the moment I have duplicate setups for two different Linux base boxes (precise64 and raring64; the plan is to move to the latest Ubuntu LTS). I really should have a single setup with the different machine setups called by name from a single Vagrantfile. I think.
  • The puppet provisioner builds the box from a minimal base and starts the services running. It’s aggressive on updates. The precise64 box is running python 2.7 and the raring64 box 3.3. Getting pip3 running in the raring box was a pain, and I don’t know how to tell puppet to use the pip3 thing I eventually installed to update. At the moment I fudge with:
    exec { "pip3-update":
    command => "/usr/local/bin/pip3 install --upgrade pip"
    }

    but it breaks because I’m not convinced that is always the right path (I’d like to hedge on /usr/bin:/usr/local/bin), or that pip3 is actually installed when I try to exec it… I think what I really want to do is something like
    package { [
        JUST UPGRADE YOURSELF, PLEASE
    ]:
        ensure   => latest,
        provider => 'pip3';
    }

    with an additional dependency check (require =>) that pip3 has been installed first, and a corresponding requirement on all the other pip3 package installs that pip3 has been upgraded first.
  • The IPython notebook is started by a config shell script called from puppet. I think I’m also using a config script to set up a user and test tables in Postgres (though I plan to move to the puppet extension as soon as I can get round to reading the docs/finding out how to do it).
  • There are times when a server needs restarting. At the moment I have to run vagrant provision to do that – or even vagrant halt; vagrant up, which means it also runs the updates. It’d be nice to just be able to run the bits that restart the services (the DBMSs, IPython notebook etc.) without doing any of the package updates, installs, checks etc. (see the sketch after this list).
  • We need a tool to check whether services are running on the necessary ports, to help debugging without requiring a user to SSH into the VM (see the sketch after this list); I’ve also fixed on default ports. We really need a way to change to a free port if a default port is already in use, and then somehow tell the IPython notebook scripts which port each service is running on. With vagrant letting you run a VM from within a particular directory, being able to know which VMs are being run and from where, wherever vagrant on host started them, would be useful.
  • I don’t use a configurator for the postgres db (it needs seeding with some example tables) but should do – on my to do list is to look at https://github.com/puppetlabs/puppetlabs-postgresql . Similarly for mongo db – and perhaps https://github.com/puppetlabs/puppetlabs-mongodb
  • To make use of python-nvd3, suggested route is to use bower. I got the npm package manager to work but have failed to find a way of installing any actual packages [issue].
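For the restarting and port-checking points above, a host-side helper might look something like this (a sketch only: the service names and port numbers are assumptions based on the stack described above, and it assumes the relevant ports are forwarded to localhost):

### Restart individual services inside the VM over vagrant's ssh channel
vagrant ssh -c 'sudo service postgresql restart'
vagrant ssh -c 'sudo service mongodb restart'

### Check from the host whether the expected ports are answering
for port in 8888 5432 27017; do
    nc -z localhost "$port" && echo "port $port: up" || echo "port $port: DOWN"
done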

Debugging issues to date – aside from things like port clashes, and all manner of f**k ups because I distributed a README with an error in it and folk tried to follow it rather than the patches posted elsewhere – has been impeded by not having a good way of logging and sharing errors. OU network issues have also contributed to the fun. I always avoid the OU staff network, but nothing seems to work on that. I suspect this is a proxy issue, but I’m unwilling to invest any time exploring it or how to config the VM to cope (no-one else has offered to go down this rat hole). Poxy proxies could well be an issue for some students, but I’m not sure what the best way forward is. Cloud hosted VMs?!

I also had a problem on OU eduroam – mongodb wants to get something signed from a keyserver before it will install mongodb, but OU eduroam (the network I always use) seems to block the keyserver. Tracking that down wasted a good hour.

Here are some other things I’ve heard about:

https://github.com/psychemedia/notebookcloud This is cloned from https://github.com/carlsmith who appears to have taken his repo – and the app – down? It provided a dashboard for firing up notebook servers on Amazon cloud. If I hadn’t been ‘working’ I’d have blogged screenshots and the workflow. As it is, all I have are vague memories of how it worked and what it did and the ideas that sprung off of having an artefact to talk around. [Hmm… app seems to have come back up – maybe I caught it at a bad time… https://notebookcloud.appspot.com/login ]

– provisioning things: chef, vagrant, puppet, docker.

What should I be using for what?

I thought about different VMs for different services, but that adds too much VM weight and requires networking between the VMs, which could lead to “support issues”. Would docker help here? A base VM built from vagrant and puppet, then docker to put additional machines on top.

What I want is for students to be able to:

– install minimum number of third party apps on their desktop (currently virtualbox and vagrant)
– click one button to get their VM. My feeling is we should have a largely prebuilt box on a USB stick they can use as a base box, then do a top-up build and provision. I suspect CT would like one click somewhere to fire up a machine, get services running, and open a tab to the IPython notebook in their browser, maybe a status button somewhere, a single button to restart any services that have stopped and options to suspend or shutdown machines. In terms of student workflow, I think suspending and resuming machines (if services can resume appropriately) would be the neatest workflow. Note: the course runs over 9 months…
– be able to access files on host that are used in the VM. If they are using multiple VMs (eg on cloud and desktop), find a sensible way of synching notebooks and data/database state across those machines; which could be tricky at least as far as database state goes.
– if a student is not using postgresql or mongo – and they won’t for the first 8 weeks of the course – it could be useful to run the VM without those services running (perhaps aside from a quick self-test in week 1 so we can check out any issues as early as possible and give ourselves a week or two to come up with any patches before those apps are used in anger). So maybe a control panel to fire up the VM and the services you want to run. Yes mongo, no postgresql. No DB at all. And so on. Would docker help here? Decide on host somehow which services to run, fire up the VM, then load in and start up the required services. Next session, change which services are running in the VM?

All in all, I reckon I’m between 20 and 40% there (further along in terms of time?) but not sure how much further I can push it to the desired level of robustness without learning how to do this stuff a bit more properly… I’m also not really sure what’s practically and reliably possible and what’s best for what. We have to maximise the robustness of stuff ‘just working’ and minimise support issues because students are at a distance and we don’t know what platform they’re on. I think if I started from scratch and rewrote the scripts now they’d be a lot clearer, but I know that’d take half a day and the functional return – for now – I think would be minimal.

That said, I’ve done a fair amount of learning, though large chunks of it have been lost to email and not blogging. Which is a shame. Must do better. Must take public notes.