Moving PostgreSQL and MongoDB Data Directories to User Home

In the hosted Open Computing Lab environment we now use to deliver module based docker containers, persistent storage is mounted into a launched container at /home/ou/MODULECODE-PRESENTATIONCODE. For modules that use databases students might modify, this means we need to move the database data directories somewhere under the user’s home directory.

Here’s a copy of the script I’ve been dabbling with to try to do that. It runs when the container is started, after the home directory volume has been mounted:

#! /bin/bash

# Location: /etc/ou_local_db_path/local_db_path.sh

LOCAL_HOME=/home/ou/TM351-23J

mkdir -p $LOCAL_HOME/openrefine
chown ou:users $LOCAL_HOME/openrefine

# If required, don't run any more of this script
if [ -f "$LOCAL_HOME/.no_local_db_path" ]; then
    echo "Not running local db scripts"
    exit 0
fi

## DATABASE MIGRATION

# The Mongo and Postgres data directories can be mounted 
# to the persistent storage that is provided by the Open Computing Lab.
#
# Migration happens if:
#
# - /home/ou/MODULE-PRESENTATION/.no_mount does not exist
# - /home/ou/MODULE-PRESENTATION/.local_DBTYPE does not exist (for DBTYPE postgres, mongo)
#
# The .no_mount file will be clobbered if a volume is mounted over /home/ou/MODULE-PRESENTATION/ ;
# this means databases *will* be copied when the container is run in the VCE.
# Databases *will* be copied to host for locally run containers, unless the user
# adds a .no_mount file to the directory they mount onto /home/ou/MODULE-PRESENTATION/

# See for example:
# https://github.com/OpenComputingLab/vce-jupyter-stacks/blob/main/tm351-notebook-jh/start_jh_extras

# Create a db hidden storage dir
mkdir -p $LOCAL_HOME/.db

# Migrate postgres db to local userdir
if [[ ! -f "$LOCAL_HOME/.local_postgres" && ! -f "$LOCAL_HOME/.no_mount" ]] ; then
    echo "Copying over postgres database files to $LOCAL_HOME/.db/"

    service postgresql stop

    # Add postgres to the users group so it can see into the user's home directory
    usermod -aG users postgres

    # Recursive copy, preserving permissions
    cp -Rp /var/lib/postgresql $LOCAL_HOME/.db/

    # Manual settings
    #chown -R postgres:postgres $LOCAL_HOME/.db/postgresql
    #chmod -R 700 $LOCAL_HOME/.db/postgresql/

    sed -e "s@[#]\?data_directory = .*@data_directory = '/home/ou/TM351-23J/.db/postgresql/14/main'@g" -i '/etc/postgresql/14/main/postgresql.conf'
 
    touch $LOCAL_HOME/.local_postgres
    chown ou:users $LOCAL_HOME/.local_postgres

    service postgresql restart
fi

#if [ ! -d "$LOCALMONGO" ]; then
if [[ ! -f "$LOCAL_HOME/.local_mongo"  && ! -f "$LOCAL_HOME/.no_mount" ]]; then
    echo "Copying over mongo database files to $LOCAL_HOME/.db/"

    LOCALMONGO="$LOCAL_HOME/.db/mongo/"

    # Stop the mongo service if it appears to be running
    if [ -f "/var/run/mongodb.pid" ]; then
        service mongod stop
    fi

    mkdir -p $LOCALMONGO
    # Recursive copy, preserving permissions
    cp -Rp /var/db/data/mongo $LOCAL_HOME/.db

    sed -e "s@[#]\?dbPath: .*@dbPath: /home/ou/TM351-23J/.db/mongo@g" -i '/etc/mongod.conf'

    # Check permissions
    #chmod -R u+rw $LOCALMONGO

    touch $LOCAL_HOME/.local_mongo
    chown ou:users $LOCAL_HOME/.local_mongo
    
    service mongod restart
fi

The script is run using sudo, with permissions set as follows in the image build script:

echo "ou ALL=(ALL:ALL) NOPASSWD: /etc/ou_local_db_path/local_db_path.sh" >> /etc/sudoers

Empty .local_postgres and .local_mongo files are used to flag whether or not the files have already been copied over for each database; a separate .no_mount file can also be used to disable moving the databases. The .no_mount file is created as part of the image build:

touch /home/ou/TM351-23J/.no_mount
chown ou:users /home/ou/TM351-23J/.no_mount

If a persistent volume is mounted in, the .no_mount file will be clobbered (unless it is also present in the directory that is mounted in); if no volume is mounted in, the data directories are not moved from their default locations.
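
For a locally run container, opting out of the migration looks something like the following (the image name is illustrative and other docker run flags are omitted):

# Create a local work directory and opt out of the database migration
mkdir -p tm351-local
touch tm351-local/.no_mount

# Mount it over the module home directory when launching the container
docker run -v "$PWD/tm351-local:/home/ou/TM351-23J" EXAMPLE/tm351-vce-image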

The approach used above is wasteful of space — the original data directories as shipped are copied rather than moved, but this means we can recover the original seeded databases if required.

Deconstructing the TM351 Virtual Computing Environment via VS Code

For 2020J, which is to say the October 2020 presentation of our TM351 Data Management and Analysis course, we’ve deprecated the original VirtualBox packaged virtual machine and moved to a monolithic Docker container that packages all the required software applications and services (a Jupyter notebook server, PostgreSQL and MongoDB database servers, and OpenRefine).

As with the VM, the container is headless and exposes applications over http via browser based user interfaces. We also rebranded away from “TM351 VM” to “TM351 VCE”, where VCE stands for Virtual Computing Environment.

Once Docker is installed, the environment is installed and launched purely from the command line using a docker run command. Students arriving early in the forums have suggested moving to docker compose, which simplifies the command line invocation significantly, but at the cost of having to supply a docker-compose.yaml. With OU workflows, it can take weeks, if not months, to get files onto the VLE for the first time, and days to weeks to post updates (along with a host of news announcements and internal strife about the possibility of tutors/ALs and students having different versions of the file). As we need to support cross-platform operation, and as the startup command specifies file paths for volume mounts, we’d need different docker-compose files (I think?) because file paths on Mac/Linux hosts, versus Windows hosts, use a different file path syntax (forward vs back slashes as path delimiters). [If anyone can tell me how to write a docker-compose.yaml file with arbitrary paths on the volume mounts, please let me know via the comments…]
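
(One possible approach, which I haven’t tested across platforms, is docker-compose’s environment variable substitution, so the host path is supplied via a .env file or shell variable rather than hardcoded in the compose file; a sketch, with the image name illustrative:)

# docker-compose.yaml (sketch)
version: "3"
services:
  tm351vce:
    image: EXAMPLE/tm351-vce   # illustrative image name
    ports:
      - "35180:8888"
    volumes:
      # TM351_NOTEBOOKS can be set in a .env file next to this compose file;
      # if unset, the relative ./notebooks directory is used
      - "${TM351_NOTEBOOKS:-./notebooks}:/notebooks"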

Something else that has cropped up early in the forums is mention of VS Code, which presents a way to personalise the way in which the course materials are used.

By default, the course materials we provide for practical activities are all based on Jupyter notebooks, delivered via the Jupyter notebook server in the VCE (or via an OU hosted notebook server we are also exploring this year). The activities are essentially inlined using notebook code cells within a notebook that presents a linear teaching text narrative.

Students access the notebooks via their web browser, wherever the notebook server is situated. For students running the Docker VCE, notebook files (and OpenRefine project files) exist in a directory on the student’s own computer that is then mounted into the container; make changes to the notebooks in the container and those changes are saved in the notebooks mounted from host. Delete the container, and the notebooks are still on your desktop. For students using the online hosted notebook server, there is no way of synchronising files back to the student desktop, as far as I am aware; there was an opportunity to explore how we might allow students to use something like private Github repositories to persist their files in a space they control, but to my knowledge that has not been explored (a missed opportunity, to my mind…).

Using the VS Code Python extension, students installing VS Code on their own computer can connect to the Jupyter server running in the containerised VCE (I don’t know if the permissions allow this on the hosted server).

The following tm351vce.code-workspace file describes the required settings:

{
    "folders": [
        {
            "path": "."
        }
    ],
    "settings": {
        "python.dataScience.jupyterServerURI": "http://localhost:35180/?token=letmein"
    }
}

The VSCode Python extension renders notebooks, so students can open local copies of files from their own desktop and execute code cells against the containerised kernel. If permissions on the hosted Jupyter service allow remote/external connections, this would provide a workaround for syncing notebook files: students would work with notebook files saved on their own computer but executed against the hosted server kernel.

Queries can be run against the database servers via the code cells in the normal way (we use some magic to support this for the postgres database).

If we make some minor tweaks to the config files for the PostgreSQL and MongoDB database servers, we can use the VS Code PostgreSQL extension and MongoDB extension to run queries from VS Code directly against the databases.
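
(I haven’t spelled out the tweaks here, but I’d expect them to amount to letting the database servers accept connections from outside the container and publishing their ports; the config paths and ports in the following sketch are the stock Debian/Ubuntu defaults rather than necessarily the VCE’s own settings:)

# Let PostgreSQL accept connections from outside the container
sed -i "s/^#\?listen_addresses = .*/listen_addresses = '*'/" /etc/postgresql/12/main/postgresql.conf
# (a matching pg_hba.conf rule is also needed to allow the remote login)

# Let MongoDB bind to all interfaces
sed -i "s/^  bindIp: .*/  bindIp: 0.0.0.0/" /etc/mongod.conf

# ...and publish the database ports when the container is launched, e.g.
#   docker run ... -p 5432:5432 -p 27017:27017 ...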

For example, the postgres database:

[screenshot: running a SQL query against the postgres database from the VS Code PostgreSQL extension]

and the mongo database:

[screenshot: running a query against the mongo database from the VS Code MongoDB extension]

Note that this is now outside the narrative context of the notebooks, although it strikes me that we could generate .sql and .json text files from notebooks that show code literally and comment out the narrative text (the markdown text in the notebooks).

However, we wouldn’t be able to work directly with the data returned from the database via Python/pandas dataframes, as we do in the notebook case. (Note also that in the notebooks we use a Python API for querying the mongo database, rather than directly issuing Javascript based queries.)

At this point you might ask why we would want to deconstruct / decompose the original structured notebook+notebook UI environment and allow students to use VS Code to access the computational environment, not least when we are in the process of updating the notebooks and the notebook environment to use extensions that add additional style and features to the user environment. Several reasons come to my mind that are motivated by finding ways in which we can essentially lose control, as educators, of the user interface whilst still being reasonably confident that the computational environment will continue to perform as we intend (this stance will probably make many of my colleagues shudder; I call it supporting personalisation…):

  • we want students to take ownership of their computational environment; this includes being able to access it from their own clients that may be better suited to their needs, eg in terms of familiarity, accessibility, productivity, etc;
  • a lot of our students are already working in software development and already have toolchains they are working with. Whilst we see benefits of using the notebook UI from a teaching and learning perspective, the fact remains that students can also complete the activities in other user environments. We should not hinder them from using their own environments — the code should still continue to run in the same way — as long as we explain how the experience may not be the same as the one we are providing, and also noting that some of the graphics / extensions we use in the notebooks may not work in the same way, or may not even work at all, in the VS Code environment.

If students encounter issues when using their own environment, rather than the one we provide, we can’t offer support. If the personalised learning environment is not as supportive for teaching and learning as the environment we provide, it is the student’s choice to use it. As with the Jupyter environment, the VS Code environment sits at the centre of a wide ecosystem of third party extensions. If we can make our materials available in that environment, particularly for students already familiar with it, they may be able to help us by identifying and demonstrating new ways, perhaps even more effective ways than the environment we provide, of using the VS Code tooling to support their learning. (One example might be the better support VS Code has for code linting and debugging, which are things we don’t teach, and that our chosen environment perhaps even prevents students who know how to use such tools from making use of. Of course, you could argue we are doing students a service by grounding them back in the basics where they have to do their own linting and print() statement debugging… Another might be the Live Share/collaboration service that lets two or more users work collaboratively in the same notebook, which might be useful for personal tutorial sessions etc.)

From my perspective, I believe that, over time, we should try to create materials that continue to work effectively to support both teaching and learning in environments that students may already be working in, and not just the user interface environments we provide, not least because we potentially increase the number of ways in which students can see how they might make use of those tools / environments.

PS I do note that there may be licensing related issues with VS Code and the VS Code extensions store, which are not as open as they could be; VSCodium perhaps provides a way around that.

Dockerising / Binderising the TM351 Virtual Machine

Just before the Christmas break, I had a go at recasting the TM351 VM as a Docker container built from a Github repository using MyBinder (which is to say: I had a go at binderising the VM…). Long time readers will know that this virtual machine has been used to deliver a computing environment to students on the OU TM351 Data Management and Analysis course since 2016. The VM itself is built for VirtualBox, provisioned using vagrant, and then distributed originally via a mailed out USB stick or alternatively (which is to say, unofficially; though my preferred route) as a download from VagrantCloud.

The original motivation for using Vagrant was a hope that we’d be able to use a range of provisioners to construct VM images for a range of virtualisation platforms, but that’s never happened. We still ship a Virtualbox image that causes problems for a small number of Windows users each year, rather than a native HyperV image, because: a) I refuse to buy a Windows machine so I can build the HyperV image myself; b) no-one else sees benefit from offering multiple images (perhaps because they don’t provide the tech support…).

For all our supposed industrial scale at delivering technology backed “solutions”, the VM is built, maintained and supported on a cottage industry basis from within the course team.

For a scalable solution that would work:

a) within a module presentation;
b) across module presentations;
c) across modules

I think we should be looking at some sort of multi-user hosted service, with personal accounts and persistent user directories. There are various ways I can imagine delivering this, each of which creates its own issues as well as solving particular problems.

As a quick example, here are two possible extremes:

1) one JupyterHub server to rule them all: every presentation, every module, one server. JupyterHub can be configured to use the DockerSpawner to present different kernel container options to the user (although I’m not sure if this can be personalised on a per user basis? If not, that feature would make for a useful contribution back…), so a student could be presented with a list of containers for each of their modules (a config sketch follows this list).

2) one JupyterHub server per module per presentation: this requires more admin and means servers everywhere, but it separates concerns…
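
For what it’s worth, recent versions of DockerSpawner let you offer a menu of images via an allowed_images setting; I haven’t checked whether it can be filtered per user, so treat this jupyterhub_config.py fragment as a sketch (the second image name is just illustrative):

# jupyterhub_config.py fragment (sketch)
c.JupyterHub.spawner_class = "dockerspawner.DockerSpawner"

# A dict maps human readable labels to container images;
# the user is shown these as options when their server is spawned
c.DockerSpawner.allowed_images = {
    "TM351 20J": "ousefuldemos/tm351-binderised:latest",
    "Another module": "EXAMPLE/other-module-image:latest",
}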

The experimental work on a “persistent Binderhub deployment” also looks interesting, offering the possibility of launching arbitrary environments (as per Binderhub) against a personally mounted file area (as per JupyterHub).

Providing a “takeaway” service is also one of my red lines: a student should be free to take away any computing environment we provide them with. One in-testing hosted version of the TM351 VM comes, I believe, with centralised Postgres and MongoDB servers that students have accounts on and must log in to. Providing a multi-user service, rather than a self-contained personal server, raises certain issues regarding support, but it also denies the student the ability to take away the database service and use it for their own academic, personal or even work purposes. A fundamentally wrong approach, in my opinion. It’s just not open.

So… binderising the VM…

When Docker containers first made their appearance, best practice seemed to be to have one service per container, and then wire containers together using docker-compose to provide a more elaborate environment. I have experimented in the past with decoupling the TM351 services into separate containers and then launching them using docker-compose, but it’s never really gone anywhere…

In the communities of practice that I come across, more emphasis now seems to be on putting everything into a single container. Binderhub is also limited to launching a single container (I don’t think there is a Jupyter docker-compose provisioner yet?) so that pretty much seals it… All in one…

A proof-of-concept Binderised version of the TM351 VM can be found here: innovationOUtside/tm351vm-binder.

It currently includes:

  • an OU branded Jupyter notebook server running jupyter-server-proxy;
  • the TM351 Python environment;
  • an OpenRefine server proxied using jupyter-server-proxy;
  • a Postgres server seeded (I think? Did I do that yet?!) with the TM351 test db (if I haven’t set it up as per the VM, the code is there that shows how to do it…);
  • a MongoDB server serving the small accidents dataset that appears in the TM351 VM.

What is not included:

  • the sharded Mongo DB activity; (the activity it relates to as presented at the moment is largely pointless, IMHO; we could demonstrate the sharding behaviour with small datasets, and if we did want to provide queries over the large dataset, that might make sense as something we host centrally and let students log in to query. Which would also give us another teaching point.)

The Binder configuration is provided in the binder/ directory. An Anaconda binder/environment.yml file is used to install packages that are complicated to build or install otherwise, such as Postgres.
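
(I haven’t reproduced the repo’s actual file here, but a minimal sketch of that sort of environment.yml, assuming the postgresql package is pulled from conda-forge, might look like this:)

# binder/environment.yml (sketch)
channels:
  - conda-forge
dependencies:
  - python=3.7
  - postgresql
  - pip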

The binder/postBuild file is run as a shell script responsible for the following (a sketch of the sort of thing it contains follows the list):

  • configuring the Postgres server and seeding its test database;
  • installing and seeding the MongoDB database;
  • installing OpenRefine;
  • installing Python packages from binder/requirements.txt (the requirements.txt is not otherwise automatically handled by Binderhub — it is trumped by the environment.yml file);
  • enabling required Jupyter extensions.
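
The following fragment gives a flavour of what postBuild does; it’s a sketch reconstructed from memory rather than a copy of the file in the repo, and the seed file names, versions and extension names are illustrative:

#!/bin/bash
# binder/postBuild (sketch)

# Initialise the Postgres server into the user's home directory
# (the postgresql binaries come from the conda environment)
initdb -D $HOME/pgsql/data
pg_ctl -D $HOME/pgsql/data -l $HOME/pgsql/postgres.log start
createdb tm351test
psql tm351test < ./seeds/tm351_test_db.sql
pg_ctl -D $HOME/pgsql/data stop

# Download MongoDB into the home directory and seed it
# (details omitted; essentially: fetch the tarball, unpack it,
#  start mongod against a local --dbpath, mongorestore the accidents dump,
#  then shut it down again)

# Install OpenRefine
wget -q https://github.com/OpenRefine/OpenRefine/releases/download/3.3/openrefine-linux-3.3.tar.gz
tar xzf openrefine-linux-3.3.tar.gz -C $HOME && rm openrefine-linux-3.3.tar.gz

# Install Python packages not otherwise handled by Binderhub
pip install -r binder/requirements.txt

# Enable required Jupyter notebook extensions
jupyter nbextension enable --py widgetsnbextension --sys-prefix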

If any files handled via postBuild need to be persisted, they can be written into $CONDA_DIR.

(As a reference, I have also created some simple standalone template repos showing how to configure Postgres and MongoDB in Binderhub/repo2docker environments. There’s also a neo4j demo template too.)

The binder/start file is responsible for the following (again, a sketch follows the list):

  • defining environment variables and paths required at runtime;
  • starting the PostgreSQL and MongoDB database services.
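
A minimal sketch of that sort of start script, assuming the database layouts suggested in the postBuild sketch above, might look like the following (repo2docker expects the script to hand over to the passed in command via exec at the end):

#!/bin/bash
# binder/start (sketch; paths are illustrative)

# Runtime environment variables / paths
export PGDATA=$HOME/pgsql/data
export PATH=$HOME/openrefine-3.3:$PATH

# Start the database services in the background
pg_ctl -D $PGDATA -l $HOME/pgsql/postgres.log start
mongod --fork --dbpath $HOME/mongodb/data --logpath $HOME/mongodb/mongod.log

# Hand over to the notebook server start command
exec "$@"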

(OpenRefine is started by the user from the notebook server homepage or JupyterLab. There’s a standalone OpenRefine demo repo too…)

Launching the repo using MyBinder will build the TM351 environment (if a Binder image does not already exist) and start the required services. The repo can also be used to build an environment locally using repo2docker.

As well as building a docker image within the Binderhub context, the repo is also automated with a Github Action that is used to build release commits using repo2docker and then push the resulting container to Docker Hub. The action can be found in the .github/workflows directory. The container can be found as ousefuldemos/tm351-binderised:latest. When running a container derived from this image, the Jupyter notebook server runs on the default port 8888 inside the container, and the OpenRefine application is proxied through it; the database services should autostart. The notebook server is started with a token required, so you need to spot the token from the start up logs of the container – which means you shouldn’t run it with the -d flag. A variant of the following command should work (I’m not sure how you reliably specify the correct $PWD (present working directory) mount directory from a Windows command prompt):

docker run --name tm351test --rm -p 8895:8888 -v $PWD/notebooks:/notebooks -v $PWD/openrefine_projects:/openrefine_projects ousefuldemos/tm351-binderised:latest
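
(For what it’s worth, I believe the cmd.exe equivalent of $PWD is %cd%, and ${PWD} works in PowerShell, so something like the following should work from a Windows command prompt, though I haven’t tested it:)

docker run --name tm351test --rm -p 8895:8888 -v %cd%\notebooks:/notebooks -v %cd%\openrefine_projects:/openrefine_projects ousefuldemos/tm351-binderised:latest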

Probably easier is to use the Kitematic inspired containds “personal Binderhub” app, which can capture and handle the token automatically and let you click straight through into the running notebook server. Either use containds to build the image locally by providing the repo URL, or select a new image and search for tm351: the ousefuldemos/tm351-binderised image is the one you want. When prompted, select the “standard” launch route, NOT the ‘Try to start Jupyter notebook’ route.

Although I’ve yet to try it (I ran out of time before the break), I’m hopeful that the prebuilt container should work okay with JupyterHub. If it does, this means the innovationOUtside/tm351vm-binder repo can serve as a template for building images that can be used to deploy authenticated OU computing environments via an OU authenticated and OU hosted JupyterHub server (one can but remain hopeful!).

If you try out the environment, either using MyBinder, via repo2docker, or from the pre-built Docker image, please let me know either here, via the repo issues, or howsoever: a) whether it worked; b) whether it didn’t; c) whether there were any (other) issues. Any and all feedback would be much appreciated…

Getting the TM351 VM Running on OU OpenStack

One of the original motivations for delivering the TM351 software and services via a virtual machine, with user interfaces provided via a browser, was that we should be able to use the same VM  as a locally run machine on a student’s own computer, or as a hosted machine (accessible via the web) running on an OU server.

A complementary third year equivalent course, TM352 Web, Mobile and Cloud Technologies, uses a Faculty managed OpenStack instance as a dogfooding teaching environment on that course (students learn about cloud stuff in the course, get to deploy some canned machines and develop their own services using OpenStack, and the department develops skills in deploying and managing such environments with hundreds of users).

I think part of the pitch for the OpenStack cluster was that it would be available to other courses, but a certain level of twitchiness in keeping it stable for the original course use case has meant that getting access to the machine has not been as easy as it might have been.

(There is no dev server that I can access, at least not from a connection outside the OU network. So the only server I can play on is the live server, as used by students. If you’re confident managing OpenStack, this is probably fine (it should be able to cope with lots of tenants with different requirements, right?), but if you’re not, making a dev server, open to all who want to try it out, and available sooner rather than later, probably makes more sense: more people solving problems, more use cases being explored and ruled out, more issues being debugged; more learning going on generally…)

Whatever.

I’ve finally got an account, and a copy of the TM351 VM image, originally built for VirtualBox, uploaded to it.

You’d think that part at least would have been easy, but it took the best part of four months or so at least… First, getting an account on the OpenStack server. Second, getting a copy of the TM351 VM image that could be loaded onto it. I got stuck going nowhere trying to convert the original Virtualbox image until it was pointed out to me that there was a VBoxManage tool for doing it (Converting between VM Formats). Faculty advice suggests the clonehd command:

vboxmanage clonehd box-disk001.vmdk /Users/USER/Desktop/tm351.img --format raw

but that looks deprecated in recent versions of VirtualBox to me… The following seems more contemporary:

VBoxManage clonemedium ~/VirtualBox\ VMs/tm351_18J-student/box-disk001.vmdk tm351_18J-student.raw --format RAW

Third, loading the image onto OpenStack. A raw box format image I thought I had managed to create myself came in at 64GB (the original box was ~8GB), but it seems this is because that’s the size of the virtual disk. Presumably vagrant is setting this in my original build (or VirtualBox is defaulting to it?), so one thing I need to figure out is how to reduce it without compromising anything. Looking at Resizing Vagrant box disk space, I wonder if we could move along steps from vmdk to vdi, resize, and then convert to raw?

Uploading a 64 GB image from home to OpenStack using an http file uploader on the OpenStack user admin page is just asking for trouble, but even copying the image from OU networked machines is not just-do-it-able: it requires copying the file from one machine to another and then onto the OpenStack server by someone-not-me with the appropriate logins and scp permissions.

(Building the machine on OpenStack myself using an OpenStack vagrant provisioner is not an option on the live server at least: API access addresses seem to only be provided for a private network that I don’t have access to. If we manage to get a development server that I am allowed to access using VPN, or even better, without VPN, and I can get permissions to use the API, and we can connect to things like the apt-get and Pypi/pip repos, using a build provisioner makes sense to me.)

So there is now an image visible on the OpenStack server.

You’ll note we haven’t tried to brand the OpenStack user’s admin panel at all  (I would have…;-).

What next? Trying to spin up an instance from the image kept giving me errors (I started trying with a small machine instance, then tried creating an instance with ever larger machine flavours) — the issue was indeed the 64GB default disk size associated with the image. Faculty IT changed a setting that meant the larger disk sizes would spin up and reported that it worked for them with the VM on a large flavour machine.

But it didn’t for me… I kept getting the message [Error: Exceeded maximum number of retries. Exhausted all hosts available for retrying build failures for instance XXXXXX]. I think the issue must have been a permissions thing manifested as a network thing. Faculty IT restarted the image as private to me (and with my own private network?) and I tried again… (For this reason, I’m not convinced that anyone else just given an account will be able to get their own version of the TM351 VM up and running. I need to understand better what requirements, if any, are placed on the creation of the OpenStack user account for it to work. And I need a second test user account (at least) to test it…)

Anyway – success for me – a running instance of the TM351 VM. And now I could use the OpenStack web console to log in to the machine using the default vagrant credentials. Which I need to change… (and find a sensible method for students to use to change the defaults).

So now I can poke around inside the VM. But I can’t see any of the services it’s running for a couple of reasons: firstly, the VM has no public IP address; secondly, the only port I think I’m allowed to expose publicly is port 80, and there are no services running on port 80. And unlike vagrant and docker, which make it easy to map and expose an arbitrary port inside the VM onto a specified port outside the VM, such as port 80, I haven’t found a way to do that in OpenStack. (The documentation sucks. Really badly. And there is no internal FAQ to give me even the slightest crib as to what to do next.)

The TM352 course materials come to my rescue here, sort of. As OU central academic staff, I can log in to course VLEs and see the published teaching material, although not the student forums. Looking in the current presentation, the materials that show TM352 students how to make their VM visible to the world haven’t been released yet, so I can’t see them… Bah… But I can look at the materials provided to students on the previous presentation… Which are out of date compared to the current version of OpenStack. But never mind, because the materials are enough of a crib to figure out what to do where-ish: Block 2 Part 2: Designing a cloud, 8 Getting started with OpenStack. The essential steps boil down to the following (apols for the vagueness; I don’t want to actually re-step through everything to check it works in case I break my current instance; next time I run through from scratch, I’ll tidy up the instructions, ideally doing a fresh run through in a new, virgin test user account; a rough openstack command line equivalent is sketched after the list):

  1. Create a new private network for the VM to run on: I seemed to have a network already created, but here’s a howto: under Network, select the Networks option, and then Create Network with the  Admin State as UP (i.e. running and usable) and the Create Subnet box ticked. Use IP/v4 and set an IP address range in CIDR format (e.g. 192.168.0.0/24);
  2. Create a router that interconnects the public network and the private network: from the Network menu select Routers . Set Admin State to UP and External Network to public then Create Router. In the Network Topology view, select the router and then Add Interface, using Subnet set to the private created network and the IP Address left blank.
  3. Configure the network security rules: from Network select Security Groups ; if there’s no default group create one; once there is, select Manage Rules. We need to add three rules:
Rule: ALL TCP; Direction: Ingress; Remote: CIDR; CIDR: 0.0.0.0/0
Rule: ALL ICMP; Direction: Ingress; Remote: CIDR; CIDR: 0.0.0.0/0
Rule: HTTP; Remote: CIDR; CIDR: 0.0.0.0/0
  4. Create a VM instance from the TM351 image: bearing in mind the previous set-up, choose appropriately!
  5. Attach a public IP address to the VM: in Network select Floating IPs and then Allocate IP to Project. With the new floating IP address, select Associate and choose appropriately.
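
For the record, I think the equivalent steps from the openstack command line client look roughly like the following, though I haven’t tested this against the Faculty cluster and the names and flavour are illustrative:

# Private network and subnet
openstack network create tm351net
openstack subnet create --network tm351net --subnet-range 192.168.0.0/24 tm351subnet

# Router joining the private network to the public one
openstack router create tm351router
openstack router set --external-gateway public tm351router
openstack router add subnet tm351router tm351subnet

# Security group rules (ssh, http, ping)
openstack security group rule create --proto tcp --dst-port 22 --remote-ip 0.0.0.0/0 default
openstack security group rule create --proto tcp --dst-port 80 --remote-ip 0.0.0.0/0 default
openstack security group rule create --proto icmp --remote-ip 0.0.0.0/0 default

# Launch an instance from the uploaded image and attach a floating IP
openstack server create --image tm351 --flavor m1.large --network tm351net tm351vm
openstack floating ip create public
openstack server add floating ip tm351vm FLOATING.IP.ADDR.HERE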

Hopefully now there should be a public IP address associated with the VM and ports 80 and 22 (ssh) exposed. Using the public IP address, from a terminal on my own local machine:

ssh vagrant@VM.IP.ADDR.ESS

followed by the password, and I should be in…

(I can’t help thinking that typing vagrant up is a much easier way to launch a VM. And then vagrant ssh to SSH in…)

Next step – try to see the public services running inside the VM, bearing in mind that we can only access services through port 80.

To test things, we can just try a simple http server on port 80:

sudo python3 -m http.server 80

That works, so port 80 is live on my VM and I can see it from the public internet. So kill the test http server…

Running Everything Through Port 80

Running services inside the VM against port 80 requires them to run as root (ports <1024 are privileged), but in the last rebuild of the VM we tried to move away from running everything as root and instead run things under a user account. Which means that the Jupyter server is defined to run under a user account on a non-privileged port.

I went round in circles on this one for getting on for an hour, trying to run Jupyter notebooks on port 80, but running into permissions errors accessing port 80 unless I ran the service as root.  (Things like tail /var/log/syslog helped in the debugging…)

I also had to manually fix the missing notebook directory that the notebook service is supposed to start in. (I think this is another permissions snafu – the service runs as a user, but the mkdir guard run via ExecStartPre needs a permissions tweak to run as root using PermissionsStartOnly=true (issue).)

The simplest thing to do is run a proxy like nginx. Which isn’t installed in the VM. No problem, the vagrant user I ssh into the VM with can run via sudo so I should be able to just do a sudo apt-get update && sudo apt-get install -y nginx. Only I can’t because the security rules upstream of the OpenStack server won’t let me. F**k. It’s a Saturday afternoon, and there are zero, no, zilch, none, Faculty IT help files or FAQs that have been shared with me, or that I’m even aware of the existence of, with possible workarounds. But there is Twitter, and various other Saturday working friends, which gives me a result: set up an ssh tunnel and do it via my home machine ( https://stackoverflow.com/questions/36353955/apt-get-install-via-tunnel-proxy-but-ssh-only-from-client-side ):

sudo ssh -R 8899:us.archive.ubuntu.com:80 vagrant@IP.ADDR

With that tunnel set up, inside the VM I can run sudo nano /etc/apt/apt.conf and edit in the following lines:

Acquire::http::Proxy "http://localhost:8899";
Acquire::https::Proxy "https://localhost:8899";

Then I can apt-get update, apt-get install etc inside the VM

sudo apt-get update
sudo apt-get install -y nginx

To try and pre-empt any other issues, it’s worth checking that the required folders (again) are in place (/vagrant/notebooks and /vagrant/openrefine-projects) and with the appropriate owner and group (oustudent:users) permissions:

sudo chown -R oustudent:users /vagrant

As mentioned, the current ExecStartPre in the Jupyter notebook and OpenRefine service definition files was supposed to check the folders exist, but I think the definitions need changing to incorporate something like the following:

PermissionsStartOnly=true
ExecStartPre=/bin/mkdir -p /vagrant/notebooks
ExecStartPre=/bin/chown oustudent:users /vagrant/notebooks

Right, so permissions should be sorted, and the Jupyter notebook server should be runnable against port 80 via the nginx proxy; but I need an nginx config file… If we were running notebooks as a service in the OU, this is the sort of thing I’d hope would be in an examples FAQ, battle tested in an OU context; but we don’t, so it isn’t, so I rely on other people having solved the problem and being willing to share their answer in public: https://nathan.vertile.com/blog/2017/12/07/run-jupyter-notebook-behind-a-nginx-reverse-proxy-subpath/

Unfortunately, it didn’t work for me out of the can… the post supposedly describes how to proxy the server down a path, but (jumping ahead) the login page URL didn’t rewrite down the path for me; tweaking the proxy definition so that the Jupyter notebook server runs at the top level (/) on port 80 did work though – so here’s the nginx definition file I ended up using:

sudo nano /etc/nginx/sites-available/default

and then:

location / {
  error_page 403 = @proxy_groot;

  deny 127.0.0.1;
  allow all;

  # set a webroot, if there is one
  #root /web_root;
  try_files $uri @proxy_groot;
}

location @proxy_groot {
  #rewrite /notebooks(.*) $1 break;
  proxy_read_timeout 300s;
  proxy_pass http://upstream_groot;

  # pass some extra stuff to the backend
  proxy_set_header Host $host;
  proxy_set_header X-Real-Ip $remote_addr;
  proxy_set_header X-Forwarded-For $proxy_add_x_forwarded_for;
}

location ~ /api/kernels/ {
  proxy_pass http://upstream_groot;
  proxy_set_header Host $host;

  # websocket support
  proxy_http_version 1.1;
  proxy_set_header Upgrade "websocket";
  proxy_set_header Connection "Upgrade";
  proxy_read_timeout 86400;
}

location ~ /terminals/ {
  proxy_pass http://upstream_groot;
  proxy_set_header Host $host;

  # websocket support
  proxy_http_version 1.1;
  proxy_set_header Upgrade "websocket";
  proxy_set_header Connection "Upgrade";
  proxy_read_timeout 86400;
}
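
One thing the fragment above doesn’t show is the definition of the upstream_groot upstream that the location blocks proxy to; that needs defining in the same file, alongside the server block, pointing at wherever the notebook server is actually listening (port 8888 in the default set-up, I think):

# Define the backend that the location blocks proxy to
upstream upstream_groot {
  server 127.0.0.1:8888;
}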

followed by:

sudo  nginx -s reload

To try to make the notebook server slightly more secure than wide open — it will be running on a public IP address after all — I need to add a password (the original TM351 VM runs everything wide open).

First, create a password hash:

echo -n "my cool password" | sha1sum

then edit the system service file:

sudo nano /lib/systemd/system/jupyter.service

We need to tweak the startup along the lines of:

ExecStart=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --y --log-level=WARN --no-browser --notebook-dir=/vagrant/notebooks --allow-root --NotebookApp.token='' --NotebookApp.password='sha1:WHATEVER' --allow_origin='*'

We can probably drop the --allow-root ? (Although the default notebook user can sudo some commands…)

Reload the daemon to acknowledge the service definition changes and restart the service:

sudo systemctl daemon-reload
sudo systemctl restart jupyter.service

So this seems to work: I can see Jupyter notebook and login via port 80 on the floating public IP address I assigned to the TM351VM instance. I can open a notebook, run cells, call the PostgreSQL and basic Mongo databases at least, open a terminal. What I can’t do is curl or wget or run Python requests to load data files from the internet using a notebook because of the upstream IT network security rules. This is a bit of a blocker for the course. We may be able to finesse a way round with an ssh tunnel in testing, but I don’t think we should be expecting that of our students. (Thinks: how do IT security rules / policies apply when we define activities for students that we expect them to run on their own computers?! File as: whatever… We’ll just have to do something really crappy instead for students. Or set up a best-not-tell-IT proxy on the OU network somewhere…)

The next step is – can I expose the other core teaching application in the VM: OpenRefine?

A possible blocker is that we only have one port exposed on the public internet (port 80) so we need to find a way to expose OpenRefine. Fortunately, the nbserverproxy package allows the Jupyter server to proxy services running on localhost in the VM. So I should be able to run that. But first things first:  pip installs are borked even with an ssh tunnel (open questions on Stack Overflow confirm that this is not just me…).

Okay… pip packages can be downloaded and installed from a local file, so I can download the nbserverproxy pip package on my own machine and then scp it into the running OpenStack hosted VM at /vagrant/notebooks. Then from a notebook inside the VM I can run !pip install --user ./nbserverproxy-master.zip (just to show the notebook is working properly! ;-) and enable it: ! jupyter serverextension enable --py nbserverproxy.
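
(On my own machine, that amounts to something like the following, with the address standing in for the VM’s floating IP; pip download fetches the package files so they can be copied over and installed offline:)

# On my own machine: grab the package file(s)
pip download nbserverproxy -d ./pkgs

# Copy them into the shared notebooks directory inside the VM
scp ./pkgs/* vagrant@VM.IP.ADDR.ESS:/vagrant/notebooks/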

Restart the notebook server from the VM command line and I should be able to see OpenRefine at http://MY.FLOATING.IP.ADDR/proxy/3334/ (the trailing slash is required or the styling fails, as the path to the style files is incorrectly resolved). I think that this should also be down the password protected path? i.e. if I hadn’t logged in to the notebook server, I don’t think I should be able to get this far? (NEED TO CHECK.)

One of the VM Easter Eggs, nbdime, is also visible on http://MY.FLOATING.IP.ADDR/proxy/8899/. Go team me… :-)

Grab a snapshot of the working VM in the idle hope that maybe if someone else tries to launch from that image, it will just work. Although things like the network and security rules will presumably need setting up?

For student use, I’d need a simple way / recipe to set up different/personalised ssh credentials into the VM, otherwise anyone with the public IP address could ssh in. This must be a common issue, so it’d be good to see a Faculty OpenStack FAQ suggesting what the possible options are. I guess a simple one is on starting the instance? Can we force keys into the VM when it launches? Another issue is (re)setting the password for the Jupyter notebook server so each student is assigned, or can easily set (and recover…), their own password.

Other next steps: is there something in OpenStack where I can define network settings, security rules, etc, and provide students with an easier way of deploying an TM351 instance on the Faculty OpenStack and making its public services available on the public internet? Can I do this with an OpenStack stack? If so, that would be a handy thing to have an OU OpenStack tutorial for…

This is obvs the sort of support that should be available in Faculty IT tutorials, FAQs, and God Forbid, in person if we’re running the OpenStack server as a Faculty service and trying to encourage people to use it, so that’s what I’ll probably spend my next day of miserable OpenStack hacking doing when I can motivate myself to do it: trying to figure out if and how to make things closer to one-click simple for students to launch their own TM351 VM. (In the first instance for TM351, we want students to be able to run course VMs on an OU server because they’re struggling with getting things running on their own computer; this is often highly correlated with them having poor computer skills, poor problem solving skills, and poor instruction following skills, so we’re on a hiding to nothing if we expect them to launch instances, choose flavours, create routers, create and assign floating IP addresses and set up security rules. On their own. Because I’m not going to do that tech support for them. (I am ranty typing; my keyboard is suddenly VERY LOUD. [REDACTED]))

Fragment – Running Multiple Services, such as Jupyter Notebooks and a Postgres Database, in a Single Docker Container

Over the last couple of days, I’ve been fettling the build scripts for the TM351 VM, which typically uses vagrant to build a VirtualBox VM from a set of shell scripts, so they can be used to build a single Docker container that runs all the TM351 services, specifically Jupyter notebooks, OpenRefine, PostgreSQL and MongoDB.

Docker containers are typically constructed to run a single service, with compositions of containers wired together using Docker Compose to create applications that deliver, or rely on, more than one running service. For example, in a previous post (Setting up a Containerised Desktop API server (MySQL + Apache / PHP 5) for the ergast Motor Racing Data API) I showed how to set up a couple of containers to work together, one running a MySQL database server, the other an http service that provided an API to the database.

So how to run multiple services in the same container? Docs on the Docker website suggest using supervisord to run multiple services in a single container, so here’s a fragment on how I’ve done that from my TM351 build.

To begin with, I’ve built the container up as a tiered set of containers, in a similar way to the way the stack of opinionated Jupyter notebook Docker containers are constructed:

#Define a stub to identify the images in this image stack
IMAGESTUB=psychemedia/tm361testm

# minimal
## Define a minimal container, eg a basic Linux container
## using whatever flavour of Linux we prefer
docker build --rm -t ${IMAGESTUB}-minimal-test ./minimal

# base
## The base container installs core packages
## The intention is to define a common build environment
## populated with packages likely to be common to many courses
docker build --rm --build-arg BASE=${IMAGESTUB}-minimal-test -t ${IMAGESTUB}-base-test ./base

#...

One of the things I’ve done to try to generalise the build steps is to allow the name of a base container to be used to bootstrap a new one by passing the name of the base image in via an optional variable (in the above case, --build-arg BASE=${IMAGESTUB}-minimal-test). Each Dockerfile in a build step directory uses the following construction to work out which image to use as the FROM basis:

#Set ARG values using --build-arg NAME=VALUE
#Each ARG value can also have a default value
ARG BASE=psychemedia/ou-tm351-base-test
FROM ${BASE}

Using the same approach, I have used separate build tiers for the following components:

  • jupyter base: minimal Jupyter notebook install;
  • jupyter custom: add some customisation onto a pre-existing Jupyter notebook install;
  • openrefine: add the OpenRefine application; (note, we could just use BASE=ubuntu to create this as a simple, standalone OpenRefine container);
  • postgres: create a seeded PostgreSQL database; note, this could be split into two: a base postgres tier and then a customisation that adds users, creates and seeds databases etc;
  • mongodb: add in a seeded mongo database; again, the seeding could be added as an extra tier on a minimal database tier;
  • topup: a tier to add in anything I’ve missed without having to go back to rebuild from an earlier step…

The intention behind splitting out these tiers is that we might want to have a battle hardened OU postgres tier, for example, that could be shared between different courses. Alternatively, we might want to have tiers offering customisations for specific presentations of a course, whilst reusing several other fixed tiers intended to last out the life of the course.

By the by, it can be quite handy to poke inside an image once you’ve created it to check that everything is in the right place:

#Explore inside an image by entering it with a shell command
docker run -it --entrypoint=/bin/bash psychemedia/ou-tm351-jupyter-base-test -i

Once the services are in place, I add a final layer to the container that ensures supervisord is available and set up with an appropriate supervisord.conf configuration file:

##Dockerfile
#Final tier Dockerfile
ARG BASE=psychemedia/testpieces
FROM ${BASE}

USER root
RUN apt-get update && apt-get install -y supervisor

RUN mkdir -p /openrefine_projects  && chown oustudent:100 /openrefine_projects
VOLUME /openrefine_projects

RUN mkdir -p /notebooks  && chown oustudent:100 /notebooks
VOLUME /notebooks

RUN mkdir -p /var/log/supervisor
COPY monolithic_container_supervisord.conf /etc/supervisor/conf.d/supervisord.conf

EXPOSE 3334
EXPOSE 8888

CMD ["/usr/bin/supervisord"]

The supervisord.conf file is defined as follows:

##supervisord.conf
##We can check running processes under supervisord with: supervisorctl

[supervisord]
nodaemon=true
logfile=/dev/stdout
loglevel=trace
logfile_maxbytes=0
#The HOME envt needs setting to the correct USER
#otherwise jupyter throws: [Errno 13] Permission denied: '/root/.local'
#https://github.com/jupyter/notebook/issues/1719
environment=HOME=/home/oustudent

[program:jupyternotebook]
#Note the auth is a bit ropey on this atm!
command=/usr/local/bin/jupyter notebook --port=8888 --ip=0.0.0.0 --y --log-level=WARN --no-browser --allow-root --NotebookApp.password= --NotebookApp.token=
#The directory we want to start in
#(replaces jupyter notebook parameter: --notebook-dir=/notebooks)
directory=/notebooks
autostart=true
autorestart=true
startsecs=5
user=oustudent
stdout_logfile=NONE
stderr_logfile=NONE

[program:postgresql]
command=/usr/lib/postgresql/9.5/bin/postgres -D /var/lib/postgresql/9.5/main -c config_file=/etc/postgresql/9.5/main/postgresql.conf
user=postgres
autostart=true
autorestart=true
startsecs=5

[program:mongodb]
command=/usr/bin/mongod --dbpath=/var/lib/mongodb --port=27351
user=mongodb
autostart=true
autorestart=true
startsecs=5

[program:openrefine]
command=/opt/openrefine-3.0-beta/refine -p 3334 -i 0.0.0.0 -d /vagrant/openrefine_projects
user=oustudent
autostart=true
autorestart=true
startsecs=5
stdout_logfile=NONE
stderr_logfile=NONE

One thing I need to do better is to find a way to stage the construction of the supervisord.conf file, bearing in mind that multiple tiers may relate to the same service; for example, I have a jupyter-base tier to create a minimal Jupyter notebook server and then a jupyter-base-custom tier that adds in specific customisations, such as branding and course related notebook extensions.
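
(One possible way of staging it, which I haven’t tried in this build yet, is to lean on supervisor’s include mechanism: each tier copies in a conf.d/ file for its own service, and the final tier’s supervisord.conf just includes them all. Something like:)

##supervisord.conf fragment for the final tier
[include]
files = /etc/supervisor/conf.d/*.conf

##Dockerfile fragment for, e.g., the jupyter tier
COPY jupyternotebook_supervisord.conf /etc/supervisor/conf.d/jupyternotebook.conf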

When the final container is built, the supervisord command is run and the multiple services started.

One other thing to note: we’re hoping to run TM351 environments on an internal OpenStack cluster. The current cluster only allows students to expose a single port, and port 80 at that, from the VM (IP addresses are in scant supply, and network security lockdowns are in place all over the place). The current VM exposes at least two http services: Jupyter notebooks and OpenRefine, so we need a proxy in place if we are to expose them both via a single port. Helpfully, the nbserverproxy Jupyter extension (as described in Exposing Multiple Services Via a Single http Port Using Jupyter nbserverproxy) allows us to do just that. One thing to note, though – I had to enable it via the same user that launches the notebook server in the supervisord.conf settings:

##Dockerfile fragment

RUN $PIP install nbserverproxy

USER oustudent
RUN jupyter serverextension enable --py nbserverproxy
USER root

To run the VM, I can call something like:

docker run -p 8899:8888 -d psychemedia/tm351dockermonotest

and then to access the additional services, I can browse to e.g. localhost:8899/proxy/3334/ to see the OpenRefine application.

PS in case you’re wondering why I syndicated this through RBloggers too, the same recipe will work if you’re using Jupyter notebooks with an R kernel, rather than the default IPython one.

Seven Ways of Making Use of SQLite

SQLite is a really handy file based database engine. When you put data into a database, it can make it easier to search; it also provides a great tool for looking for stories or story leads hidden in the data. So here are seven ways of getting started with SQLite.

Querying SQLite Databases Using Rich User Interfaces

Whilst SQLite comes with its own command line client, many people will find the easiest way of getting started with querying SQLite databases is to use an application, either on the desktop or accessed via a browser based UI.

franchise is a browser based UI that you can access via the cloud or run locally (code), though I think the project has stalled and I’m not sure if it still works. If you have a SQLite database file (make sure the suffix is .sql) you can upload it and explore it using the franchise application. (If you have a CSV or Excel data file, you can upload that too and it will add it to its own temporary SQLite database.) Here’s a review: Asking Questions of CSV Data, Using SQL In the Browser, With Franchise.

In the browser, you can try the official sql.js demo GUI or SQLite Viewer.

If you prefer something on the desktop, the cross-platform sqlitebrowser [repo] might suit your needs.

Another great way of making use of SQLite is to bring it alive using datasette. A single command allows you to publish an interactive, browser based user interface to one or more databases, either on your own computer or via an online host such as Zeit Now or Heroku. For example, I’ve popped up three databases I scraped together on Heroku and pointed my own URL at them (unfortunately, I tend to run out of free Heroku credits in the last week of the month, at which point the site goes down!)
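
(For example, something like the following should publish a local database file to Heroku, assuming the Heroku command line tools are installed and you’re logged in; the app name is illustrative:)

datasette publish heroku mydatabase.db -n my-datasette-demo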

datasette allows you to query the databases through a customisable webpage and supports a range of plugins. For example, the datasette-cluster-map will detect latitude and longitude columns in a datatable and present the results using an interactive map. I gave it a spin with this map of UK Food Hygiene Ratings. You can find several other examples of datasettes published in the wild on the datasette wiki.

Finding Data: SQLite Databases in the Wild

Whilst tools such as datasette are great for quickly getting started with querying a SQLite database, one obvious question that arises is: what SQLite database?

Once you start poking around, however, you can start to find examples of SQLite databases working quietly behind the scenes on your own computer. (Searching your computer for files with a .sqlite suffix is one way of finding them!)

As a specific example, the Safari, Chrome and Firefox web browsers all keep track of your browser history using a SQLite database on your computer (this gist – dropmeaword/browser_history.md – tells you where you can find the files). You can then query them using datasette. On a Mac, I can simply run:

datasette ~/Library/Application\ Support/Google/Chrome/Default/History

and I can then start to query my browser history using the datasette browser based UI. Here’s an ‘inspect your browser history’ tutorial to get you started…

Ingesting Data

Applications such as franchise allow you to upload a CSV or Excel datafile and automatically import it into a SQLite database so that it can be queried using SQL. The datasette ecosystem also includes an application for uploading CSV files and ingesting them into a SQLite database: datasette publish.

Behind the scenes of that application is a Python command line utility called csvs-to-sqlite. A simple command lets you convert a CSV file to a SQLite database:

csvs-to-sqlite myfile.csv mydatabase.db

Whilst csvs-to-sqlite focusses on the conversion of CSV files into a SQLite database, the more general sqlitebiter command line utility can convert CSV, Excel, HTML tables (eg from a URL), JSON, Jupyter Notebooks, Markdown, TSV and Google-Sheets to a SQLite database file using a similar command format.

Using SQLite from the Commandline

Natively, SQLite comes with its own command line shell that allows you to connect to and query a SQLite database from the command line.

As well as command line tools for converting data contained in various file formats into a SQLite database, several command line tools embed that conversion within a command line interface that allows you to convert a data file to an in-memory SQLite database and query it directly.

For example, using the csvsql command from csvkit:

csvsql --query "select * from iris as i join irismeta as m on (i.species = m.species)" examples/iris.csv examples/irismeta.csv

Or textql:
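
A hedged example, since the original command seems to have gone missing here and I haven’t re-run it (check the textql docs for the exact flags):

textql -header -sql "select count(*) from clicks" clicks.csv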

Or the simply named q:

q -H "SELECT COUNT(DISTINCT(uuid)) FROM ./clicks.csv"

Querying SQLite Databases from Programming Languages Such as Python and R

If you are developing your own data-centric reproducible research pipelines, it’s likely that you will be using a programming language such as R or the Python pandas library.

There are two main ways for using SQLite in this context. The first is to connect to the database from the programming language and then query it from within the language. For example, in R, you might use the RSQLite package. In Python, you can connect to a SQLite database using the base sqlite3 package.
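
(For example, a minimal Python fragment, assuming a mydatabase.db file containing a table called mytable, might look like this:)

import sqlite3

# Connect to (or create) a file based SQLite database
conn = sqlite3.connect("mydatabase.db")

# Run a query and iterate over the results
for row in conn.execute("SELECT * FROM mytable LIMIT 5"):
    print(row)

conn.close()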

The other approach is to use SQLite as an in-memory database that provides a SQL query interface to a dataframe. In R, this can be achieved using the sqldf package:

library(sqldf)
sqldf("select * from iris limit 5")

In Python/pandas, you can use the pandasql package:

from pandasql import sqldf, load_births
pysqldf = lambda q: sqldf(q, globals())
births = load_births()
print(pysqldf("SELECT * FROM births LIMIT 10;").head())

In many respects, sqldf and pandasql behave like programming language native versions of command-line utilities such as csvsql, textql and q, although rather than importing a data file into a SQLite database so that it can be queried, they import the contents of the referenced dataframe.

pandas also provides native support for adding dataframes as tables to a connected SQLite database, as well as reading results from queries onto the database back into a dataframe. Once you start getting into the swing of putting data into a database, and then running JOINed queries over multiple tables, you’ll start to wonder why you spent so much time getting hassled by VLOOKUP. As an example, here’s a way of Making a Simple Database to Act as a Lookup for the ONS Register of Geographic Codes.
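
(A minimal sketch of that round trip, with illustrative table and column names:)

import sqlite3
import pandas as pd

conn = sqlite3.connect("mydatabase.db")

# Write a dataframe into the database as a table
df = pd.DataFrame({"code": ["E92000001"], "name": ["England"]})
df.to_sql("geocodes", conn, if_exists="replace", index=False)

# Read the results of a query back into a dataframe
lookup = pd.read_sql("SELECT * FROM geocodes WHERE code = 'E92000001'", conn)
print(lookup)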

Using SQLite in Your Own Applications

If you are comfortable with using programming code to manipulate your data, then you may want to explore ways of using SQLite to create your own data driven applications.

One way to start is to use SQLite completely within the browser. Accessing desktop applications from a webpage is typically a no-no because of browser security restrictions, but SQLite is quite a light application, so it can – and has been – compiled to Javascript so that it can be imported as a Javascript library and run from within a webpage: sql.js. The aforementioned sql.js GUI and SQLite Viewer both use sql.js under the hood…

As well as running SQLite in a browser, SQLite can also be used to power an API. One of the advantages of running a datasette service is that it also exposes a datasette API. This means you can publish your datasette to a web host and other computers can then query it programmatically.

If you are working in a Python Jupyter environment, it’s simple enough to use the Jupyter kernel gateway to create your own APIs. Here’s an example of building a service to allow the lookup of ONS codes from a simple SQLite database: Building a JSON API Using Jupyter Notebooks in Under 5 Minutes.

Another way of using SQLite databases in a Jupyter environment is to use Scripted Forms to provide a simple form based UI over them. For example, here’s one of my own recipes for Creating Simple Interactive Forms Using Python + Markdown Using ScriptedForms + Jupyter that shows how to create a simple interactive form for querying a SQLite database containing descriptions of images used in OpenLearn courses.

SQLite Database Admin Tools

As well as providing a simple explorer and query interface, the sqlitebrowser tool also supports a range of other SQLite database administration functions such as the ability to create, define, modify and delete tables and indexes, or edit, add and delete individual records.

The browser based sqlite-web application provides a similar range of utilities via a browser based, rather than desktop client, UI.

Summary

SQLite is a lightweight, in-memory or file-based database that allows you to run SQL queries over a wide range of tabular datasets. If you work with data, knowing how to write even simple SQL queries can add a powerful tool to your toolbox. SQLite, and the associated tools created around it, is almost certainly one of the easiest ways in to using this most versatile, portable, and personal data power tool.

PS Here are some more ways of using SQLite:

Seeding Shared Folders With Files Distributed via a VM

For the first few presentations of our Data Management and Analysis course, the course VM has been distributed to students via a USB mailing. This year, I’m trying to move to a model whereby the primary distribution is via a download from VagrantCloud (students manage the VM using Vagrant), though we’re also hoping to be able to offer access to an OU OpenStack hosted VM to any students who really need it.

For students on Microsoft Windows computers, an installer sets up VirtualBox and Vagrant using the installers distributed via the USB memory stick. This in part derives from the policy of fixing versions of as much as we can so that it can be tested in advance. The installer also creates a working directory for the course that will be shared by the VM, and copies required files, again from the memory stick, into the shared folder. On Macs and Linux, students have to do this setup themselves.

One of the things I have consciously started trying to do is move the responsibility for satisficing some of the installation requirements into the Vagrantfile. (I’m also starting to think they should be pushed even deeper into the VM itself.)

For example, as some of the VM services expect particular directories to exist in the shared directory, we have a couple of defensive measures in place:

  • the Vagrantfile creates any required, yet missing, subdirectories in the shared directory;
            #Make sure that any required directories are created
            config.vm.provision :shell, :inline => <<-SH
                mkdir -p /vagrant/notebooks
                mkdir -p /vagrant/openrefine_projects
                mkdir -p /vagrant/logs
                mkdir -p /vagrant/data
                mkdir -p /vagrant/utilities
                mkdir -p /vagrant/backups
                mkdir -p /vagrant/backups/postgres-backup/
                mkdir -p /vagrant/backups/mongo-backup/	
            SH
    

  • start up scripts for services that require particular directories check they exist before they are started and create them if they are missing. For example, in the service file, go defensive with something like ExecStartPre=/bin/mkdir -p /vagrant/notebooks (note that systemd typically expects an absolute path to the command here).

The teaching material associated with (the contents of) the VM is distributed using a set of notebooks downloaded from the VLE. Part of the reason for this is that it delays the point at which the course notebooks must be frozen: the USB is mastered late July/early August for a mailing in September and course start in October.

As well as the course notebooks, there are a couple of informal installation test notebooks. These can be frozen along with the VM and distributed inside it, but the question then arises of how to get them into the shared folder the student actually works in. So this year I am trying out a simple pattern that bakes test files into the VM and then uses the Vagrantfile to copy the files into the shared directory on its first run with a particular shared folder:

config.vm.provision :shell, :inline => <<-SH
    if [ ! -f /vagrant/.firstrun_nbcopy.done ]; then
        # Trust notebooks in immediate child directories of notebook directory
        files=(`find /opt/notebooks/* -maxdepth 2 -name "*.ipynb"`)
        if [ ${#files[@]} -gt 0 ]; then
            jupyter trust /opt/notebooks/*.ipynb;
            jupyter trust /opt/notebooks/*/*.ipynb;
        fi
        #Copy notebooks into shared directory
        cp -r /opt/notebooks/. /vagrant/notebooks
        touch /vagrant/.firstrun_nbcopy.done
    fi
   SH

This pattern allows files shipped inside the VM to be copied into the shared folder once it is mounted into the VM from the host. The files will then persist inside the shared directory, along with a hidden flag file to say the files have been copied. I’m not sure whether it would be better to auto-run something inside the VM to manage this copying, or whether to check that a more recent copy of the files doesn’t already exist in the shared folder before copying them over on the first run in that folder.

Fragment – TM351 Services Architected for Online Access

When we put together the original TM351 VM, we wanted a single, self-contained installable environment capable of running all the services required to complete the practical activities defined for the course. We also had a vision that the services should be capable of being accessed remotely.

With a bit of luck, we’ll have access to an OU OpenStack environment sometime soon that will let us start experimenting with a remote / online VM provision, at least for a controlled number of students. But if we knew that a particular cohort of students were only ever going to access the services remotely, would a VM be the best solution?

For example, the services we run are:

  • Jupyter notebooks
  • OpenRefine
  • PostgreSQL
  • MongoDB

Jupyter notebooks could be served via a single JupyterHub instance, albeit with persistence enabled on individual accounts so students could save their own notebooks.

Access to PostgreSQL could be provided via a single Postgres DB with students logging in under their own accounts and accessing their own schema.
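
By way of a sketch of how the per-student Postgres side might work (psycopg2, with invented role, schema and password; not something we’ve actually implemented), each student would get a login role and a schema owned by that role, with their search_path defaulting to it:

import psycopg2

# Connect as an admin user to a hypothetical shared Postgres instance
conn = psycopg2.connect("dbname=tm351 user=postgres host=localhost")
conn.autocommit = True

with conn.cursor() as cur:
    # One login role and one schema per student; names and password invented
    cur.execute("CREATE ROLE student1 LOGIN PASSWORD %s", ("not-a-real-password",))
    cur.execute("CREATE SCHEMA AUTHORIZATION student1")
    # Default the student's search_path to their own schema
    cur.execute("ALTER ROLE student1 SET search_path TO student1")

conn.close()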

Similarly – presumably? – for MongoDB (individual user accounts accessing individual databases). We might need to think about something different for the sharded Mongo activity, such as a containerised solution (which could also provide an opportunity to bring in the network partitioning activity I started to sketch out way back when).

OpenRefine would require some sort of machinery to fire up an OpenRefine container on demand, perhaps with a linked persistent data volume. It would be nice if we could use Binderhub for that, or perhaps DIT4C style infrastructure…

Sharing Folders into VMs on Different Machines Using Dropbox, Google Drive, Microsoft OneDrive etc

Ever since I joined the OU, I’ve believed in trying to deliver distance education courses in an agile and responsive way, which is to say: making stuff up for students whilst the course is in presentation.

This is generally not done (by course/module teams at least) because the aim of most course/module teams is to prepare the course so thoroughly that it can “just” be presented to students.

Whatever.

I personally think we should try to improve the student experience of the course as it presents if we can by being responsive and reactive to student questions and issues.

So… TM351, the data management course that uses a VM, has started again, and issues / questions are already starting to hit the forums.

One of the questions – which I’d half noted but never really thought through in previous presentations (my not iterating/improving the course experience in, or between, previous presentations)  – related to sharing Jupyter notebooks across different machines using Google Drive (equally, Dropbox or Microsoft OneDrive).

The VirtualBox VM we use is fired up using the vagrant provisioner. A Vagrantfile defines various configuration settings – which ports are exposed by the VM, for example. By default, the contents of the folder in which vagrant is started up are shared into the VM. At the same time, vagrant creates a hidden .vagrant folder that contains state relating to the instance of that VM.

The set up on a single machine is something like this:

If a student wants to work across several machines, they need to share their working course files (Jupyter notebooks, and so on) but not the VM machine state. Which is to say, they need a set up more like the following:

For students working across several machines, it thus makes sense to have all project files in one folder and a separate .vagrant settings folder on each separate machine.

Checking the vagrant docs, it seems as if this is quite manageable using the synced folder configuration settings.

The default shares the current project folder (containing the Vagrantfile, and from which vagrant is run), which I’m guessing is a setting something like:

config.vm.synced_folder "./", "/vagrant"

By explicitly setting this parameter, we can decide how we want the mapping to occur. For example:

config.vm.synced_folder "/PATH/ON/HOST", "/vagrant"

allows you to specify the folder you want to share into the VM. Note that the /PATH/ON/HOST folder needs to be created before trying to share it.

To put the new shared directory into effect, reload and reprovision the VM. For example:

vagrant reload --provision

Student notebooks located in the notebooks folder of that shared directory should now be available in the VM. Furthermore, if the shared folder is itself inside a cloud-synced folder (for example, a Dropbox, Google Drive or Microsoft OneDrive folder) it should be available wherever that folder is synced to.

For example, on a Mac (where ~ is an alias to my home directory), I can create a directory in my Dropbox folder, ~/Dropbox/TM351VMshare, and then map this into the VM by adding the following line to the Vagrantfile:

config.vm.synced_folder "~/Dropbox/TM351VMshare", "/vagrant"

Note the possibility of slight confusion – the shared folder will not now be the folder from which vagrant is run (unless the folder you are running from is /PATH/ON/HOST).

Furthermore, the only things that need to be in the folder from which vagrant is run are the Vagrantfile and the hidden .vagrant folder that vagrant creates.

Fingers crossed this recipe works…;-)

First Attempt at Running the TM351 VM as an AMI on Amazon Web Services

One of the things that’s been on my to do list for ages is trying to get a version of the TM351 virtual machine (VM) up and running on Amazon Web Services (AWS) as an Amazon Machine Instance (AMI). This would allow students who are having trouble running the VM on their own computer to access the services running in the cloud.

(Obviously, it would be preferable if we could offer such a service via OU operated servers, but I can’t do politics well enough, and don’t have the mentality to attend enough of the necessary say-the-same-thing-again-again meetings, to make that sort of thing happen.)

So… a first attempt is up on the eu-west-1 region in all its insecure glory: TM351 AMI v1. The security model is by obscurity as much as anything – there’s no model for setting separate passwords for separate students, for example, or checking back against an OU auth layer. And I suspect everything runs as root…

(One of the things we have noticed in (brief) testing is that the Getting Started instructions don’t work inside the OU, at least if you try to limit access to your (supposed) IP address. Reminds me of when we gave up trying to build the OU VM from machines on the OU network because solving proxy and blocked port issues was an irrelevant problem to have to worry about when working from the outside…)

Open Refine doesn’t seem to want to run with the other services in the free tier micro (1GB) machine instance, but at 2GB everything seems okay. (I don’t know if possible race conditions in starting services mean that Open Refine could start and then block the Jupyter service’s request for resources. I need to do an Apollo 13 style startup sequence exploration to see if all services can run in 1GB, I guess!) One thing I’ve added to the to do list is to split things out into separate AMIs that will work on the 1GB free tier machines. I also want to check that I can provision the AMI from Vagrant, so students could then launch a local VM or an Amazon instance that way, just by changing the vagrant provider. (Shared folders/volumes might get a bit messed up in that case, though?)

If services can run one at a time in the 1GB machines, it’d be nice to provide a simple dashboard to start and stop the services to make that easier to manage. Something that looks a bit like this, for example, exposed via an authenticated web page:

This needn’t be too complex – I had in mind a simple Python web app that could run under nginx (which currently provides a simple authentication layer for Open Refine to sit behind) and then just runs simple systemctl start, stop and restart commands on the appropriate service.

#fragment...
import os

# e.g. restart the Jupyter notebook service
os.system('systemctl restart jupyter.service')
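
By way of illustration, here’s a minimal sketch of the sort of thing I mean using Flask and subprocess (service names, routes and port are invented for the example; in practice it would need to sit behind the nginx auth layer and run with enough privilege to call systemctl):

import subprocess

from flask import Flask, redirect

app = Flask(__name__)

# Whitelisted services and actions the dashboard is allowed to manage
# (service names invented for the example)
SERVICES = ["jupyter", "openrefine", "postgresql", "mongod"]
ACTIONS = ["start", "stop", "restart"]


@app.route("/")
def index():
    # Report each service's state using systemctl's is-active subcommand
    items = []
    for s in SERVICES:
        state = subprocess.run(
            ["systemctl", "is-active", s], capture_output=True, text=True
        ).stdout.strip()
        items.append(
            f'<li>{s}: {state} '
            f'<a href="/{s}/start">start</a> '
            f'<a href="/{s}/stop">stop</a> '
            f'<a href="/{s}/restart">restart</a></li>'
        )
    return "<ul>" + "".join(items) + "</ul>"


@app.route("/<service>/<action>")
def control(service, action):
    # Only act on whitelisted service/action pairs
    if service in SERVICES and action in ACTIONS:
        subprocess.run(["systemctl", action, service])
    return redirect("/")


if __name__ == "__main__":
    # Needs to run with enough privilege to call systemctl
    app.run(port=8899)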

I’m not sure how the status display should be updated (based on whether a service is actually running or not), or how frequently it should be refreshed. There may be better ways, of course, in which case please let me know via the comments:-)

I did have a quick look round for examples, but the dashboards/monitoring tools that do exist, such as pydash, are far more elaborate than what I had in mind. (If you know of a simple example to do the above, or can knock one up for me, please let me know via the comments. And the simpler the better ;-)

If we are to start exploring the use of browser accessed applications running inside user-managed VMs, this sort of simple application could be really handy… (Another approach would be to use a VM running docker, and then have a container manager running, such as portainer.)