Tagged: TM351

Seeding Shared Folders With Files Distributed via a VM

For the first few presentations of our Data Management and Analysis course, the course VM has been distributed to students via a USB mailing. This year, I’m trying to move to a model whereby the primary distribution is via a download from VagrantCloud (students manage the VM using Vagrant), though we’re also hoping to be able to offer access to an OU OpenStack hosted VM to any student’s who really need it.

For students on Microsoft Windows computers, an installer installs Virtualbox and vagrant from installers distributed via the USB memory stick. This in part derives from the policy of fixing versions of as much as we can so that it can be tested in advance. The installer also creates a working directory for the course that will be shared by the VM, and copies required files, again from the memory stick, into the shared folder. On Macs and Linux, students have to do this setup themselves.

One of the things I have consciouslystarted trying to do is move the responsibility for satisficing of some of the installation requirements into the Vagrantfile. (I’m also starting to think they should be pushed even deeper into the VM itself.)

For example, as some of the VM services expect particular directories to exist in the shared directory, we have a couple of defensive measures in place:

  • the Vagrantfile creates any required, yet missing, subdirectories in the shared directory;
            #Make sure that any required directories are created
            config.vm.provision :shell, :inline => <<-SH
                mkdir -p /vagrant/notebooks
                mkdir -p /vagrant/openrefine_projects
                mkdir -p /vagrant/logs
                mkdir -p /vagrant/data
                mkdir -p /vagrant/utilities
                mkdir -p /vagrant/backups
                mkdir -p /vagrant/backups/postgres-backup/
                mkdir -p /vagrant/backups/mongo-backup/	
            SH
    

  • start up scripts for services that require particular directories check they exist before they are started and create them if they are missing. For example, in the service file, go defensive with something like ExecStartPre=mkdir -p /vagrant/notebooks.

The teaching material associated with the (contents of) the VM is distributed using a set of notebooks downloaded from the VLE. Part of the reason for this is that it delays the point at which the course notebooks must be frozen: the USB is mastered late July/early August for a mailing in September and course start in October.

As well as the course notebooks are a couple of informal installation test notebooks. This can be frozen along with the VM and distributed inside it, but the question then arises. So this year I am trying out a simple pattern that bakes test files into the VM and then uses the Vagranfile to copy the files into the shared directory on its first run with a particular shared folder:

config.vm.provision :shell, :inline => <<-SH
    if [ ! -f /vagrant/.firstrun_nbcopy.done ]; then
        # Trust notebooks in immediate child directories of notebook directory
        files=(`find /opt/notebooks/* -maxdepth 2 -name "*.ipynb"`)
        if [ ${#files[@]} -gt 0 ]; then
            jupyter trust /opt/notebooks/*.ipynb;
            jupyter trust /opt/notebooks/*/*.ipynb;
        fi
        #Copy notebooks into shared directory
        cp -r /opt/notebooks/. /vagrant/notebooks
        touch /vagrant/.firstrun_nbcopy.done
    fi
   SH

This pattern allows files shipped inside the VM to be copied into the shared folder once it is mounted into the VM from host. The files will then persist inside the shared directory, along with a hidden flag file to say the files have been copied. I’m not sure about the benefits of auto-running something inside the VM to manage this copying? Or whether to check that a more recent copy of the files to be copied doesn’t already exist in the shared folder before copying on the first run in the folder?

Fragment – TM351 Services Architected for Online Access

When we put together the original  TM351 VM, we wanted a single, self-contained installable environment capable of running all the services required to complete the practical activities defined for the course. We also had a vision that the services should be capable of being accessed remotely.

With a bit of luck, we’ll have access to an OU OpenStack environment any time soon that will let us start experimenting with a remote / online VM provision, at least for a controlled number of students. But if we knew that a particular cohort of students were only ever going to access the services remotely, would a VM be the best solution?

For example, the services we run are:

  • Jupyter notebooks
  • OpenRefine
  • PostgreSQL
  • MongoDB

Jupyter notebooks could be served via a single Jupyter Hub instance, albeit with persistence enable on individual accounts so students could save their own notebooks.

Access to PostgreSQL could be provided via a single Postgres DB with students logging in under their own accounts and accessing their own schema.

Similarly – presumably? – for MongoDB (individual user accounts accessing individual databases). We might need to think about something different for the sharded Mongo activity, such as a containerised solution (which could also provide an opportunity to bring the network partitioning activity I started to sketch out way back when).

OpenRefine would require some sort of machinery to fire up an OpenRefine container on demand, perhaps with a linked persistent data volume. It would be nice if we could use Binderhub for that, or perhaps DIT4C style infrastructure…

Sharing Folders into VMs on Different Machines Using Dropbox, Google Drive, Microsoft OneDrive etc

Ever since I joined the OU, I’ve believed in trying to deliver distance education courses in an agile and responsive way, which is to say: making stuff up for students whilst the course is in presentation.

This is generally not done (by course/module teams at least) because the aim of most course/module teams is to prepare the course so thoroughly that it can “just” be presented to students.

Whatever.

I personally think we should try to improve the student experience of the course as it presents if we can by being responsive and reactive to student questions and issues.

So… TM351, the data management course that uses a VM, has started again, and issues / questions are already starting to hit the forums.

One of the questions – which I’d half noted but never really thought through in previous presentations (my not iterating/improving the course experience in, or between, previous presentations)  – related to sharing Jupyter notebooks across different machines using Google Drive (equally, Dropbox or Microsoft OneDrive).

The VirtualBox VM we use is fired up using the vagrant provisioner. A Vagrantfile defines various configuration settings – which ports are exposed by the VM, for example. By default, the contents of the folder in which vagrant is started up in are shared into the VM. At the same time, vagrant creates a hidden .vagrant folder that contains state relating to the instance of that VM.

The set up on a single machine is something like this:

If a student wants to work across several machines, they need to share their working course files (Jupyter notebooks, and so on) but not the VM machine state. Which is to say, they need a set up more like the following:

For students working across several machines, it thus makes sense to have all project files in one folder and a separate .vagrant settings folder on each separate machine.

Checking the vagrant docs, it seems as if this is quite manageable using the synced folder configuration settings.

The default copies the current project folder (containing the vagrantfile and from which vagrant is rum), which I’m guessing is a setting something like:

config.vm.synced_folder "./", "/vagrant"

By explicitly setting this parameter, we can decide how we want the mapping to occur. For example:

config.vm.synced_folder "/PATH/ON/HOST", "/vagrant"

allows you to to specify the folder you want to share into the VM. Note that the /PATH/ON/HOST folder needs to be created before trying to share it.

To put the new shared directory into effect, reload and reprovision the VM. For example:

vagrant reload --provision

Student notebooks located in the notebooks folder of that shared directory should now be available in the VM. Furthermore, if the shared folder is itself in a webshared folder (for example, a synced Dropbox, Google Drive or Microsoft OneDrive folder) it should be available wherever that folder is synched to.

For example, on a Mac (where ~ is an alias to my home directory), I can create a directory in my dropbox folder ~/Dropbox/TM351VMshare and then map this into the VM using by adding the following line to the Vagrantfile:

config.vm.synced_folder "~/Dropbox/TM351VMshare", "/vagrant"

Note the possibility of slight confusion – the shared folder will not now be the folder from which vagrant is run (unless the folder are running from is /PATH/ON/HOST ).

Furthermore, the only thing that needs to be in the folder from which vagrant is run is the Vagrantfile and the hidden .vagrant folder that vagrant creates.

Fingers crossed this recipe works…;-)

First Attempt at Running the TM351 VM as an AMI on Amazon Web Services

One of the things that’s been on my to do list for ages is trying to get a version of the TM351 virtual machine (VM) up and running on Amazon Web Services (AWS) as an Amazon Machine Instance (AMI). This would allow students who are having trouble running the VM on their own computer to access the services running in the cloud.

(Obviously, it would be preferable if we could offer such a service via OU operated servers, but I can’t do politics well enough, and don’t have the mentality to attend enough of the necessary say-the-same-thing-again-again meetings, to make that sort of thing happen.)

So… a first attempt is up on the eu-west-1 region in all its insecure glory: TM351 AMI v1. The security model is by obscurity as much as anything – there’s no model for setting separate passwords for separate students, for example, or checking back agains an OU auth layer. And I suspect everything runs as root…

(One of the things we have noticed in (brief) testing is that the Getting Started instructions don’t work inside the OU, at least if you try to limit access to your (supposed) IP address. Reminds of when we gave up trying to build the OU VM from machines on the OU network because solving proxy and blocked port issues was an irrelevant problem to have to worry about when working from the outside…)

Open Refine doesn’t seem to want to run with the other services in the free tier micro (1GB) machine instance, but at 2GB everything seems okay. (I don’t know if possible race conditions in starting services means that Open Refine could start and then block the Jupyter service’s request for resource.  I need to do an Apollo 13 style startup sequence exploration to see if all services can run in 1GB, I guess!) One thing I’ve added to the to do list is to split things out so into separate AMIs that will work on the 1GB free tier machines. I also want to check that I can provision the AMI from Vagrant, so students could then launch a local VM or an Amazon Instance that way, just by changing the vagrant provider. (Shared folders/volumes might get a bit messed up in that case, though?)

If services can run one at a time in the 1GB machines, it’d be nice to provide a simple dashboard to start and stop the services to make that easier to manage. Something that looks a bit like this, for example, exposed via an authenticated web page:

This needn’t be too complex – I had in mind a simple Python web app that could run under nginx (which currently provides a simple authentication layer for Open Refine to sit behind) and then just runs simple systemctl start, stop and restart commands on the appropriate service.

#fragment...
import os
os.system('systemctl restart jupyter.service')

I’m not sure how the status should be updated (based on whether a service is running or not) or what heartbeat it should update to. There may be better ways, of course, in which case please let me know via the comments:-)

I did have a quick look round for examples, but the dashboards/monitoring tools that do exist, such as pydash, are far more elaborate than what I had in mind. (If you know of a simple example to do the above, or can knock one up for me, please let me know via the comments. And the simpler the better ;-)

If we are to start exploring the use of browser accessed applications running inside user-managed VMs, this sort of simple application could be really handy… (Another approach would be to use a VM running docker, and then have a container manager running, such as portainer.)

Creating a Jupyter Bundler Extension to Download Zipped Notebook and HTML Files

In the first version of the TM351 VM, we had a simple toolbar extension that would download a zipped ipynb file, along with an HTML version of the notebook, so it could be uploaded and previewed in the OU Open Design Studio. (Yes, I know, it would have been much better to have an nbviewer handler as an ODS plugin, but the we don’t do that sort of tech innovation, apparently…)

Looking at updating the extension today for the latest version of Jupyter notebooks, I noticed the availability of custom bundler extensions that allow you to add additional tools to support notebook downloads and deployment (I’m not sure what deployment relates to?). Adding a new download option allows it to be added to the notebook Edit -&gt; Download menu:

The extension is created as a python package:

# odszip/setup.py
from setuptools import setup

setup(name='odszip',
      version='0.0.1',
      description='Save Jupyter notebook and HTML in zip file with .nbk suffix',
      author='',
      author_email='',
      license='MIT',
      packages=['odszip'],
      zip_safe=False)
#odszip/odszip/download.py

# Copyright (c) The Open University, 2017
# Copyright (c) Jupyter Development Team.

# Distributed under the terms of the Modified BSD License.
# Based on: https://github.com/jupyter-incubator/dashboards_bundlers/

import os
import shutil
import tempfile

#THIS IS A REQUIRED FUNCTION
def _jupyter_bundlerextension_paths():
    '''API for notebook bundler installation on notebook 5.0+'''
    return [{
                'name': 'odszip_download',
                'label': 'ODSzip (.nbk)',
                'module_name': 'odszip.download',
                'group': 'download'
            }]


def make_download_bundle(abs_nb_path, staging_dir, tools):
	'''
	Assembles the notebook and resources it needs, returning the path to a
	zip file bundling the notebook and its requirements if there are any,
	the notebook's path otherwise.
	:param abs_nb_path: The path to the notebook
	:param staging_dir: Temporary work directory, created and removed by the caller
	'''
    
	# Clean up bundle dir if it exists
	shutil.rmtree(staging_dir, True)
	os.makedirs(staging_dir)
	
	# Get name of notebook from filename
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	
	# Add the notebook
	shutil.copy2(abs_nb_path, os.path.join(staging_dir, notebook_basename))
	
	# Include HTML version of file
	cmd='jupyter nbconvert --to html "{abs_nb_path}" --output-dir "{staging_dir}"'.format(abs_nb_path=abs_nb_path,staging_dir=staging_dir)
	os.system(cmd)

	zip_file = shutil.make_archive(staging_dir, format='zip', root_dir=staging_dir, base_dir='.')
	return zip_file

#THIS IS A REQUIRED FUNCTION       
def bundle(handler, model):
	'''
	Downloads a notebook as an HTML file and zips it with the notebook
	'''
	
	# Based on https://github.com/jupyter-incubator/dashboards_bundlers
	
	abs_nb_path = os.path.join(
		handler.settings['contents_manager'].root_dir,
		model['path']
	)
		
	notebook_basename = os.path.basename(abs_nb_path)
	notebook_name = os.path.splitext(notebook_basename)[0]
	
	tmp_dir = tempfile.mkdtemp()

	output_dir = os.path.join(tmp_dir, notebook_name)
	bundle_path = make_download_bundle(abs_nb_path, output_dir, handler.tools)
		
	handler.set_header('Content-Disposition', 'attachment; filename="%s"' % (notebook_name + '.nbk'))
	
	handler.set_header('Content-Type', 'application/zip')
	
	with open(bundle_path, 'rb') as bundle_file:
		handler.write(bundle_file.read())

	handler.finish()


	# We read and send synchronously, so we can clean up safely after finish
	shutil.rmtree(tmp_dir, True)

We can then create the python package and install the extension, remmebering to restart the Jupyter server for the extension to take effect.

#Install the ODSzip extension package
pip3 install --upgrade --force-reinstall ./odszip

#Enable the ODSzip extension
jupyter bundlerextension enable --py odszip.download  --sys-prefix

Getting Web Services Up and Running on MicroSoft Azure Using Vagrant and the Azure CLI

As well as Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI, we can also use Vagrant to provision machines on other web hosts, such as the Microsoft Azure cloud paltform. In this post, I’ll describe a command line based recipe for doing just that.

To start with, you’ll need to get a Microsoft Azure account.

When you’ve done that, install the Azure command line interface (CLI). On a Mac:

curl -L https://aka.ms/InstallAzureCli | bash

For me, this installed to ~/bin/az.

With the client installed, login:

~/bin/az login

This requires a token based handshake with a Microsoft authentication website.

List the range of machine images available (if you haven’t set the path to az, use the full ~/bin/az):

az vm image list

There was only one that looked suitable to me for my purposes: Canonical:UbuntuServer:16.04-LTS:latest.

To run the provisioner, we need a Subscription ID; this will be used to set the vagrant .subscription_id parameter. These are listed on the Azure Portal.

We also need to create an Active Directory Service Principal:

az ad sp create-for-rbac

This information will be used to configure the Vagrantfile: the appId sets the vagrant .client_id, the password the .client_secret, and the tenant the .tenant_id.

You can also inspect the application in the App Registrations area of the Azure Portal.

Now let’s set up Vagrant. We’ll use the vagrant-azure plugin:

vagrant plugin install vagrant-azure --plugin-version '2.0.0.pre6'

We need to add a dummy box:

vagrant box add azure https://github.com/azure/vagrant-azure/raw/v2.0/dummy.box

Now let’s set up the Vagrantfile:

config.vm.provider :azure do |azure, override|
    #The path to your ssh keys
    override.ssh.private_key_path = '~/.ssh/id_rsa'

    #The default box we added
    override.vm.box = 'azure'
    
    #Set a territory
    azure.location="uksouth"

    #Provide your own group and VM name
    azure.resource_group_name="tm351azuretest"
    azure.vm_name="tm351azurevmtest"

    # Set an appropriate image (the UbuntuServer is actually the current default value)
    azure.vm_image_urn="Canonical:UbuntuServer:16.04-LTS:latest"

    #Use a valid subscription ID
    #https://portal.azure.com/#blade/HubsExtension/MyAccessBlade/resourceId/
    azure.subscription_id = ENV['AZURE_SUBSCRIPTION_ID']

    # Using details from the Active Directory Service Principal setup
    azure.tenant_id = ENV['AZURE_TENANT_ID']
    azure.client_id = ENV['AZURE_CLIENT_ID']
    azure.client_secret = ENV['AZURE_CLIENT_SECRET']

end

With the Vagrantfile parameters in place, we should then be able call the Azure provider using the command:

vagrant up --provider=azure

But we’re still not quite done… If you’re running services on the VM, populated from elsewhere in the Vagranfile, you’ll need to add some security rules to make the ports accessible. I’m running services on ports 80,35180 and 35181 for example:

az vm open-port -g tm351azuretest -n tm351azuretest --port 80 --priority 130
az vm open-port -g tm351azuretest -n tm351azuretest --port 35180 --priority 140
az vm open-port -g tm351azuretest -n tm351azuretest --port 35181 --priority 150

Now we can lookup the IP address of the server:

az vm list-ip-addresses

and see if our applications are there :-)

Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI

From past experience of trying to get things up and running with AWS (Amazon Web Services), it can be a bit of a faff trying to work out what to set where the first time. So here’s an example of how to get a browser based application up and running on EC2 using vagrant from the command line.

(If you want to work through sorting the settings out via the AWS online management console, try Oliver Veits’ tutorial AWS Automation based on Vagrant — Part 2: Installation and Usage of the Vagrant AWS Plugin; you might also need to refer to Part 1: Getting started with AWS.)

This post in part assumes you know how to provision your own virtual machine locally using Vagrant. Here are the steps you need to take to be able to run an AWS provisioner (on a Mac or Linux machine… not sure about Windows?).

First up – sign up for AWS (get credit via the Github Education Pack)…

Pick up some credentials via AWS root Security Credentials (Access Keys (Access Key ID and Secret Access Key)):

Ensure that the key is active (Make Active).

There’s quite a bit of set up to do to configure the provisioner script. This can be done on the command line using the Amazon Command Line Interface (AWS CLI):

pip install --upgrade --user awscli

Now you need to configure the AWS CLI:

aws configure

Use the security credentials you picked up to configure the client*.

When we launch the AWS machine, vagrant needs to be able to access it via ssh using the public IP address automatically assigned to the machine. In deployment too, if we’re building specific services we want to be able to access over the web, we need to open up access to the ports those services are listening on.

By default, the machine will be locked down, so we need to open up specific ports by setting security rules. These are assigned on the basis of a security group. So lets create one of those (mine is named after the course VM I’m building…):

aws ec2 create-security-group --group-name tm351cloud --description "Security group for tm351 services"

We’re going to use this group in the .security_groups parameter in the Vagrantfile.

Now we need to create the security group rules. In my case, I want to open up ssh (port 22) to allow incoming traffic from my IP address, and ports 80, 35180 and 35181 to allow http traffic from anywhere. (The /0 suffix in the rules allows any IP format.)

MYIP=$(curl http://checkip.amazonaws.com/)
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 22 --cidr ${MYIP}/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 80 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35180 --cidr 0.0.0.0/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35181 --cidr 0.0.0.0/0

# Check the policies
aws ec2 describe-security-groups --group-names tm351cloud

Having opened up at least the ssh port 22, we need to set up some SSH keys with a particular name (vagrantaws) that we will use with the vagrant .keypair_name parameter, and save them to a local file (vagrantaws.pem) with the appropriate permissions.

aws ec2 create-key-pair --key-name vagrantaws --query 'KeyMaterial' --output text > vagrantaws.pem
chmod 400 vagrantaws.pem

The vagrant provisioner also requires specific access tokens (.access_key_id, .secret_access_key, .session_token) to access EC2. Create these tokens, entering your own duration (in seconds):

aws sts get-session-token --duration-seconds 129600

Now we can start to look at the Vagrant set up. Install the vagrant AWS provisioner:

vagrant plugin install vagrant-aws

After setting up the Vagrantfile, you will be able to provision your machine on AWS using:

vagrant up --provider=aws

Add a dummy box:

vagrant box add awsdummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box

Now let’s look at the Vagrantfile:

#Set up the provider block
config.vm.provider :aws do |aws, override|

    #Use the ec2 security group set previously
    aws.security_groups = ["tm351cloud"]

    #Whatever name we want
    override.vm.hostname = "tm351aws"

    #The name of the dummy box we added
    override.vm.box = "awsdummy"
    
    #Set up machine access using our keypair name and ssh key path
    override.ssh.username = "ubuntu"
    aws.keypair_name="vagrantaws"
    override.ssh.private_key_path = "vagrantaws.pem"

    #Use the values generated by the session token generator
    aws.access_key_id = "YOUR_KEY_ID"
    aws.secret_access_key = "YOUR_SECRET_ACCESS_KEY"
    aws.session_token = "YOUR_SESSION_TOKEN"

    #Specify a region and valid ami for that region, along with the desired instance size
    aws.region = "eu-west-1"
    aws.ami = "ami-971238f1" 
    aws.instance_type="t2.small"

  end

Running vagrant up --provider=aws should run the Vagrant provisioner with the AWS provider. Running vagrant destroy will tear down the machine (so you don’t keep paying for it… I think the users, security groups and keypairs are free?)

To check on the IP address of your instance, run:

aws ec2 describe-instances

or check on the AWS EC2 console. You can also check the machine is ripped down correctly when you have finished with it from there.

(I need to check what happens if you vagrant suspend and then vagrant resume. Presumably, the state is preserved, but you are billed for storage, if not running time?)


*Alternatively, we could create a specific user with more limited credentials.

Create a user we can use to help set up the credentials to use with the vagrant provisioner:

aws iam create-user --user-name vagrant

Now we need to give that user permissions to build our EC2 instance, by attaching an appropriate security policy (AmazonEC2FullAccess). In other words, the Vagrantfile will make use of the AWS vagrant user to provision the machine, so we need to give that AWS user the appropraite permissions on AWS:

aws iam attach-user-policy --user-name vagrant --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

Generate some keys:

aws iam create-access-key --user-name vagrant

Then run aws configure with the new keys.