Programming, Coding & Digital Skills

I keep hearing myself in meetings talking about the “need” to get people coding, but that’s not really what I mean, and it immediately puts people off because I’m not sure they know what programming/coding is or what it’s useful for.

So here’s an example of the sort of thing I regularly do, pretty much naturally – automating simple tasks, a line or two at a time.

The problem was generating some data files containing weather data for several airports. I’d already got a pattern for the URL for the data file, now I just needed to find some airport codes (for airports in the capital cities of the BRICS countries) and grab the data into a separate file for each [code]:

In other words – figuring out what steps I need to do to solve a problem, then writing a line of code to do each step – often separately – looking at the output to check it’s what I expect, then using it as the input to the next step. (As you get more confident, you can start to bundle several lines together.)

The print statements are a bit overkill – I added them as commentary…

On its own, each line of code is quite simple. There are lots of high level packages out there to make powerful things happen with a single command. And there are lots of high level data representations that make it easier to work with particular things. pandas dataframes, for example, allow you to work natually the contents of a CSV data file or an Excel spreadsheet. And if you need to work with maps, there are packages to help with those too. (So for example, as an afterthought I added a quick example to the notebook showing how to add markers for the airports to a map… (I’m not sure if the map will render in the embed or the gist?) That code represents a recipe that can be copied and pasted and used with other datasets more or less directly.

So when folk talk about programming and coding, I’m not sure what they mean by it. The way we teach it in computing departments sucks, because it doesn’t represent the sort of use case above: using a line of code at a time, each one a possible timesaver, to do something useful. Each line of code is a self-made tool to do a particular task.

Enterprise software development has different constraints to the above, of course, and more formalised methods for developing and deploying code. But the number of people who could make use of code – doing the sorts of things demonstrated as per the example above – is far larger than than the number of developers we’ll ever need. (If more folk could build their own single line tools, or work through tasks a line of a code at a time, we may not need so many developers?)

So when it comes to talk of developing “digital skills” at scale, I think of the above example as being at the level we should be aspiring to. Scripting, rather then developer coding/programming (h/t @RossMackenzie for being the first to comment back with that mention). Because it’s in the reach of many people, and it allows them to start putting together their own single line code apps from the start, as well as developing more complex recipes, a line of code at a time.

And one of the reasons folk can become productive is because there are lots of helpful packages and examples of cribbable code out there. (Often, just one or two lines of code will fix the problem you can’t solve for yourself.)

Real programmers don’t write a million lines of code at a time – they often write a functional block – which may be just a line or a placeholder function – one block at a time. And whilst these single lines of code or simple blocks may combine to create a recipe that requires lots of steps, these are often organised in higher level functional blocks – which are themselves single steps at a higher level of abstraction. (How does the joke go? Recipe for world domination: step 1 – invade Poland etc.)

The problem solving process then becomes one of both top-down and bottom up: what do I want to do, what are the high-level steps that would help me achieve that, within each of those: can I code it as a single line, or do I need to break the problem into smaller steps?

Knowing some of the libraries that exist out there can help in this problem solving / decomposing the problem process. For example, to get Excel data into a data structure, I don’t need to know how to open a file, read in a million lines of XML, parse the XML, figure out how to represent that as a data structure, etc. I use the pandas.read_excel() function and pass it a filename.

If we want to start developing digital skills at scale, we need to get the initiatives out of the computing departments and into the technology departments, and science departments, and engineering departments, and humanities departments, and social science departments…

Google Admits Its Results Aren’t Facts And Uses Third Party Plugins to Keep You Informed Of That?



These fact checks are not Google’s and are presented so people can make more informed judgements. Even though differing conclusions may be presented, we think it’s still helpful for people to understand the degree of consensus around a particular claim and have clear information on which sources agree.

It seems that (my emphasis):

For publishers to be included in this feature, they must be using the ClaimReview markup on the specific pages where they fact check public statements … . Only publishers that are algorithmically determined to be an authoritative source of information will qualify for inclusion.

Two things:

  • it was “the algorithms” wot dun it originally; now there’s another “algorithm” to make it better… So that’s all right then. What can possibly go wrong?
  • remember when you absolutely had to put third party anti-virus applications onto your computer because the systems were so insecure? Isn’t that what Google’s resorting to? Third party help to flag that your machine (the Google results listing) may be infected.

Also bear in mind: Google isn’t a publisher, isn’t a broadcaster, has no editorial control (as Ian Knopke pointed out via the Twitterz, they do have editorial control. Okay.. but the way they apply it and justify that application is intended to keep them away from being recognised as a publisher in the way that news media organisations, or me as a blogger, are publishers…)


(You do know Google owns YouTube, right…?)

Getting Web Services Up and Running on MicroSoft Azure Using Vagrant and the Azure CLI

As well as Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI, we can also use Vagrant to provision machines on other web hosts, such as the Microsoft Azure cloud paltform. In this post, I’ll describe a command line based recipe for doing just that.

To start with, you’ll need to get a Microsoft Azure account.

When you’ve done that, install the Azure command line interface (CLI). On a Mac:

curl -L | bash

For me, this installed to ~/bin/az.

With the client installed, login:

~/bin/az login

This requires a token based handshake with a Microsoft authentication website.

List the range of machine images available (if you haven’t set the path to az, use the full ~/bin/az):

az vm image list

There was only one that looked suitable to me for my purposes: Canonical:UbuntuServer:16.04-LTS:latest.

To run the provisioner, we need a Subscription ID; this will be used to set the vagrant .subscription_id parameter. These are listed on the Azure Portal.

We also need to create an Active Directory Service Principal:

az ad sp create-for-rbac

This information will be used to configure the Vagrantfile: the appId sets the vagrant .client_id, the password the .client_secret, and the tenant the .tenant_id.

You can also inspect the application in the App Registrations area of the Azure Portal.

Now let’s set up Vagrant. We’ll use the vagrant-azure plugin:

vagrant plugin install vagrant-azure --plugin-version '2.0.0.pre6'

We need to add a dummy box:

vagrant box add azure

Now let’s set up the Vagrantfile:

config.vm.provider :azure do |azure, override|
    #The path to your ssh keys
    override.ssh.private_key_path = '~/.ssh/id_rsa'

    #The default box we added = 'azure'
    #Set a territory

    #Provide your own group and VM name

    # Set an appropriate image (the UbuntuServer is actually the current default value)

    #Use a valid subscription ID
    azure.subscription_id = ENV['AZURE_SUBSCRIPTION_ID']

    # Using details from the Active Directory Service Principal setup
    azure.tenant_id = ENV['AZURE_TENANT_ID']
    azure.client_id = ENV['AZURE_CLIENT_ID']
    azure.client_secret = ENV['AZURE_CLIENT_SECRET']


With the Vagrantfile parameters in place, we should then be able call the Azure provider using the command:

vagrant up --provider=azure

But we’re still not quite done… If you’re running services on the VM, populated from elsewhere in the Vagranfile, you’ll need to add some security rules to make the ports accessible. I’m running services on ports 80,35180 and 35181 for example:

az vm open-port -g tm351azuretest -n tm351azuretest --port 80 --priority 130
az vm open-port -g tm351azuretest -n tm351azuretest --port 35180 --priority 140
az vm open-port -g tm351azuretest -n tm351azuretest --port 35181 --priority 150

Now we can lookup the IP address of the server:

az vm list-ip-addresses

and see if our applications are there :-)

Getting Web Services Up and Running on Amazon Web Services (AWS) Using Vagrant and the AWS CLI

From past experience of trying to get things up and running with AWS (Amazon Web Services), it can be a bit of a faff trying to work out what to set where the first time. So here’s an example of how to get a browser based application up and running on EC2 using vagrant from the command line.

(If you want to work through sorting the settings out via the AWS online management console, try Oliver Veits’ tutorial AWS Automation based on Vagrant — Part 2: Installation and Usage of the Vagrant AWS Plugin; you might also need to refer to Part 1: Getting started with AWS.)

This post in part assumes you know how to provision your own virtual machine locally using Vagrant. Here are the steps you need to take to be able to run an AWS provisioner (on a Mac or Linux machine… not sure about Windows?).

First up – sign up for AWS (get credit via the Github Education Pack)…

Pick up some credentials via AWS root Security Credentials (Access Keys (Access Key ID and Secret Access Key)):

Ensure that the key is active (Make Active).

There’s quite a bit of set up to do to configure the provisioner script. This can be done on the command line using the Amazon Command Line Interface (AWS CLI):

pip install --upgrade --user awscli

Now you need to configure the AWS CLI:

aws configure

Use the security credentials you picked up to configure the client*.

When we launch the AWS machine, vagrant needs to be able to access it via ssh using the public IP address automatically assigned to the machine. In deployment too, if we’re building specific services we want to be able to access over the web, we need to open up access to the ports those services are listening on.

By default, the machine will be locked down, so we need to open up specific ports by setting security rules. These are assigned on the basis of a security group. So lets create one of those (mine is named after the course VM I’m building…):

aws ec2 create-security-group --group-name tm351cloud --description "Security group for tm351 services"

We’re going to use this group in the .security_groups parameter in the Vagrantfile.

Now we need to create the security group rules. In my case, I want to open up ssh (port 22) to allow incoming traffic from my IP address, and ports 80, 35180 and 35181 to allow http traffic from anywhere. (The /0 suffix in the rules allows any IP format.)

aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 22 --cidr ${MYIP}/0
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 80 --cidr
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35180 --cidr
aws ec2 authorize-security-group-ingress --group-name tm351cloud --protocol tcp --port 35181 --cidr

# Check the policies
aws ec2 describe-security-groups --group-names tm351cloud

Having opened up at least the ssh port 22, we need to set up some SSH keys with a particular name (vagrantaws) that we will use with the vagrant .keypair_name parameter, and save them to a local file (vagrantaws.pem) with the appropriate permissions.

aws ec2 create-key-pair --key-name vagrantaws --query 'KeyMaterial' --output text > vagrantaws.pem
chmod 400 vagrantaws.pem

The vagrant provisioner also requires specific access tokens (.access_key_id, .secret_access_key, .session_token) to access EC2. Create these tokens, entering your own duration (in seconds):

aws sts get-session-token --duration-seconds 129600

Now we can start to look at the Vagrant set up. Install the vagrant AWS provisioner:

vagrant plugin install vagrant-aws

After setting up the Vagrantfile, you will be able to provision your machine on AWS using:

vagrant up --provider=aws

Add a dummy box:

vagrant box add awsdummy

Now let’s look at the Vagrantfile:

#Set up the provider block
config.vm.provider :aws do |aws, override|

    #Use the ec2 security group set previously
    aws.security_groups = ["tm351cloud"]

    #Whatever name we want
    override.vm.hostname = "tm351aws"

    #The name of the dummy box we added = "awsdummy"
    #Set up machine access using our keypair name and ssh key path
    override.ssh.username = "ubuntu"
    override.ssh.private_key_path = "vagrantaws.pem"

    #Use the values generated by the session token generator
    aws.access_key_id = "YOUR_KEY_ID"
    aws.secret_access_key = "YOUR_SECRET_ACCESS_KEY"
    aws.session_token = "YOUR_SESSION_TOKEN"

    #Specify a region and valid ami for that region, along with the desired instance size
    aws.region = "eu-west-1"
    aws.ami = "ami-971238f1" 


Running vagrant up --provider=aws should run the Vagrant provisioner with the AWS provider. Running vagrant destroy will tear down the machine (so you don’t keep paying for it… I think the users, security groups and keypairs are free?)

To check on the IP address of your instance, run:

aws ec2 describe-instances

or check on the AWS EC2 console. You can also check the machine is ripped down correctly when you have finished with it from there.

(I need to check what happens if you vagrant suspend and then vagrant resume. Presumably, the state is preserved, but you are billed for storage, if not running time?)

*Alternatively, we could create a specific user with more limited credentials.

Create a user we can use to help set up the credentials to use with the vagrant provisioner:

aws iam create-user --user-name vagrant

Now we need to give that user permissions to build our EC2 instance, by attaching an appropriate security policy (AmazonEC2FullAccess). In other words, the Vagrantfile will make use of the AWS vagrant user to provision the machine, so we need to give that AWS user the appropraite permissions on AWS:

aws iam attach-user-policy --user-name vagrant --policy-arn arn:aws:iam::aws:policy/AmazonEC2FullAccess

Generate some keys:

aws iam create-access-key --user-name vagrant

Then run aws configure with the new keys.

Simple Authenticated Access to VM Services Using NGINX and Vagrant Port Forwarding

Tinkering with the OU TM351 VM, looking at putting together an Amazon AWS AMI version, I started to wonder about how I could add a simple authentication layer to mediate public web access so students don’t fire up an image on their dollar and then find other folk using it.

So… h/t to Adam McGreggor for pointing me to nginx. Using this and a smattering of other cribs, I soon got to this (

#!/usr/bin/env bash

#Install nginx
#apache2-utils contains htpassword command to configure password used to restrict access to target ports
sudo apt-get update && sudo apt-get install -y nginx apache2-utils

#Create a password (test) for user tm351
#Optionally set password via environment variable - TMP_PASS - from Vagrantfile
#If TMP_PASS not set, use default password: test
sudo htpasswd -b -c /etc/nginx/.htpasswd tm351 "${TMP_PASS-test}"

Now we need to create a config file for nginx. Define each service separately, on the top level path (/) for each service (which is referenced relative to its own port).

#Jupyter notebook running on port 8888 inside the VM
upstream notebooks {

#OpenRefine running on port 3333 inside the VM
upstream refine {

#Create a simple (unauthenticated) server on port 80
#The served files should be placed in /var/www/html/*
server {
  listen 80;
  location / {
    root /var/www/html ;
    index index.html;

server {
  #Configure the server to listen on internal port 35180 as an authenticated proxy for internal 8888
  listen 35180;

  auth_basic "Protected...";
  auth_basic_user_file /etc/nginx/.htpasswd;

  location / {
    proxy_pass http://notebooks;
    proxy_redirect off;

server {
  #Configure the server to listen on internal port 35181 as an authenticated proxy for internal 8888
  listen 35181;
  auth_basic "Protected...";
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://refine;
    proxy_redirect off;
sudo echo "$config" > /etc/nginx/sites-available/default

#if that doesn't work, eg wrt permissions, try a workaround:
#sudo echo "$config" > default
#sudo mv default /etc/nginx/sites-available/default
#sudo chmod 0644 /etc/nginx/sites-available/default
#sudo chown root /etc/nginx/sites-available/default
#sudo chown :root /etc/nginx/sites-available/default

#Restart nginx with the new configuration
sudo service nginx reload

The password (set on the command line vagrant is called from using export TMP_PASS="NEW PASSWORD") can be passed in from the Vagrantfile for use by as follows:

config.vm.provision :shell, :env => {"TMP_PASS" => ENV["TMP_PASS"]}, :inline => <<-SH
  	source /vagrant/build/

Setting up port forwarding in my Vagrantfile then looks like this:

config.vm.provider :virtualbox do |virtualbox|

	#jupyter authenticated - expose internal port 35180 on localhost:35180 :forwarded_port, guest: 35180, host: 35180, auto_correct: true

	#refine authenticated - expose internal port 35181 on localhost:35181 :forwarded_port, guest: 35181, host: 35181, auto_correct: true


Running the vagrant provisioner, I now have simple authenticated access to the notebook and OpenRefine servers:

Could be a handy quick recipe, that…

See also: Course Apps in the the Cloud – Experimenting With Open Refine on Digital Ocean, Linode and AWS / Amazon EC2 Web Services

PS Only of course it doesn’t quite work like that – because the I’d originally defined the services to be listening over all network ranges on… instead they need to listen on…

Course Apps in the the Cloud – Experimenting With Open Refine on Digital Ocean, Linode and AWS / Amazon EC2 Web Services

With OUr data management and analysis course coming up to its third presentation start in October, various revisions and updates are currently being made to the materials, in part based on feedback from students, in part based the module team’s reflections on how the course material is performing.

We also have an opportunity to update the virtual machine supplied to students, so I’ve spent the last couple of days poking around in the various script rewrites I’ve toyed with over the last couple of years. When we started the course, Jupyter notebooks were still called IPython notebooks, and the ecosystem was still in its infancy. But whilst the module review process means changes are supposed to be kept to a minimum, there is still an opportunity to bake a few more tools into the VM that didn’t exist a couple of years ago when the VM was first gold mastered. (I’ll do a review of some of the Jupyter notebook features that I think should be released into the VM in another post.)

When the VM was first put together, I took it as an opportunity to explore automated build processes. The VM itself was built from Puppet scripts orchestrated from Vagrant, with another Vagrant script managing the machine we delivered to students (setting up shared folders, handling port forwarding, and giving the internal services a kick if required). I also explored a dockerised version, but Docker too was still in its infancy when we first looked at how to best virtualise the services and apps distributed as part of the course materials (IPython/Jupyter notebooks, PostgreSQL, MongoDB and OpenRefine). With Docker now having native versions for recent Macs and Windows platforms, I thought it might be worth exploring again; but OUr student computing policy means we have to build to lowest common denominator machines that are years old (though I’m ignoring the 32 bit hardware platform constraint and we’ll post an online workaround – or ship a Raspberry Pi version of the VM – if we have to!).

So… to demo where I’m at in terms of process, and keep a note to myself, the build has forsaken Puppet and I’ve gone back to simple shell scripts. As an example of most of the tricks I’ve had to invoke, I’ll post recipes for getting OpenRefine up and running on several virtual hosts in several different ways. Still to do is a dockerised version and and RPi version of the TM351 VM config, but I’m hoping the shell scripts will all be reusable (and if not, I’ll try to tweak them so they work as is as part of whatever build process is required…

To begin with, the builder shell scripts are as follows (.sh files all end up requiring execute permissions granted somehow…).

Structure is:


The main build script calls a script to add in base packages, and scripts for each application (in their own folder). I really should have had the same invocation filename or filename pattern (e.g. reusing the directory name) in each build folder.

## ./quickbuild/
#chmod ugo+x on this file

#!/usr/bin/env bash
#Set the base build directory to the one containing this script
THISDIR=$(dirname "$0")

chmod ugo+x $THISDIR/
chmod ugo+x $THISDIR/openrefine/

#Build script for building machine


#tidy up
apt-get autoremove -y && apt-get clean && updatedb

The base packages script does some updating of package lists and then pulls in a range of essential utility packages, some of which are actually required for builds further down the line.

## ./quickbuild/

#!/usr/bin/env bash

#Build script for building machine
apt-get clean && apt-get -y update && apt-get -y upgrade && apt-get install -y bash-completion vim curl zip unzip bzip2 && apt-get install -y build-essential gcc && apt-get install -y g++ gfortran && apt-get install -y libatlas-base-dev libfreetype6-dev libpng-dev libhdf5-serial-dev && apt-get install -y git python3 python3-dev python3-pip && pip3 install --upgrade pip

The application build files install additional packages specific to the application or its build process. We had some issues with service starts in the original VM (Ubuntu 14.04 LTS), but the service management in Ubuntu 16.04 LTS is much cleaner – and in my own testing so far, much more reliable.

# ./quickbuild/openrefine/

THISDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

apt-get -y update && apt-get install -y wget ant unzip openjdk-8-jre-headless && apt-get clean -y

echo "Setting up OpenRefine: "

#Prep for download
mkdir -p /opt
mkdir -p /root

if [ ! -f /opt/openrefine.done ]; then
	echo "Downloading OpenRefine..."
	wget -q --no-check-certificate  -P /root
	echo "...downloaded OpenRefine"

	echo "Unpacking OpenRefine..."
	tar -xzf /root/openrefine-linux-2.7-rc.2.tar.gz -C /opt  && rm /root/openrefine-linux-2.7-rc.2.tar.gz
	#Unpacks to: /opt/openrefine-2.7-rc.2
	touch /opt/openrefine.done
	echo "...unpacked OpenRefine"
	echo "...already downloaded and unpacked OpenRefine"

cp $THISDIR/services/refine.service /lib/systemd/system/refine.service

# Enable autostart
sudo systemctl enable refine.service

# Refresh service config
sudo systemctl daemon-reload

#(Re)start service
sudo systemctl restart refine.service

Applications are run as services, where possible. If I get a chance – and space/resource requirements allow – I made add some service monitoring to try to ensure application services are always running when the VM is running.

## ./quickbuild/openrefine/services/refine.service

#When to bring the service up
#Wait for a network stack to appear
#If we actually need the network to have a routable IP address: 

ExecStart=/opt/openrefine-2.7-rc.2/refine -p 3334 -d /vagrant/openrefine_projects


Everything can be packaged up in a zip file with a command (tuned to omit Mac cruft, in part) of the form:

zip -r quickbuild -x *.vagrant* -x *.DS_Store -x *.git* -x *.ipynb_checkpoints*

So those are the files and the basic outline. Our initial plan is to run the VMs once again locally on a student’s own machine, using Virtualbox. I think we’ll stick with vagrant to manage this, not least because we can issue updates via new Vagrantfiles, not that we’ve done that to date…

By the by, I’m running vagrant with a handful of plugins:

#Speed up repeated builds
vagrant plugin install vagrant-cachier

#Use correct Virtualbox Guest Additions
vagrant plugin install vagrant-vbguest

#Help with provisioning to virtual hosts
vagrant plugin install vagrant-digitalocean
vagrant plugin install vagrant-linode
vagrant plugin install vagrant-aws

The following Vagrantfile builds the local Virtualbox instance by default. To build to DOgital Ocean or Linode, use the following:

  • vagrant up --provider=digital_ocean
  • vagrant up --provider=linode

I didn’t get the AWS vagrant provisioner to work (too many things to go wrong in terms of settings!)

The Linode build also required a hack to get the box to build correctly…

# ./quickbuild/Vagrantfile

#Vagrantfile for building machine from build scripts

Vagrant.configure("2") do |config|

#------------------------- PROVIDER: VIRTUALBOX (BUILD) ------------------------------

  config.vm.provider :virtualbox do |virtualbox|

      #ubuntu/xenial bug? = "bento/ubuntu-16.04"
      #Stick with the default key

      #For local testing: = "tm351basebuild"
      #override.vm.box_url = "eg URL on dropbox"
      #config.vm.box_url = "../boxes/"

      config.vm.hostname = "tm351base" = "tm351basebuildbuild"
      #We need the memory to install scipy and build indexes on seeded mongodb
      #After the build it can be reduced back down to 1024
      virtualbox.memory = 2048
      #virtualbox.cpus = 1
      # virtualbox.gui = true

      #---- START PORT FORWARDING ----
      #Registered ports:
      #openrefine :forwarded_port, guest: 3334, host: 35101, auto_correct: true

      #---- END PORT FORWARDING ----

#------------------------- END PROVIDER: VIRTUALBOX (BUILD) ------------------------------

#------------------------- PROVIDER: DIGITAL OCEAN ------------------------------

config.vm.provider :digital_ocean do |provider, override|
        override.ssh.private_key_path = '~/.ssh/id_rsa' = 'digital_ocean'
        override.vm.box_url = ""
        provider.token = 'YOUR_TOKEN'
        provider.image = 'ubuntu-16-04-x64'
        provider.region = 'lon1'
        provider.size = '2gb'


#------------------------- END PROVIDER: DIGITAL OCEAN ------------------------------

#------------------------- PROVIDER: LINODE ------------------------------

config.vm.provider :linode do |provider, override|
    override.ssh.private_key_path = '~/.ssh/id_rsa' = 'linode/ubuntu1604'

    provider.api_key = 'YOUR KEY'
    provider.distribution = 'Ubuntu 16.04 LTS'
    provider.datacenter = 'london'
    provider.plan = 'Linode 2048'

    #grub needs updating - but want's to do it interactively
    #this bit of voodoo from Stack Overflow hacks a non-interactive install of it
    override.vm.provision :shell, :inline => <<-SH
    	apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y -o DPkg::options::="--force-confdef" -o DPkg::options::="--force-confold"  install grub-pc


#------------------------- END PROVIDER: LINODE ------------------------------

#------------------------- PROVIDER: AWS ------------------------------


  config.vm.provider :aws do |aws, override|
  	config.vm.hostname = "tm351aws"
  	#vagrant box add dummy = "dummy"
    aws.access_key_id = ""
    aws.secret_access_key = ""

    #Download and install the Amazon Command Line Interface
    #Configure the command line interface
    #$aws configure
    #Request the session token
    #$aws sts get-session-token --duration-seconds 129600 (enter your own duration)
    aws.session_token = ""

    #Keypair also generated via AWS console?
    aws.keypair_name = "vagrantAWSkeypair"

    aws.region = "eu-west-2a"
    aws.ami = "ami-ed908589"

    override.ssh.username = "ubuntu"
    override.ssh.private_key_path =  '~/.ssh/id_rsa'



#------------------------- END PROVIDER: AWS ------------------------------


  config.vm.provision :shell, :inline => <<-SH
  	#Add build scripts here
  	cd /vagrant/build
  	source ./


(The vagrant script can be tidied to hide keys by setting eg export DIGITAL_OCEAN_TOKEN="YOUR TOKEN HERE" from the command line you call vagrant from, and in the Vagrantfile setting provider.token = ENV['DIGITAL_OCEAN_TOKEN']).)

One of the nice things about the current version of vagrant is that you have to destroy a machine before launching another one of the same name with a different provisioners (though this looks set to change in forthcoming versions of vagrant). Why nice? Because the vagrant destroy command kills the node the machine is running on – so it won’t be left running and you won’t forget to turn it off (and won’t keep the meter running….)

Firing up the boxes on various hosts, go to port 3334 at the appropriate IP address and you should see OpenRefine running there…

Having failed to get the machine up and running on AWS, I though I’d try the simple route of packaging an AMI using Packer.

The build script was remarkably simple – once I got one that worked!


  "variables": {
    "aws_access_key": "",
    "aws_secret_key": ""
  "builders": [{
    "type": "amazon-ebs",
    "access_key": "{{user `aws_access_key`}}",
    "secret_key": "{{user `aws_secret_key`}}",
    "region": "eu-west-1",
    "source_ami": "ami-971238f1",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "openrefine",
    "security_group_id": "OPTIONAL_YOUR_VAGRANT_GROUP"

  "provisioners": [

      "destination": "/tmp/",
      "source": "./toupload/",
      "type": "file"
      "inline": [
        "cd /tmp && sudo apt-get update && sudo apt-get install unzip && sudo unzip /tmp/ -d /tmp && sudo chmod ugo+x /tmp/quickbuild/ && sudo /tmp/quickbuild/ "
      "type": "shell"


(The eu-west-2 (London) region wasn’t recognised by Packer for some reason…)

The machine can now be built on AWS and packaged as an AMI using Packer as follows (top level security tokens can be generated from the AWS Security Credentials console):

#Package the build files
mkdir -p toupload && zip -r toupload/ quickbuild -x *.vagrant* -x *.DS_Store -x *.git* -x *.ipynb_checkpoints*

#Pack the machine
packer build -var 'aws_access_key=YOUR_KEY' -var 'aws_secret_key=YOUR_SECRET' awspacker.json

Launching an instance of this AMI, I found that I couldn’t connect to the OpenRefine port (it just hung). The fix was to amend the automatically created security group rules (which by default just allow ssh on port 22) with a a Custom TCP rule that allowed incoming traffic on port 3334 from All Domains.

Which meant success:

To simplify matters, I then copied this edited security group to my own “openrefine” security group that I could use as the basis of the AMI packaging.

Just one thing to note about creating an AMI – Amazon will start billing you for it… As the Packer Getting Started guide suggests:

After running the above example, your AWS account now has an AMI associated with it. AMIs are stored in S3 by Amazon, so unless you want to be charged about $0.01 per month, you’ll probably want to remove it. Remove the AMI by first deregistering it on the AWS AMI management page. Next, delete the associated snapshot on the AWS snapshot management page.

Next up, I need to try a full build of the TM351 VM on AWS (a full build without the Mongo shard activity (which I couldn’t get to work yesterday – though this looks like it could provide a handy helper script (and I maybe also need to work through this.) The fuller build seems fine from the vagrant script in Virtualbox, Digital Ocean and Linode.

After that (and fixing the Mongo sharding thing), I’ll see if I can weave the build scripts into a set of interconnected Docker containers, one Dockerfile per application and a docker-compose.yml to weave them together. (See the original test from way back when.)

And then there’ll just be the look-see to see whether we can get the machine built and running on a Raspberry Pi 3 model B.

I also started wondering about whether I should pop a simple Flask app into the VM on port 80, showing an OU splash screen and a “Welcome to TM351” message… If I can get that running, then we have a means of piping stuff into a web page on the students’ own machines that is completely out of the controlling hands of LTS:-)

PS for an example of how to set up authentication over these services, see: Simple Authenticated Access to VM Services Using NGINX and Vagrant Port Forwarding.

Tracking down Data Files Associated With Parliamentary Business

One of the ways of finding data related files scattered around an organisations website is to run a web search using a search limit that specifies a data-y filetype, such as xlsx  for an Excel spreadsheet (csv and xls are also good candidates). For example, on the Parliament website, we could run a query along the lines of filetype:xlsx and then opt to display the omitted results:

Taken together, these files form an ad hoc datastore (e.g. as per this demo on using FOI response on WhatDoTheyKnow as an “as if” open datastore).

Looking at the URLs, we see that data containing files are strewn about the online Parliamentary estate (that is, the website;-)…

Freedom of Information Related Datasets

Parliament seems to be quite open in the way is handles its FOI responses, publishing disclosure logs and releasing datafile attachments rooted on

Written Questions

Responses to Written Questions often come with datafile attachments.

These are files are posted to the subdomain

Given the numeric key for a particular question, we can run a query on the Written Answers API to find details about the attachment:

Looking at the actual URL , something like, it looks as if some guesswork is required generating the URL from the data contained in the API response? (For example, how might original attachments might distinguish from other attachments (such as “revised” ones, maybe?).)

Written Statements

Written statements often come with one of more data file attachments.

The data files also appear on the subdomain although it looks like they’re on a different path to the answered question attachments ( compared to This subdomain doesn’t appear to have the data files indexed and searchable on Google? I don’t see a Written Statements API on either?

Deposited Papers

Deposited papers often include supporting documents, including spreadsheets.

Files are located under

At the current time there is no API search over deposited papers.

Committee Papers

A range of documents may be associated with Committees, including reports, responses to reports, and correspondence, as well as evidence submissions. These appear to mainly be PDF documents. Written evidence documents are rooted on and can be found from committee written evidence web (HTML) pages rooted on the same path (example).

A web search for inurl:committee (filetype:xls OR filetype:csv OR filetype:xlsx) doesn’t turn up any results.

Parliamentary Research Briefings

Research briefings are published by Commons and Lords Libraries, and may include additional documents.

Briefings may be published along with supporting documents, including spreadsheets:

The files are published under the following subdomain and path:

The file attachments URLs can be found via the Research Briefings API.

This response is a cut down result – the full resource description, including links to supplementary items, can be found by keying on the numeric identifier from the URI _about which the “naturally” identified resource (e.g. SN06643) is described.


Data files can be found variously around the Parliamentary website, including down the following paths:

(I don’t think the API supports querying resources that specifically include attachments in general, or attachments of a particular filetype?)

What would be nice would be support for discovering some of these resources. A quick way in to this would be the ability to limit search query responses to webpages that link to a data file, on the grounds that the linking web page probably contains some of the keywords that you’re likely to be searching for data around?