
Simple Authenticated Access to VM Services Using NGINX and Vagrant Port Forwarding

Tinkering with the OU TM351 VM, and looking at putting together an Amazon AWS AMI version of it, I started to wonder how I could add a simple authentication layer to mediate public web access, so that students don’t fire up an image on their own dollar and then find other folk using it.

So… h/t to Adam McGreggor for pointing me to nginx. Using this and a smattering of other cribs, I soon got to this (simple_auth.sh):

#!/usr/bin/env bash

#Install nginx
#apache2-utils contains the htpasswd command used to configure the password that restricts access to the target ports
sudo apt-get update && sudo apt-get install -y nginx apache2-utils

#Create a password (test) for user tm351
#Optionally set password via environment variable - TMP_PASS - from Vagrantfile
#If TMP_PASS not set, use default password: test
sudo htpasswd -b -c /etc/nginx/.htpasswd tm351 "${TMP_PASS-test}"

Now we need to create a config file for nginx. Each service is defined separately, on the top level path (/) for that service (each service is referenced relative to its own port).

config="$(cat <<'EOF'
#Jupyter notebook running on port 8888 inside the VM
upstream notebooks {
  server 127.0.0.1:8888;
}

#OpenRefine running on port 3333 inside the VM
upstream refine {
  server 127.0.0.1:3333;
}

#Create a simple (unauthenticated) server on port 80
#The served files should be placed in /var/www/html/*
server {
  listen 80;
  location / {
    root /var/www/html ;
    index index.html;
  }
}

server {
  #Configure the server to listen on internal port 35180 as an authenticated proxy for internal 8888
  listen 35180;

  auth_basic "Protected...";
  auth_basic_user_file /etc/nginx/.htpasswd;

  location / {
    proxy_pass http://notebooks;
    proxy_redirect off;
  }
}

server {
  #Configure the server to listen on internal port 35181 as an authenticated proxy for internal 3333
  listen 35181;
  auth_basic "Protected...";
  auth_basic_user_file /etc/nginx/.htpasswd;
  location / {
    proxy_pass http://refine;
    proxy_redirect off;
  }
}
"""
sudo echo "$config" > /etc/nginx/sites-available/default

#Alternative workaround, if writing the file directly hits permission issues:
#sudo echo "$config" > default
#sudo mv default /etc/nginx/sites-available/default
#sudo chmod 0644 /etc/nginx/sites-available/default
#sudo chown root /etc/nginx/sites-available/default
#sudo chown :root /etc/nginx/sites-available/default


#Reload nginx with the new configuration
sudo service nginx reload

The password (set on the command line vagrant is called from, using export TMP_PASS="NEW PASSWORD") can be passed in from the Vagrantfile for use by simple_auth.sh as follows:

config.vm.provision :shell, :env => {"TMP_PASS" => ENV["TMP_PASS"]}, :inline => <<-SH
  	source /vagrant/build/simple_auth.sh
  SH

Setting up port forwarding in my Vagrantfile then looks like this:

config.vm.provider :virtualbox do |virtualbox|

	#---- BEGIN PORT FORWARDING ----
	#jupyter authenticated - expose internal port 35180 on localhost:35180
	config.vm.network :forwarded_port, guest: 35180, host: 35180, auto_correct: true

	#refine authenticated - expose internal port 35181 on localhost:35181
	config.vm.network :forwarded_port, guest: 35181, host: 35181, auto_correct: true

	#---- END PORT FORWARDING ----
	
end

Running the vagrant provisioner, I now have simple authenticated access to the notebook and OpenRefine servers:

Could be a handy quick recipe, that…

See also: Course Apps in the Cloud – Experimenting With Open Refine on Digital Ocean, Linode and AWS / Amazon EC2 Web Services

PS Only of course it doesn’t quite work like that – because I’d originally defined the services to be listening on all network interfaces (0.0.0.0); instead, they need to listen only on 127.0.0.1, so that the authenticated nginx proxy provides the only public route in…
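For example, for the Jupyter notebook server the bind address can be set in jupyter_notebook_config.py, which is itself a Python file; the following is just a sketch of the relevant lines, not the actual TM351 config. (The equivalent change for OpenRefine is to the REFINE_HOST environment variable in its service definition.)

# jupyter_notebook_config.py (sketch)
# Bind the notebook server to localhost only, so the authenticated
# nginx proxy on port 35180 is the only public route to it
c = get_config()
c.NotebookApp.ip = '127.0.0.1'
c.NotebookApp.port = 8888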

Course Apps in the Cloud – Experimenting With Open Refine on Digital Ocean, Linode and AWS / Amazon EC2 Web Services

With OUr data management and analysis course coming up to its third presentation start in October, various revisions and updates are currently being made to the materials, in part based on feedback from students, in part based on the module team’s reflections on how the course material is performing.

We also have an opportunity to update the virtual machine supplied to students, so I’ve spent the last couple of days poking around in the various script rewrites I’ve toyed with over the last couple of years. When we started the course, Jupyter notebooks were still called IPython notebooks, and the ecosystem was still in its infancy. But whilst the module review process means changes are supposed to be kept to a minimum, there is still an opportunity to bake a few more tools into the VM that didn’t exist a couple of years ago when the VM was first gold mastered. (I’ll do a review of some of the Jupyter notebook features that I think should be released into the VM in another post.)

When the VM was first put together, I took it as an opportunity to explore automated build processes. The VM itself was built from Puppet scripts orchestrated from Vagrant, with another Vagrant script managing the machine we delivered to students (setting up shared folders, handling port forwarding, and giving the internal services a kick if required). I also explored a dockerised version, but Docker too was still in its infancy when we first looked at how to best virtualise the services and apps distributed as part of the course materials (IPython/Jupyter notebooks, PostgreSQL, MongoDB and OpenRefine). With Docker now having native versions for recent Macs and Windows platforms, I thought it might be worth exploring again; but OUr student computing policy means we have to build to lowest common denominator machines that are years old (though I’m ignoring the 32 bit hardware platform constraint and we’ll post an online workaround – or ship a Raspberry Pi version of the VM – if we have to!).

So… to demo where I’m at in terms of process, and keep a note to myself, the build has forsaken Puppet and I’ve gone back to simple shell scripts. As an example of most of the tricks I’ve had to invoke, I’ll post recipes for getting OpenRefine up and running on several virtual hosts in several different ways. Still to do is a dockerised version and an RPi version of the TM351 VM config, but I’m hoping the shell scripts will all be reusable (and if not, I’ll try to tweak them so they work as-is as part of whatever build process is required…)

To begin with, the builder shell scripts are as follows (.sh files all end up requiring execute permissions granted somehow…).

Structure is:

./quickbuild/quick_build.sh
./quickbuild/basepackages.sh
./quickbuild/openrefine/openrefine.sh
./quickbuild/openrefine/services/refine.service

The main build script calls a script to add in base packages, and scripts for each application (in their own folder). I really should have had the same invocation filename or filename pattern (e.g. reusing the directory name) in each build folder.

## ./quickbuild/quick_build.sh
#chmod ugo+x on this file

#!/usr/bin/env bash
#Set the base build directory to the one containing this script
THISDIR=$(dirname "$0")

chmod ugo+x $THISDIR/basepackages.sh
chmod ugo+x $THISDIR/openrefine/openrefine.sh

#Build script for building machine
$THISDIR/basepackages.sh

$THISDIR/openrefine/openrefine.sh

#tidy up
apt-get autoremove -y && apt-get clean && updatedb

The base packages script does some updating of package lists and then pulls in a range of essential utility packages, some of which are actually required for builds further down the line.

## ./quickbuild/basepackages.sh

#!/usr/bin/env bash

#Build script for building machine
apt-get clean && apt-get -y update && apt-get -y upgrade && \
  apt-get install -y bash-completion vim curl zip unzip bzip2 && \
  apt-get install -y build-essential gcc && \
  apt-get install -y g++ gfortran && \
  apt-get install -y libatlas-base-dev libfreetype6-dev libpng-dev libhdf5-serial-dev && \
  apt-get install -y git python3 python3-dev python3-pip && \
  pip3 install --upgrade pip

The application build files install additional packages specific to the application or its build process. We had some issues with service starts in the original VM (Ubuntu 14.04 LTS), but the service management in Ubuntu 16.04 LTS is much cleaner – and in my own testing so far, much more reliable.

# ./quickbuild/openrefine/openrefine.sh
#!/bin/bash

THISDIR="$( cd "$( dirname "${BASH_SOURCE[0]}" )" && pwd )"

apt-get -y update && apt-get install -y wget ant unzip openjdk-8-jre-headless && apt-get clean -y

echo "Setting up OpenRefine: "

#Prep for download
mkdir -p /opt
mkdir -p /root

if [ ! -f /opt/openrefine.done ]; then
	echo "Downloading OpenRefine..."
	wget -q --no-check-certificate  -P /root https://github.com/OpenRefine/OpenRefine/releases/download/2.7-rc.2/openrefine-linux-2.7-rc.2.tar.gz
	echo "...downloaded OpenRefine"

	echo "Unpacking OpenRefine..."
	tar -xzf /root/openrefine-linux-2.7-rc.2.tar.gz -C /opt  && rm /root/openrefine-linux-2.7-rc.2.tar.gz
	#Unpacks to: /opt/openrefine-2.7-rc.2
	touch /opt/openrefine.done
	echo "...unpacked OpenRefine"
else
	echo "...already downloaded and unpacked OpenRefine"
fi

cp $THISDIR/services/refine.service /lib/systemd/system/refine.service

# Enable autostart
sudo systemctl enable refine.service

# Refresh service config
sudo systemctl daemon-reload

#(Re)start service
sudo systemctl restart refine.service

Applications are run as services, where possible. If I get a chance – and space/resource requirements allow – I may add some service monitoring to try to ensure application services are always running when the VM is running.

## ./quickbuild/openrefine/services/refine.service
[Unit]
Description=Refine

#When to bring the service up
#via https://www.freedesktop.org/wiki/Software/systemd/NetworkTarget/
#Wait for a network stack to appear
After=network.target
#If we actually need the network to have a routable IP address:
#After=network-online.target 

[Service]
Environment=REFINE_HOST=0.0.0.0
ExecStart=/opt/openrefine-2.7-rc.2/refine -p 3334 -d /vagrant/openrefine_projects
Restart=always
RestartSec=10

[Install]
WantedBy=multi-user.target

Everything can be packaged up in a zip file with a command (tuned to omit Mac cruft, in part) of the form:

zip -r quickbuild.zip quickbuild -x *.vagrant* -x *.DS_Store -x *.git* -x *.ipynb_checkpoints*

So those are the files and the basic outline. Our initial plan is to run the VMs once again locally on a student’s own machine, using Virtualbox. I think we’ll stick with vagrant to manage this, not least because we can issue updates via new Vagrantfiles, not that we’ve done that to date…

By the by, I’m running vagrant with a handful of plugins:

#Speed up repeated builds
vagrant plugin install vagrant-cachier

#Use correct Virtualbox Guest Additions
vagrant plugin install vagrant-vbguest

#Help with provisioning to virtual hosts
vagrant plugin install vagrant-digitalocean
vagrant plugin install vagrant-linode
vagrant plugin install vagrant-aws

The following Vagrantfile builds the local Virtualbox instance by default. To build to Digital Ocean or Linode, use the following:

  • vagrant up --provider=digital_ocean
  • vagrant up --provider=linode

I didn’t get the AWS vagrant provider to work (too many things to go wrong in terms of settings!).

The Linode build also required a hack to get the box to build correctly…

# ./quickbuild/Vagrantfile

#Vagrantfile for building machine from build scripts

Vagrant.configure("2") do |config|

#------------------------- PROVIDER: VIRTUALBOX (BUILD) ------------------------------

  config.vm.provider :virtualbox do |virtualbox|

      #ubuntu/xenial bug? https://bugs.launchpad.net/cloud-images/+bug/1569237
      config.vm.box = "bento/ubuntu-16.04"
      #Stick with the default key
      config.ssh.insert_key=false

      #For local testing:
      #config.vm.box = "tm351basebuild"
      #override.vm.box_url = "eg URL on dropbox"
      #config.vm.box_url = "../boxes/test.box"

      config.vm.hostname = "tm351base"

      virtualbox.name = "tm351basebuildbuild"
      #We need the memory to install scipy and build indexes on seeded mongodb
      #After the build it can be reduced back down to 1024
      virtualbox.memory = 2048
      #virtualbox.cpus = 1
      # virtualbox.gui = true

      #---- START PORT FORWARDING ----
      #Registered ports: https://en.wikipedia.org/wiki/List_of_TCP_and_UDP_port_numbers
      #openrefine
      config.vm.network :forwarded_port, guest: 3334, host: 35101, auto_correct: true

      #---- END PORT FORWARDING ----
    end

#------------------------- END PROVIDER: VIRTUALBOX (BUILD) ------------------------------

#------------------------- PROVIDER: DIGITAL OCEAN ------------------------------

config.vm.provider :digital_ocean do |provider, override|
		override.ssh.insert_key=true
        override.ssh.private_key_path = '~/.ssh/id_rsa'
        override.vm.box = 'digital_ocean'
        override.vm.box_url = "https://github.com/devopsgroup-io/vagrant-digitalocean/raw/master/box/digital_ocean.box"
        provider.token = 'YOUR_TOKEN'
        provider.image = 'ubuntu-16-04-x64'
        provider.region = 'lon1'
        provider.size = '2gb'

  end

#------------------------- END PROVIDER: DIGITAL OCEAN ------------------------------

#------------------------- PROVIDER: LINODE ------------------------------

config.vm.provider :linode do |provider, override|
    override.ssh.insert_key=true
    override.ssh.private_key_path = '~/.ssh/id_rsa'
    override.vm.box = 'linode/ubuntu1604'

    provider.api_key = 'YOUR KEY'
    provider.distribution = 'Ubuntu 16.04 LTS'
    provider.datacenter = 'london'
    provider.plan = 'Linode 2048'
    provider.size=2048

    #grub needs updating - but wants to do it interactively
    #this bit of voodoo from Stack Overflow hacks a non-interactive install of it
    override.vm.provision :shell, :inline => <<-SH
    	apt-get update && DEBIAN_FRONTEND=noninteractive apt-get -y -o DPkg::options::="--force-confdef" -o DPkg::options::="--force-confold"  install grub-pc
	SH

  end

#------------------------- END PROVIDER: LINODE ------------------------------

#------------------------- PROVIDER: AWS ------------------------------

  #  I DIDN'T GET THIS TO WORK - MAYBE SEVERAL THINGS WRONG HERE - AND IN AWS SETTINGS ????

  config.vm.provider :aws do |aws, override|
  	config.vm.hostname = "tm351aws"
  	#vagrant box add dummy https://github.com/mitchellh/vagrant-aws/raw/master/dummy.box
    override.vm.box = "dummy"
    aws.access_key_id = ""
    aws.secret_access_key = ""

    #https://github.com/mitchellh/vagrant-aws/issues/405#issuecomment-130342371
    #Download and install the Amazon Command Line Interface
    #http://docs.aws.amazon.com/cli/latest/userguide/installing.html
    #Configure the command line interface
    #http://docs.aws.amazon.com/cli/latest/userguide/cli-chap-getting-started.html
    #$aws configure
    #Request the session token
    #$aws sts get-session-token --duration-seconds 129600 (enter your own duration)
    aws.session_token = ""

    #Keypair also generated via AWS console?
    aws.keypair_name = "vagrantAWSkeypair"

    aws.region = "eu-west-2"  #region, not an availability zone such as eu-west-2a
    aws.ami = "ami-ed908589"
    aws.instance_type="t2.small"

    override.ssh.username = "ubuntu"
    override.ssh.private_key_path =  '~/.ssh/id_rsa'

  end

  # NOTE THAT RUNNING THIS PROVIDER MAY LEAVE BILL-INCURRING THINGS RUNNING ON AWS... SO CHECK

#------------------------- END PROVIDER: AWS ------------------------------

#------------------------------

  config.vm.provision :shell, :inline => <<-SH
  	#Add build scripts here
  	cd /vagrant/build
  	source ./quick_build.sh
  SH

end

(The vagrant script can be tidied to hide keys by setting eg export DIGITAL_OCEAN_TOKEN="YOUR TOKEN HERE" from the command line you call vagrant from, and in the Vagrantfile setting provider.token = ENV['DIGITAL_OCEAN_TOKEN'].)

One of the nice things about the current version of vagrant is that you have to destroy a machine before launching another one of the same name with a different provider (though this looks set to change in forthcoming versions of vagrant). Why nice? Because the vagrant destroy command kills the node the machine is running on – so it won’t be left running, you won’t forget to turn it off, and you won’t keep the meter running…

Firing up the boxes on various hosts, go to port 3334 at the appropriate IP address and you should see OpenRefine running there…

Having failed to get the machine up and running on AWS, I thought I’d try the simple route of packaging an AMI using Packer.

The build script was remarkably simple – once I got one that worked!

#awspacker.json

{
  "variables": {
    "aws_access_key": "",
    "aws_secret_key": ""
  },
  "builders": [{
    "type": "amazon-ebs",
    "access_key": "{{user `aws_access_key`}}",
    "secret_key": "{{user `aws_secret_key`}}",
    "region": "eu-west-1",
    "source_ami": "ami-971238f1",
    "instance_type": "t2.micro",
    "ssh_username": "ubuntu",
    "ami_name": "openrefine",
    "security_group_id": "OPTIONAL_YOUR_VAGRANT_GROUP"
  }],

  "provisioners": [

    {
      "destination": "/tmp/",
      "source": "./toupload/",
      "type": "file"
    },
    {
      "inline": [
        "cd /tmp && sudo apt-get update && sudo apt-get install unzip && sudo unzip /tmp/quickbuild.zip -d /tmp && sudo chmod ugo+x /tmp/quickbuild/quick_build.sh && sudo /tmp/quickbuild/quick_build.sh "
      ],
      "type": "shell"
    }
  ]

}

(The eu-west-2 (London) region wasn’t recognised by Packer for some reason…)

The machine can now be built on AWS and packaged as an AMI using Packer as follows (top level security tokens can be generated from the AWS Security Credentials console):

#Package the build files
mkdir -p toupload && zip -r toupload/quickbuild.zip quickbuild -x *.vagrant* -x *.DS_Store -x *.git* -x *.ipynb_checkpoints*

#Pack the machine
packer build -var 'aws_access_key=YOUR_KEY' -var 'aws_secret_key=YOUR_SECRET' awspacker.json

Launching an instance of this AMI, I found that I couldn’t connect to the OpenRefine port (it just hung). The fix was to amend the automatically created security group rules (which by default just allow ssh on port 22) with a Custom TCP rule that allowed incoming traffic on port 3334 from anywhere (0.0.0.0/0).
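The same rule could also be added programmatically rather than through the console; here’s a rough boto3 sketch (the group id is a placeholder):

import boto3

ec2 = boto3.client('ec2', region_name='eu-west-1')
#Open the OpenRefine port (3334) to incoming traffic from anywhere
ec2.authorize_security_group_ingress(
    GroupId='sg-XXXXXXXX',  #placeholder - the security group the instance was launched with
    IpProtocol='tcp',
    FromPort=3334,
    ToPort=3334,
    CidrIp='0.0.0.0/0'
)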

Which meant success:

To simplify matters, I then copied this edited security group to my own “openrefine” security group that I could use as the basis of the AMI packaging.

Just one thing to note about creating an AMI – Amazon will start billing you for it… As the Packer Getting Started guide suggests:

After running the above example, your AWS account now has an AMI associated with it. AMIs are stored in S3 by Amazon, so unless you want to be charged about $0.01 per month, you’ll probably want to remove it. Remove the AMI by first deregistering it on the AWS AMI management page. Next, delete the associated snapshot on the AWS snapshot management page.

Next up, I need to try a full build of the TM351 VM on AWS (a full build without the Mongo shard activity, which I couldn’t get to work yesterday – though this looks like it could provide a handy helper script, and I maybe also need to work through this). The fuller build seems fine from the vagrant script in Virtualbox, Digital Ocean and Linode.

After that (and fixing the Mongo sharding thing), I’ll see if I can weave the build scripts into a set of interconnected Docker containers, one Dockerfile per application and a docker-compose.yml to tie them together. (See the original test from way back when.)

And then there’ll just be the look-see to see whether we can get the machine built and running on a Raspberry Pi 3 model B.

I also started wondering about whether I should pop a simple Flask app into the VM on port 80, showing an OU splash screen and a “Welcome to TM351” message… If I can get that running, then we have a means of piping stuff into a web page on the students’ own machines that is completely out of the controlling hands of LTS:-)
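Something as minimal as the following sketch would probably do for a first pass (port 80 means running the app as root, or putting a port-forward/proxy in front of it):

from flask import Flask

app = Flask(__name__)

@app.route('/')
def index():
    return '<h1>Welcome to TM351</h1>'

if __name__ == '__main__':
    #Port 80 is privileged - run as root, or proxy/forward from a higher port
    app.run(host='0.0.0.0', port=80)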

PS for an example of how to set up authentication over these services, see: Simple Authenticated Access to VM Services Using NGINX and Vagrant Port Forwarding.

Grouping Numbers that are Nearly the Same – Casual Clustering

There were a couple of reasons for tinkering with WRC rally data this year, over and above the obvious one of wanting to find a way to engage with motorsport at a data level. Specifically, I wanted a context for thinking a bit more about ways of generating (commentary) text from timing data, as well as a “safe” environment in which I could look for ways of identifying features (or storypoints) in the data that might provide a basis for making interesting text comments.

One way in to finding features is to look at a visual representation of the data (that is, just look at charts) and see what jumps out… If anything does, then you can ponder ways of automating the detection or recognition of those visually compelling features, or things that correspond to them, or proxy for them, in some way. I’ll give an example of that in the next post in this series, but for now, let’s consider the following question: how can we group numbers that are nearly the same? For example, if I have a set of stage split times, how can I identify groups of drivers that have recorded exactly, or even just nearly, the same time?

Via StackOverflow, I found the following handy fragment:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

It struck me that a tweak to the code could limit the range of any grouping relative to a maximum distance between the first and the last number in any particular grouping – maybe I don’t want a group to have a range more than 0.41 for example (that is, strictly more than a dodgy floating point 0.4…):

def cluster2(data, maxgap, maxrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x-groups[-1][0]) <=maxrange
        if abs(x - groups[-1][-1]) <= maxgap and inmaxrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster2([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41))
[[2.1, 2.4, 2.5], [2.52], [3.9], [4.6], [7.4]]

A downside of this is we might argue we have mistakenly omitted a number that is very close to the last number in the previous group, when we should rightfully have included it, because it’s not really very far away from a number that is close to the group range threshold value…

In which case, we might pull back into a group numbers that are really close to the current last member of the group, irrespective of whether we’re past the originally specified group range:

def cluster3(data, maxgap, maxrange=None, maxminrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x-groups[-1][0])<=maxrange
        inmaxminrange = False if maxminrange is None else abs(x-groups[-1][-1])<=maxminrange
        if (abs(x - groups[-1][-1]) <= maxgap and inmaxrange) or inmaxminrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

With these simple fragments, I can now find groups of times that are reasonably close to each other.

I can also look for times that are close to other times:

trythis = [x for x in cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25) if 2.4 in x]
trythis[0] if len(trythis) else ''
[2.1, 2.4, 2.5, 2.52]

PS I think the following vectorised pandas fragments assign group numbers to rows based on the near matches of numerics in a specified column:

import pandas as pd

def numclustergroup(x,col,maxgap):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    return x

def numclustergroup2(x,col,maxgap,maxrange):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    x['cdiff']=x.groupby('cluster')[col].diff()
    x['cluster'] = ((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap)).cumsum()
    return x.drop('cdiff',1)

def numclustergroup3(x,col,maxgap,maxrange,maxminrange):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    x['cdiff']=x.groupby('cluster')[col].diff()
    x['cluster'] = (((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap)) & (x[col].diff()>maxminrange) ).cumsum()
    return x.drop('cdiff',1)

#Test
uu=pd.DataFrame({'x':list(range(0,8)),'y':[1.3,2.1,7.4,3.9,4.6,2.5,2.4,2.52]})
numclustergroup(uu,'y',0.35)
numclustergroup2(uu,'y',0.35,0.41)
numclustergroup3(uu,'y',0.35,0.41,0.25)

The basic idea is to generate logical tests that evaluate as True whenever you want to increase the group number.
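For example, here’s a minimal sketch of that boolean-plus-cumsum trick applied to the toy data used above (same values and maxgap as the earlier examples):

import pandas as pd

#A True marks the start of a new group; the cumulative sum of the booleans
#then acts as the group number for each (sorted) value
s = pd.Series([2.1, 2.4, 2.5, 2.52, 3.9, 4.6, 7.4])
newgroup = s.diff() >= 0.35  #True wherever the gap to the previous value is "large"
print(newgroup.cumsum().tolist())
#[0, 0, 0, 0, 1, 2, 3] - four groups, matching cluster() with maxgap=0.35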

A First Attempt at Wrangling WRC (World Rally Championship) Data With pandas and matplotlib

Last year was a quiet year on the Wrangling F1 Data With R front, with a not-even-aborted start at doing a python/pandas equivalent project. With the new owners of F1 in place, things may change for the better in terms of engaging with fans and supporters, and I may revisit that idea properly, but in the meantime, I thought I’d start tinkering with a wider range of motorsport data.

The start to the BTCC season is still a few months away, but the WRC started over the weekend, and with review highlights and live coverage of one stage per rally on Red Bull TV, I thought I might give that data a go…

Results and timing info can be found on the WRC web pages (I couldn’t offhand find a source of official FIA timing sheets) so here’s a first quick sketch using stage results from the first rally of the year – Monte Carlo.

[Screenshot: World Rally Championship results for Monte Carlo, wrc.com]

To start with, we need to grab the data. I’m using the pandas library, which has a handy .read_html() method that can scrape tables (crudely) from an HTML page given its URL.

import pandas as pd

def getStageResultsBase(year,rallyid,stages):
    ''' Get stage results and overall results at end of stage '''
    
    # Accept one stage number or a list of stage numbers
    stages=[stages] if not isinstance(stages,list) else stages
    
    #There are actually two tables on the stage results page
    df_stage=pd.DataFrame()
    df_overallpart=pd.DataFrame()
    
    #Grab data for each stage
    for stage in stages:
        url='http://www.wrc.com/live-ticker/daten/{year}/{rallyid}/stage.{rallyid}.{stage}.all.html'.format(year=year, rallyid=rallyid, stage=stage)
        #scrape the data
        results=pd.read_html(url,encoding='utf-8')
        results[0].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        results[1].columns=['pos', 'carNo', 'driverName', 'time', 'diffPrev', 'diffFirst']
        
        #Simple cleaning - cast the data types as required
        for i in [0,1]:
            results[i].fillna(0,inplace=True)
            results[i]['pos']=results[i]['pos'].astype(float).astype(int)
            for j in ['carNo','driverName','time','diffPrev','diffFirst']:
                results[i][j]=results[i][j].astype(str)
        
        #Add a stage identifier
        results[0]['stage']=stage
        results[1]['stage']=stage
        
        #Add the scraped stage data to combined stage results data frames
        df_stage=pd.concat([df_stage,results[0]])
        df_overallpart=pd.concat([df_overallpart,results[1]])

    return df_stage.reset_index(drop=True), df_overallpart.reset_index(drop=True)

The data we pull back looks like the following.

[Screenshot: the scraped stage results dataframe]

Note that deltas (the time differences) are given as offset times in the form of a string. As the pandas library was in part originally developed for working with financial time series data, it has a lot of support for time handling. This includes the notion of a time delta:

pd.to_timedelta("1:2:3.0")
#Timedelta('0 days 01:02:03')

We can use this datatype to represent time differences from the results data:

#If we have hh:mm:ss format we can easily cast a timedelta
def regularTimeString(strtime):

    #Go defensive, just in case we're passed eg 0 as an int
    strtime=str(strtime)
    strtime=strtime.strip('+')

    modifier=''
    if strtime.startswith('-'):
        modifier='-'
        strtime=strtime.strip('-')

    timeComponents=strtime.split(':')
    ss=timeComponents[-1]
    mm=timeComponents[-2] if len(timeComponents)>1 else 0
    hh=timeComponents[-3] if len(timeComponents)>2 else 0
    timestr='{}{}:{}:{}'.format(modifier,hh,mm,ss)
    return pd.to_timedelta(timestr)

We can use the time handler to cast the time differences from the scraped data as timedelta typed data:

def getStageResults(year,rallyid,stages):
    df_stage, df_overallpart = getStageResultsBase(year,rallyid,stages)
    for col in ['time','diffPrev','diffFirst']:
        df_stage['td_'+col]=df_stage.apply(lambda x: regularTimeString(x[col]),axis=1)
        df_overallpart['td_'+col]=df_overallpart.apply(lambda x: regularTimeString(x[col]),axis=1)
    return df_stage, df_overallpart 

[Screenshot: stage results dataframe with the timedelta (td_*) columns added]

The WRC results cover all entrants to the rally, but not all the cars are classed as full-blown WRC cars (class RC1). We can limit the data to just the RC1 cars and generate a plot showing the position of each driver at the end of each stage:

%matplotlib inline
import matplotlib.pyplot as plt

rc1=df_overall[df_overall['groupClass']=='RC1'].reset_index(drop=True)

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='pos',ax=ax,legend=None);

[Chart: overall position of each RC1 driver at the end of each stage]

The position is actually the position of the driver across all entry classes, not just RC1. This means that if a driver has a bad day, they could be placed well down the all-class field; but that’s not of too much interest if all we’re interested in is in-class ranking.

So what about if we rerank the drivers within the RC1 class? And perhaps improve the chart a little by adding a name label to identify each driver at their starting position?

rc1['rank']=rc1.groupby('stage')['pos'].rank()

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='rank',ax=ax,legend=None)

#Add some name labels at the start
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['rank'], d['driverName'])

[Chart: in-class (RC1) rank by stage, with driver name labels at the start]

This chart is a bit cleaner, but now we lose information around the lower placed in-class drivers, in particular the information about their overall position when other classes are taken into account too…

The FIA recover this information in their stage chart displays, which report on the evolution of the race for the top 10 cars overall (irrespective of class): excursions outside the top 10 on interim stages are shown “below the line”, annotated further with the car’s overall classification on the corresponding stage.

[Screenshot: FIA stage chart]

We can use this idea by assigning a “re-rank” to each car if they are positioned outside the size of the class.

#Reranking...
RC1SIZE = 12  #Number of RC1 entries - the text below refers to the top 12 positions
rc1['xrank']= (rc1['pos']>RC1SIZE)
rc1['xrank']=rc1.groupby('stage')['xrank'].cumsum()
rc1['xrank']=rc1.apply(lambda row: row['pos'] if row['pos']<=RC1SIZE else row['xrank'] +RC1SIZE, axis=1)
fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None)

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

[Chart: re-ranked evolution of the RC1 class by stage]

The chart now shows the evolution of the race for the RC1 cars, retaining the spaced ordering of the top 12 positions that would be filled by WRC1/RC1 cars if they were all placed above cars from other classes, and then bunching those placed outside the group size. (Only 11 names are shown because one of the entries retired right at the start of the rally.)

So for example, in this case we see how Neuville, Hanninen and Serderidis are classed outside Lefebvre, who was actually classed 9th overall.

Further drawing on the FIA stage chart, we can label the excursions outside the top 12, and also invert the y-axis.

fig, ax = plt.subplots(figsize=(15,8))
ax.get_yaxis().set_ticklabels([])
rc1.groupby('driverName').plot(x='stage',y='xrank',ax=ax,legend=None);

#Label the out-of-class excursions with the car's overall position on that stage
for i,d in rc1[rc1['xrank']>RC1SIZE].iterrows():
    ax.text(d['stage']-0.1, d['xrank'], d['pos'], bbox=dict(boxstyle='round,pad=0.3',color='pink')) #facecolor='none',edgecolor='black',

#Name labels
for i,d in rc1[rc1['stage']==1].iterrows():
    ax.text(-0.5, d['xrank'], d['driverName'])
for i,d in rc1[rc1['stage']==17].iterrows():
    ax.text(17.3, d['xrank'], d['driverName'])

#Flip the y-axis
plt.gca().invert_yaxis()

Lefebvre’s excursions outside the top 12 are now recorded and plain to see.

[Chart: re-ranked evolution chart with out-of-class excursions labelled and the y-axis inverted]

We now have a chart that blends rank ordering with additional information showing where cars are outpaced by cars from other classes, in a space efficient manner.

PS as with Wrangling F1 Data With R, I may well turn this into a Leanpub book, this time exploring the workflow to markdown (and/or maybe reveal.js slides!) from Jupyter notebooks, rather than from RStudio/Rmd.

A Recipe for Automatically Going From Data to Text to Reveal.js Slides

Over the last few years, I’ve experimented on and off with various recipes for creating text reports from tabular data sets (spreadsheet plugins are also starting to appear with a similar aim in mind). There are several issues associated with this, including:

  • identifying what data or insight you want to report from your dataset;
  • (automatically deriving the insights);
  • constructing appropriate sentences from the data;
  • organising the sentences into some sort of narrative structure;
  • making the sentences read well together.

Another approach to humanising the reporting of tabular data is to generate templated webpages that review and report on the contents of a dataset; this has certain similarities to dashboard style reporting, mixing tables and charts, although some simple templated text may also be generated to populate the page.

In a business context, reporting often happens via Powerpoint presentations. Slides within the presentation deck may include content pulled from a templated spreadsheet, which itself may automatically generate tables and charts for such reuse from a new dataset. In this case, the recipe may look something like:


#render via: http://blockdiag.com/en/blockdiag/demo.html
{
  X1[label='macro']
  X2[label='macro']

  Y1[label='Powerpoint slide']
  Y2[label='Powerpoint slide']

   data -> Excel -> Chart -> X1 -> Y1;
   Excel -> Table -> X2 -> Y2 ;
}

In the previous couple of posts, the observant amongst you may have noticed I’ve been exploring a couple of components for a recipe that can be used to generate reveal.js browser based presentations from the 20% that account for the 80%.

The dataset I’ve been tinkering with is a set of monthly transparency spending data from the Isle of Wight Council. Recent releases have the form:

[Screenshot: Isle of Wight Council transparency spending data]

So as hinted at previously, it’s possible to use the following sort of process to automatically generate reveal.js slideshows from a Jupyter notebook with appropriately configured slide cells (actually, normal cells with an appropriate metadata element set) used as an intermediate representation.


{
  X1[label="text"]
  X2[label="Jupyter notebook\n(slide mode)"]
  X3[label="reveal.js\npresentation"]

  Y1[label="text"]
  Y2[label="text"]
  Y3[label="text"]

  data -> "pandas dataframe" -> X1  -> X2 ->X3
  "pandas dataframe" -> Y1,Y2,Y3  -> X2 ->X3

  Y2 [shape = "dots"];
}

There’s an example slideshow based on October 2016 data here. Note that some slides have “subslides”, that is, slides underneath them, so watch the arrow indicators bottom left to keep track of when they’re available. Note also that the scrolling is a bit hit and miss – ideally, a new slide would always be scrolled to the top, and for fragments inserted into a slide one at a time, the slide should scroll down to follow them.

The structure of the presentation is broadly as follows:

[Block diagram: structure of the generated presentation]

For example, here’s a summary slide of the spends by directorate – note that we can embed charts easily enough. (The charts are styled using seaborn, so a range of alternative themes are trivially available). The separate directorate items are brought in one at a time as fragments.

[Slide: summary of spend by directorate]

The next slide reviews the capital versus expenditure revenue spend for a particular directorate, broken down by expenses type (corresponding slides are generated for all other directorates). (I also did a breakdown for each directorate by service area.)

The items listed are ordered by value, and taken together account for at least 80% of the spend in the corresponding area. Any further items contributing more than 5%(?) of the corresponding spend are also listed.

[Slide: capital vs revenue spend for a directorate, broken down by expenses type]

Notice that subslides are available going down from this slide, rather than across the main slides in the deck. This 1.5D structure means we can put an element of flexible narrative design into the presentation, giving the reader an opportunity to explore the data, but in a constrained way.

In this case, I generated subslides for each major contributing expenses type to the capital and revenue pots, and then added a breakdown of the major suppliers for that spending area.

[Slide: major suppliers for a particular spending area]

This just represents a first pass at generating a 1.5D slide deck from a tabular dataset. A Pareto (80/20) heuristic is used to try to prioritise the information displayed in order to account for 80% of spend in different areas, or other significant contributions.

Applying this principle repeatedly allows us to identify major spending areas, and then major suppliers within those spending areas.
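As a sketch of how that heuristic might be implemented (this isn’t the actual notebook code – the Directorate, Supplier and Amount names are just placeholders for whatever column headings the spending data actually uses):

import pandas as pd

def pareto_items(df, group_col, value_col, share=0.8, also_over=0.05):
    #Sum the spend by group and order from largest to smallest
    totals = df.groupby(group_col)[value_col].sum().sort_values(ascending=False)
    frac = totals / totals.sum()
    #Keep items until the running total tips over the share threshold...
    keep = frac.cumsum().shift(1).fillna(0) < share
    #...plus any further items that are individually significant
    keep = keep | (frac > also_over)
    return totals[keep]

#e.g. pareto_items(spend, 'Directorate', 'Amount') for the top level summary, then
#pareto_items(spend[spend['Directorate']==d], 'Supplier', 'Amount') within each directorate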

The next step is to look at other ways of segmenting and structuring the data in order to produce reports that might actually be useful…

If you have any ideas, please let me know via the comments, or get in touch directly…

PS FWIW, it should be easy enough to run any other dataset that looks broadly like the example at the top through the same code with only a couple of minor tweaks…

Automatically Generating Two Dimensional Reveal.js Documents Using Jupyter Notebooks

One of the things I finally got round to exploring whilst at the Reproducible Research Using Jupyter Notebooks curriculum development hackathon was the ability to generate slideshows from Jupyter notebooks.

The underlying slideshow presentation framework is reveal.js. This uses a 1.5(?) dimensional slide geometry, so slides can transition left to right, or you can transition down to subslides off a single slide.

This got me wondering… could I use a notebook/script to generate a reveal.js slideshow that could provide a convenient way of navigating automatically generated slideshows made up from automatically generated data2text slides?

The 1.5/two dimensional component would mean that slides could be structured by topic horizontally, with subtopic vertically downwards within a topic.

A quick test suggests that this is absolutely doable…

import IPython.nbformat as nb
import IPython.nbformat.v4.nbbase as nb4
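#(In more recent installs, nbformat is a separate package: import nbformat as nb; import nbformat.v4.nbbase as nb4)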

test=nb4.new_notebook()
test.cells.append(nb4.new_markdown_cell('# Test slide1',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('# Test slide2',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide2 extra content line 1\n\nSlide2 extra content line 2'))
test.cells.append(nb4.new_markdown_cell('# Test slide3',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide3 fragment 1',metadata={"slideshow": {"slide_type": "fragment"}}))
test.cells.append(nb4.new_markdown_cell('Slide3 fragment 2',metadata={"slideshow": {"slide_type": "fragment"}}))
test.cells.append(nb4.new_markdown_cell('# Slide4',metadata={"slideshow": {"slide_type": "slide"}}))
test.cells.append(nb4.new_markdown_cell('Slide4 extra content line 1\n\nSlide4 extra content line 2'))
test.cells.append(nb4.new_markdown_cell('# Slide4 subslide',metadata={"slideshow": {"slide_type": "subslide"}}))

nbf='testslidenotebook.ipynb'
nb.write(test,nbf)

#Generate and render slideshow
!jupyter nbconvert $nbf --to slides --post serve

Let the fun begin…:-)

PS here’s a second pass:

def addSlideComponent(notebook, content, styp=''):
    if styp in ['slide','fragment','subslide']: styp={"slideshow": {"slide_type":styp}}
    else: styp={}
    notebook.cells.append(nb4.new_markdown_cell(content, metadata=styp))

test=nb4.new_notebook()
addSlideComponent(test,'# Test2 slide1','slide')
addSlideComponent(test,'# Test slide2','slide')
addSlideComponent(test,'Slide2 extra content line 1\n\nSlide2 extra content line 2')
addSlideComponent(test,'# Test slide3','slide')
addSlideComponent(test,'Slide3 fragment 1','fragment')
addSlideComponent(test,'Slide3 fragment 2','fragment')
addSlideComponent(test,'# Slide4','slide')
addSlideComponent(test,'Slide4 extra content line 1\n\nSlide4 extra content line 2')
addSlideComponent(test,'# Slide4 subslide','subslide')

nbf='testslidenotebook2.ipynb'
nb.write(test,nbf)
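
As before, the notebook can then be rendered and served with !jupyter nbconvert $nbf --to slides --post serve.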

Wrangling Time Periods (such as Financial Year Quarters) In Pandas

Looking at some NHS 111 and A&E data today, the data I was interested in was being reported over different sorts of period – specifically, months and quarters.

The pandas python library has quite a few tools for dealing with periods, so here are a couple of examples of tricks I put to use today.

Firstly, casting months to a month period. Monthly periods (in column df['Periodname']) were reported in the form “Dec-10”, “Jan-11”, etc, which is to say a three letter month followed by a two digit year representation. If we create a datetime on the first of the corresponding month, we can then cast that date to a month period for the corresponding month:

#Generate a pandas period for time series indexing purposes
#Create a date from each month (add the first date of the month) then set to period
#Note: if the datetime is an index, drop the .dt.
df['_period']=pd.to_datetime(('01-'+df['Periodname']), \
                             format='%d-%b-%y').dt.to_period('M')

If the datetimes form the dataframe index, we could drop the .dt. element and cast the timeseries index directly to the monthly period. The %b term maps onto the locale specific month description.
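For example (a quick sketch, reusing the Periodname parsing from above):

#If the parsed dates are used as the dataframe index, the .dt accessor isn't needed:
df2 = df.set_index(pd.to_datetime('01-'+df['Periodname'], format='%d-%b-%y'))
df2['_period'] = df2.index.to_period('M')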

The second example had financial years and dates in two columns – Year and Quarter. The months specified in the quarter description mix three letter representations and the fully written name of the month. (As with many NHS datasets, the text used to describe dates can be all over the place, even within a single spreadsheet or datafile!)

If we grab the first three letters of the month, we can identify the month in its short form. If we create a date stamp for the first day of the quarter (in the UK, corporate financial years run from 1 April to 31 March – it’s the personal tax year that runs from April 5th?), we can then cast to a quarterly reporting period whose year ends in a particular month, eg March (freq='Q-MAR').

[Screenshot: notebook code casting the Year/Quarter columns to a Q-MAR period]
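The screenshot shows the original notebook code; a rough reconstruction of the idea might look something like the following (assuming Year values along the lines of '2015/16' and Quarter labels that start with a month name, such as 'April - June' – the actual formats in the file may well differ):

#Cast a financial year / quarter description to a Q-MAR period
def fyQuarter(year, quarter):
    month = str(quarter).strip()[:3]  #first three letters of the (first) month in the label
    startyear = int(str(year)[:4])    #calendar year the financial year starts in
    #Jan-Mar fall in the later calendar year of the financial year
    calyear = startyear + 1 if month in ['Jan', 'Feb', 'Mar'] else startyear
    return pd.to_datetime('01-{}-{}'.format(month, calyear),
                          format='%d-%b-%Y').to_period('Q-MAR')

df['YearQuarter'] = df.apply(lambda row: fyQuarter(row['Year'], row['Quarter']), axis=1)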

Note that the YearQuarter uses the calendar year in which the financial year ends. We can check the start and end date of the quarters explicitly:

[Screenshot: quarter start and end dates checked explicitly]
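(Again, a sketch rather than the original notebook code:)

#Period objects expose their boundary dates directly
df['quarterStart'] = df['YearQuarter'].apply(lambda q: q.start_time)
df['quarterEnd'] = df['YearQuarter'].apply(lambda q: q.end_time)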

One issue when it comes to plotting the financial year quarterly periods is that the chart looks to me as if everything is a year out:

[Chart: quarterly data plotted against the Q-MAR period axis]

That is, I read the year as the calendar year, but it’s actually the calendar year in which a particular financial year ends.

So I’m wondering – should pandas really report and plot these periods using the calendar year in which the financial year starts? Is there a switch for this? Or do I need to submit a bug report (and if so, what’s the bug?)

PS for a notebook worked example, see here: NHS 111 example.