Category: Tinkering

Running a Shell Script Once Only in Vagrant

Via somewhere (I’ve lost track of the link), here’s a handy recipe for running a shell script once and once only from a Vagrantfile.

In the shell script, wrap the run-once commands in a check for a flag file:


if [ ! -f ~/runonce ]
then
  #Commands that should only ever run once go here...

  touch ~/runonce
fi

In the Vagrantfile (assuming the shell script above is saved as runonce.sh in the shared /vagrant folder):

  config.vm.provision :shell, :inline => <<-SH
    chmod ugo+x /vagrant/runonce.sh
    /vagrant/runonce.sh
  SH

Exporting and Distributing Docker Images and Data Container Contents

Although it was a beautiful day today, and I should really have spent it in the garden, or tinkering with F1 data, I lost the day to the screen and keyboard pondering various ways in which we might be able to use Kitematic to support course activities.

One thing I’ve had on pause for some time is the possibility of distributing docker images to students via a USB stick and then loading them into Kitematic. To do this, we need to get tarballs of the appropriate images so that we can then distribute them.

docker save psychemedia/openrefine_ou:tm351d2test | gzip -c > test_openrefine_ou.tgz
docker save psychemedia/tm351_scipystacknserver:tm351d3test | gzip -c > test_ipynb.tgz
docker save psychemedia/dockerui_patch:tm351d2test | gzip -c > test_dockerui.tgz
docker save busybox:latest | gzip -c > test_busybox.tgz
docker save mongo:latest | gzip -c > test_mongo.tgz
docker save postgres:latest | gzip -c > test_postgres.tgz

On the to do list is getting these to work with the portable Kitematic branch (I’m not sure if that branch will continue, or whether the interest is too niche?!), but in the meantime, I could load an image into the Kitematic VM from the Kitematic CLI using:

docker load < test_mongo.tgz

assuming the test_mongo.tgz file is in the current working directory.

Another thing I need to explore is how to set up the data volume containers on the students’ machines.

The current virtual machine build scripts aim to seed the databases from raw data, but to set up the student machines it would seem more sensible to either rebuild a database from a backup, or just load in a copy of the seeded data volume container. (All the while we have to be mindful of providing a route for the students to recreate the original, as distributed, setup, just in case things go wrong. At the same time, we also need to start thinking about backup strategies for the students so they can checkpoint their own work…)

The traditional backup and restore route for PostgreSQL seems to be something like the following:

#Use docker exec to run a postgres export
docker exec -t vagrant_devpostgres_1 pg_dumpall -Upostgres -c > dump_`date +%d-%m-%Y"_"%H_%M_%S`.sql
#If it's a large file, maybe worth zipping: pg_dump dbname | gzip > filename.gz

#The restore route would presumably be something like:
cat postgres_dump.sql | docker exec -i vagrant_devpostgres_1 psql -Upostgres
#For the compressed backup: cat postgres_dump.gz | gunzip | psql -Upostgres

For mongo, things seem to be a little bit more complicated. Something like:

docker exec -t vagrant_mongo_1 mongodump

#Complementary restore command is: mongorestore

would generate a dump in the container, but then we’d have to tar it and get it out? Something like these mongodump containers may be easier? (mongo seems to have issues with mounting data containers on host, on a Mac at least?)
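
One approach that should work is to run mongodump inside the container and then pull the dump directory out onto the host with docker cp – a rough sketch (the /dump output path and the ./mongodump destination directory here are just assumptions):

#Run mongodump inside the container, writing the dump to /dump (path assumed)
docker exec -t vagrant_mongo_1 mongodump --out /dump
#Copy the dump directory out of the container onto the host
docker cp vagrant_mongo_1:/dump ./mongodump
#mongorestore can then be pointed at the copied directory to rebuild the database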

By the by, if you need to get into a container within a Vagrant launched VM (I use vagrant with vagrant-docker-compose), the following shows how:

#If you need to get into a container:
vagrant ssh
#Then in the VM:
  docker exec -it CONTAINERNAME bash

Another way of getting to the data is to export the contents of the seeded data volume containers from the build machine. For example:

#  Export data from a data volume container that is linked to a database server

docker run --volumes-from vagrant_devpostgres_1 -v $(pwd):/backup busybox tar cvf /backup/postgresbackup.tar /var/lib/postgresql/data 

#I wonder if these should be run with --rm to dispose of the temporary container once run?

docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db

We can then take the tar file, distribute it to students, and use it to seed a data volume container.

Again, from the Kitematic command line, I can run something like the following to create a couple of data volume containers:

#Create a data volume container
docker create -v /var/lib/postgresql/data --name devpostgresdata busybox true
#Restore the contents
docker run --volumes-from devpostgresdata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/postgresbackup.tar"
#Note - the docker helpfiles don't show how to use sh -c - which appears to be required...
#Again, I wonder whether this should be run with --rm somewhere to minimise clutter?
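
For what it’s worth, adding the --rm flag to these one-shot containers should do just that – the temporary container is removed as soon as the tar command completes. For example:

docker run --rm --volumes-from devpostgresdata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/postgresbackup.tar"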

Unfortunately, things don’t seem to run so smoothly with mongo?

#Unfortunately, when trying to run a mongo server against a data volume container
#the presence of a mongod.lock seems to break things
#We probably shouldn't do this, but if the database has settled down and completed
#  all its writes, it should be okay?!
docker run --volumes-from vagrant_mongo_1 -v $(pwd):/backup busybox tar cvf /backup/mongobackup.tar /data/db --exclude=*mongod.lock
#This generates a copy of the distributable file without the lock...

#Here's an example of the reconstitution from the distributable file for mongo
docker create -v /data/db --name devmongodata busybox true
docker run --volumes-from devmongodata -v $(pwd):/backup ubuntu sh -c "tar xvf /backup/mongobackup.tar"

(If I’m doing something wrong wrt getting the mongo data out of the container, please let me know… I wonder as well, with the cavalier way I treat the lock file, whether the mongo container should be started up in repair mode?!)

If we have a docker-compose.yml file in the working directory like the following:

mongo:
  image: mongo
  ports:
    - "27017:27017"
  volumes_from:
    - devmongodata

##We DO NOT need to declare the data volume here
#We have already created it
#Also, if we leave it in, a "docker-compose rm" command
#will destroy the data volume container...
#...which means we wouldn't persist the data in it
#    command: echo created
#    image: busybox
#    volumes: 
#        - /data/db

We can then run docker-compose up and it should fire up a mongo container and link it to the seeded data volume container, making the data contained in that data volume container available to us.

I’ve popped some test files here. Download and unzip them, cd into the unzipped directory from the Kitematic CLI, create and populate the data containers as above, then run: docker-compose up

You should be presented with some application containers, including OpenRefine and an OU customised IPython notebook server. You’ll need to mount the IPython notebooks folder onto the unzipped folder. The example notebook (if everything works!) should demonstrate calls to prepopulated mongo and postgres databases.


Doodling With 3d Animated Charts in R

Doodling with some Gapminder data on child mortality and GDP per capita in PPP$, I wondered whether a 3d plot of the data over time would show different trajectories for different countries, perhaps revealing different development pathways.

Here are a couple of quick sketches, generated using R (this is the first time I’ve tried to play with 3d plots…)

#data downloaded from Gapminder
#Packages assumed: xlsx for read.xlsx, reshape2 for melt, plyr for rename
library(xlsx)
library(reshape2)
library(plyr)
#wb=loadWorkbook("indicator gapminder gdp_per_capita_ppp.xlsx")

#Set up dataframes
gdp=read.xlsx("indicator gapminder gdp_per_capita_ppp.xlsx", sheetName = "Data")
mort=read.xlsx("indicator gapminder under5mortality.xlsx", sheetName = "Data")

#Tidy up the data a bit

gdpm=melt(gdp, id.vars = 'GDP.per.capita', variable.name = 'year')
gdpm$year = as.integer(gsub('X', '', gdpm$year))
gdpm=rename(gdpm, c("GDP.per.capita"="country", "value"="GDP.per.capita"))

mortm=melt(mort, id.vars = 'Under.five.mortality', variable.name = 'year')
mortm$year = as.integer(gsub('X', '', mortm$year))
mortm=rename(mortm, c("Under.five.mortality"="country", "value"="Under.five.mortality"))

#The following gives us a long dataset by country and year with cols for GDP and mortality
gdpmort=merge(gdpm, mortm, by=c('country','year'))
#Filter out some datasets by country (the usa/china/bangladesh names are assumed)
usa=gdpmort[gdpmort['country']=='United States',]
bangladesh=gdpmort[gdpmort['country']=='Bangladesh',]
china=gdpmort[gdpmort['country']=='China',]

Now let’s have a go at some charts. First, let’s try a static 3d line plot using the scatterplot3d package:


library(scatterplot3d)
s3d = scatterplot3d(usa$year, usa$Under.five.mortality, usa$GDP.per.capita, 
                     color = "red", angle = -50, type='l', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year")
#Overplot the other countries on the same axes (the colour/country pairing is assumed)
s3d$points3d(china$year, china$Under.five.mortality, china$GDP.per.capita,
             col = "purple", type = "l")
s3d$points3d(bangladesh$year, bangladesh$Under.five.mortality, bangladesh$GDP.per.capita,
             col = "blue", type = "l")

Here’s what it looks like… (it’s worth fiddling with the angle setting to get different views):


A 3d bar chart provides a slightly different view:

s3d = scatterplot3d(usa$year, usa$Under.five.mortality, usa$GDP.per.capita, 
                     color = "red", angle = -50, type='h', zlab = "GDP.per.capita",
                     ylab = "Under.five.mortality", xlab = "year", pch = " ")
s3d$points3d(china$year, china$Under.five.mortality, china$GDP.per.capita,
             col = "purple", type = "h", pch = " ")
s3d$points3d(bangladesh$year, bangladesh$Under.five.mortality, bangladesh$GDP.per.capita,
             col = "blue", type = "h", pch = " ")


As well as static 3d plots, we can generate interactive ones using the rgl library.

Here’s the code to generate an interactive 3d plot that you can twist and turn with a mouse:

#Get the data from required countries - data cols are GDP and child mortality
library(rgl)
x.several = gdpmort[gdpmort$country %in% c('United States','China','Bangladesh'),]

plot3d(x.several$year, x.several$Under.five.mortality,  log10(x.several$GDP.per.capita),
       col=as.integer(x.several$country), size=3)

We can also set the 3d chart spinning….

play3d(spin3d(axis = c(0, 0, 1)))

We can also grab frames from the spinning animation and save them as individual png files. If you have ImageMagick installed, there’s a function that will generate the image files and weave them into an animated gif automatically.

It’s easy enough to install on a Mac if you have the Homebrew package manager installed. On the command line:

brew install imagemagick

Then we can generate a movie:

movie3d(spin3d(axis = c(0, 0, 1)), duration = 10,
        dir = getwd())

Here’s what it looks like:



A Quick Look at Planning Data on the Isle of Wight

One of the staples that I suspect many folk look to in the Isle of Wight’s weekly local paper is the listing of recent planning notices.

The Isle of Wight Council website also provides a reasonably comprehensive online source of planning information. Notices are split across several listings: current (and archived) applications, recent decisions, and appeals.

It’s easy enough to knock up a scraper to grab the list of current applications, scrape each of the linked-to application pages in turn, and then generate a map showing the locations of the current planning applications.


Indeed, working with my local hyperlocal, here’s a sketch of exactly such an approach, published for the first time yesterday: Isle of Wight planning applications : Mapped (announcement).

I’m hoping to do a lot more with OnTheWight – and perhaps others…? – over the coming weeks and months, so it’d be great to hear any feedback you have either here, or on the OnTheWight site itself.

Where Next?

The sketch is a good start, but it’s exactly that. If we are going to extend the service, for example, by also providing a means of reviewing recently accepted (or rejected) applications, as well as applications currently under appeal, we perhaps need to think a little bit more clearly about how we store the data – and keep track of where it is in the planning process.

If we look at the page for a particular application, we see that there are essentially three tables:


The listings pages also take slightly different forms. All of them have an address, and all of them have a planning application identification number (though in two forms, albeit intersecting); but they differ in terms of the semantics of the third and possibly fourth columns, although each ultimately resolves to a date or null value.

– current (and archive) listings:


– recent decisions:


– appeals:


At the moment, the OnTheWight sketchmap is generated from a scrape of the Isle of Wight Council current planning applications page (latitude and longitude are generated by geocoding the address). A more complete solution would be to start to build a database of all applications, though this requires a little bit of thought when it comes to setting up the database so it becomes possible to track the current state of a particular application.

It might also be useful to put together a simple flow chart that shows how the public information available around an application evolves as an application progresses, and then build a data model that can readily reflect that. We could then start to annotate that chart with different output opportunities – for example, as far as the map goes, it’s easy enough to imagine several layers: a current applications layer, a (current) appeals layer, a recent decisions layer, an archived decisions layer.

A process diagram would also allow us to start spotting event opportunities around which we might be able to generate alerts. For example, generating feeds that allow you to identify changes in application activity within a particular unit postcode or postcode district (ONS: UK postcode structure) or ward could act as the basis of a simple alerting mechanism. It’s then easy enough to set up an IFTTT feed to email pipe, though longer term an “onsite” feed to email subscription service would allow for a more local service. (Is there a WordPress plugin that lets logged in users generate multiple email subscriptions to different feeds?)

In terms of other value-adds that arise from processing the data, I can think of a few… For example, keeping track of repeated applications to the same property, analysing which agents are popular in terms of applications (and perhaps generating a league table of success rates!), linkage to other location based services (for example, license applications or prices paid data) and so on.

Takes foot off spade and stops looking into the future, surveys weeds and half dug hole…;-)

Things I Take for Granted #287 – Grabbing Stuff from Web Form Drop Down Lists

Over the years, I’ve collected lots of little hacks for tinkering with various data sets. Here’s an example…

A form on a web page with country names that map to code values:


If we want to generate a two column look up table from the names on the list to the values that encode them, we can look at the HTML source, grab the list of option elements, then use a regular expression to extract the names and values and rewrite them as a two column, tab separated text file, with one item per line:

regexp form extractor

NOTE: the last character in the replace is \n (newline character). I grabbed the screenshot while the cursor was blinking on it :-(
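
For reference, a find and replace pair along the following lines does the job in most regex-capable editors (the exact pattern used in the screenshot may have differed slightly):

Find:    <option value="([^"]*)"[^>]*>([^<]*)</option>
Replace: \2\t\1\n

The first capture group grabs the option’s value attribute and the second grabs the displayed country name; the replacement writes them out name first, tab separated, one item per line.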

A Google Spreadsheets View Over DWP Tabulation Tool 1-Click Time Series Data

Whilst preparing for an open data training session for New Economy in Manchester earlier this week, I was introduced to the DWP tabulation tool that provides a quick way of analysing various benefits and allowances related datasets, including bereavement benefits, incapacity benefit and employment and support allowance.

The tool supports the construction of various data views as well as providing 1-click link views over “canned” datasets for each category of data.


The data is made available in the form of an HTML data table via a static URL (example):


To simplify working with data, we can import the data table directly into Google spreadsheets using the importHTML() formula, which allows you to specify a URL, and then import a specified HTML data table from that page. In the following example, the first table from a results page – that contains the description of the table – is imported into cell A1, and the actual datatable (table 2) is imported via an importhtml() formula specified in cell A2.
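
For example, the two formulas would look something like the following, with RESULTS_PAGE_URL standing in for the static URL of the 1-click results page:

In cell A1:  =IMPORTHTML("RESULTS_PAGE_URL", "table", 1)
In cell A2:  =IMPORTHTML("RESULTS_PAGE_URL", "table", 2)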


Note that the first data row does not appear to import cleanly – inspection of the original HTML table shows why – the presence of what is presumably a split cell that declares the name of the timeseries index column along with the first time index value.

To simplify the import of these data tables into a Google Spreadsheet, we can make use of a small script to add an additional custom menu into Google spreadsheets that will import a particular dataset.


The following script shows one way of starting to construct such a set of menus:

function onOpen() {
  var ui = SpreadsheetApp.getUi();
  // Or DocumentApp or FormApp.
  ui.createMenu('DWP Tabs')
      .addSubMenu(ui.createMenu('Bereavement Benefits')
          .addItem('1-click BW/BB timeseries', 'mi_bb_b')
          .addItem('1-click Region timeseries', 'mi_bb_r')
          .addItem('1-click Gender timeseries', 'mi_bb_g')
          .addItem('1-click Age timeseries', 'mi_bb_a'))
      .addSubMenu(ui.createMenu('Incapacity Benefit/Disablement')
          .addItem('1-click Region timeseries', 'mi_ic_r'))
      .addToUi();
}

function menuActionImportTable(url){
  var ss = SpreadsheetApp.getActiveSpreadsheet();
  var sheet = ss.getSheets()[0];

  //Import the table description (table 1 on the results page) into A1
  var cell = sheet.getRange("A1");
  cell.setFormula('=IMPORTHTML("'+url+'","table",1)');
  //Import the data table itself (table 2) into A2
  cell = sheet.getRange("A2");
  cell.setFormula('=IMPORTHTML("'+url+'","table",2)');
}

//--Incapacity Benefit/Disablement
function mi_ic_r() {
  var url='';  //the 1-click results page URL goes here
  menuActionImportTable(url);
}

//-- Bereavement Benefits
function mi_bb_r() {
  var url='';
  menuActionImportTable(url);
}

function mi_bb_g() {
  var url='';
  menuActionImportTable(url);
}

function mi_bb_a() {
  var url='';
  menuActionImportTable(url);
}

function mi_bb_b() {
  var url='';
  menuActionImportTable(url);
}

Copying the above script into the script editor associated with a spreadsheet, and then reloading the spreadsheet (permissions may need to be granted to the script the first time it is run), provides a custom menu that allows the direct import of a particular dataset:


Duplicating the spreadsheet carries the script along with it (I think) and the spreadsheet can presumably also be shared… (It’s been some time since I played with apps script – I’m not sure how permissioning works or how easy it is to convert scripts to add-ons, though I note from the documentation that top-level app menus aren’t supported by add-ons.)

Lazy Regular Expressions – Splitting Out Collapsed Columns

Via a tweet, and then an email, to myself and fellow OpenRefine evangelist, Owen Stephens (if you haven’t already done so, check out Owen’s wonderful OpenRefine tutorial), Dom Fripp got in touch with a data cleaning issue he was having to contend with: a reporting system that threw out a data report in which one of the columns contained a set of collapsed columns from another report. So something rather like this:

TitleoffirstresearchprojectPeriod: 31/01/04 → 31/01/07Number of participants: 1Awarded date: 22 Aug 2003Budget Account Ref: AB1234Funding organisation: BBSRCTotal award: £123,456Principal Investigator: Goode, Johnny B.Project: Funded Project › Research project

The question was – could this be fixed using OpenRefine, with the compounded data elements split out from the single cell into separate columns of their own?

The fields that appeared in this combined column were variable (not all of them appeared in each row), but always in the same order. So for example, a total collapsed record might look like:

Funding organisation: BBSRCFunder project reference: AA/1234567/8Total award:

The full list of possible collapsed columns was: Title, School/Department, Period, Number of participants, Awarded Date, Budget Account Ref, Funding Organisation, Funder Project Reference, Total award, Reference code, Principal Investigator, Project

The pattern appeared to be Column Name: value, except for the Title, where there was no colon.

On occasion, a row would contain an exceptional item that did not conform to the pattern:


One way to split out the columns is to use a regular expression. We can parse a column using the “Add column based on this column” action:


If all the columns always appeared in the same order, we could write something like the following GREL regular expression to match each column and its associated value:

value.match(/(Title.*)(Period.*)(Number of participants:.*)(Awarded date.*)(Budget Account Ref:.*)(Funding organisation.*)(Total award.*)(Principal Investigator:.*)(Project:.*)/)


To cope with optional elements that don’t appear in our sample (for example, (School\/Department.*)), we need to make each group optional by qualifying it with a ?.

value.match(/(Title.*)?(School\/Department.*)?(Period.*)?(Number of participants:.*)?(Awarded date.*)?(Budget Account Ref:.*)?(Funding organisation.*)?(Funder project reference.+?)?(Total award.*)?(Principal Investigator:.*)?(Project:.*)?/)


However, as the above example shows, using the greedy .* operator means we match everything in the first group. So instead, we need to use a lazy evaluation to match items within a group: .+?

value.match(/(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)


So far so good – but how do we cope with cells that do not start with one of our recognised patterns? This time we need to start by matching anything that is not the expected first pattern in our list:

value.match(/((?!(?:Title)).*)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/)


Having matched the groups, how do we split the relevant items into new columns? One way is to introduce a column separator character sequence (such as ::) that we can split on:

forEach(value.match(/((?!(?:Title)).*?)?(Title.+?)?(School\/Department.+?)?(Period.+?)?(Number of participants:.+?)?(Awarded date.+?)?(Budget Account Ref:.+?)?(Funding organisation.+?)?(Funder project reference.+?)?(Total award.+?)?(Principal Investigator:.+?)?(Project:.+?)?/),v,if(v == null," ",v)).join('::')


This generates rows of the form:


We can now split these cells into several columns:


We use the :: sequence as the separator:


Once split, the columns should be regularly arranged. Any “rogue” items should appear in the first new column – values appearing in that column might be used to help us identify any further tweaks required to our regular expression.


We now need to do a little more cleaning. For example, tidying up column names:


And then cleaning down each new column to remove the column heading.


As a general pattern, use the column name and an optional colon (NOTE: expression should be :? rather than :+):
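
For example, a transform expression along the following lines (shown here for one of the column names from the list above) should strip the heading text from the start of each cell in that column:

value.replace(/^Funding organisation:? */,'')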


To reuse this pattern of operations on future datasets, we can export a description of the transformations applied. Future datasets can then be loaded into OpenRefine, the operation history pasted in, and the same steps applied. (The following screenshot does not show the operations defined for renaming the new columns or cleaning them down.)


As ever, writing up this post took as long as working out the recipe…

PS Hmmm, I wonder… One way of generalising this further might be to try to match the columns in any order…? Not sure my regexp foo is up to that just at the moment. Any offers?!;-)