Using Vagrant to Launch OpenRefine Running in a Linux VM on Linode

A few days ago I saw Jim Groom having fun getting Sandstorm.io running on VM host Linode. I haven’t tried running full VMs on remote servers yet, so I thought I’d have a quick look to see if I could get some chunks of the TM351 VM running on a Linode box using Vagrant.

I chickened out of just running the whole set of course VM build puppet scripts (they really need tidying up and the commented out files clearing away), but instead thought I’d start down a path of trying to reuse the simpler bash files I used for the latest docker attempt.

Cribbing from the Linode docs on Using Vagrant to Manage Linode Environments, I logged into Linode and got myself a shortlived API key, and then from a desktop terminal installed the vagrant-linode plugin – vagrant plugin install vagrant-linode – and got some keys (accepting the defaults): ssh-keygen -b 4096.

Here’s a simple Vagrantfile that launches an Ubuntu Trusty server on Linode and pops a copy of OpenRefine inside it…

Vagrant.configure(2) do |config|
  ## SSH Configuration
  config.ssh.username = 'user'
  config.ssh.private_key_path = '~/.ssh/id_rsa'

  #Server config
  config.vm.provider :linode do |provider, override|
    override.vm.box = 'linode'
    override.vm.box_url = " linode/raw/master/box/"

    #Linode Settings
    provider.token = 'MY_API_TOKEN'
    provider.distribution = 'Ubuntu 14.04 LTS'
    provider.datacenter = 'london'
    provider.plan = '2048'
    provider.label = 'vagrant-ubuntu-lts'
  end

  #Install dependencies and OpenRefine
  config.vm.provision "shell", inline: <<-SHELL
    apt-get clean -y && apt-get -y update && apt-get -y upgrade && apt-get install -y wget unzip openjdk-7-jre-headless && apt-get clean -y
    cd /opt
    #Download OpenRefine
    wget --progress=bar:force -q --no-check-certificate
    #Unpack OpenRefine and tidy up
    tar -xzf openrefine-linux-2.6-rc1.tar.gz  && rm openrefine-linux-2.6-rc1.tar.gz
    mkdir /mnt/refine
  SHELL

  #Start OpenRefine
  config.vm.provision "shell", inline: <<-SH
    /opt/openrefine-2.6-rc1/refine -i -d /mnt/refine
  SH
end

Running vagrant up creates a Linode node, builds the VM and gets OpenRefine running. Running vagrant destroy deletes the node; keeping the node running, or popping it into a suspended mode, leaves the meter running.

Using Jupyter Notebooks to Define Literate APIs

Part of the vision behind the Jupyter notebook ecosystem seems to be the desire to create a literate computing infrastructure that supports “the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components” (Fernando Perez, “Literate computing” and computational reproducibility: IPython in the age of data-driven journalism, 19/4/13).

The notebook approach complements other live document approaches such as the use of Rmd in applications such as RStudio, providing an interactive, editable rendered view of the live document, including inlined outputs, rather than just the source code view.

Notebooks don’t just have to be used for analysis though. A few months ago, I spotted a notebook being used to configure a database system, db-introspection-notebook – my gut reaction to which was to ponder Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?. (A problem with that approach, of course, is that it requires notebook machinery to get started, whereas you might typically want to run configuration scripts in as bare bones a system as possible.)

Another post that caught my eye last week was Jupyter Notebooks as RESTful Microservices, which uses notebooks to define an API using a new Jupyter Kernel Gateway:

[a] web server that supports different mechanisms for spawning and communicating with Jupyter kernels, such as:

  • A Jupyter Notebook server-compatible HTTP API for requesting kernels and talking the Jupyter kernel protocol with them over Websockets
  • A[n] HTTP API defined by annotated notebook cells that maps HTTP verbs and resources to code to execute on a kernel

Tooling to support the creation of a literate API then, that fully respects Fernando Perez’ description of literate computing?!

At first glance it looks like all the API functions need to be defined within a single notebook – the notebook run by the kernel gateway. But another Jupyter project in incubation allows notebooks to be imported into other notebooks, as this demo shows: Notebooks as Reusable Modules and Cookbooks. Which means that a parent API defining notebook could pull in dependent child notebooks that each define a separate API call.

And because the Jupyter server can talk to a wide range of language kernels, this means the API can be implemented using an increasing range of languages (though I think that all the calls will need to be implemented using the same language kernel?). Indeed, the demo code has notebooks showing how to define notebook powered APIs in python and R.
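To give a flavour of what an annotated cell might look like, here is a minimal sketch in Python. It assumes the kernel gateway's notebook-http mode, where (as far as I understand it) a comment such as # GET /hello/:name marks the cell as a handler, the gateway injects a JSON string called REQUEST, and whatever the cell prints becomes the response. The hello_handler function, the /hello/:name route and the faked REQUEST value are made up for illustration so the cell can run on its own.

```python
import json


def hello_handler(request_json):
    """Build a JSON response body for a GET /hello/:name style request."""
    req = json.loads(request_json)
    name = req["path"]["name"]  # path parameters are assumed to arrive under "path"
    return json.dumps({"message": "hello " + name})


# Stand-in for the REQUEST string the gateway would inject (hypothetical value)
REQUEST = json.dumps({"path": {"name": "world"}, "args": {}, "body": ""})

# GET /hello/:name
print(hello_handler(REQUEST))
```

Keeping the logic in a plain function means the same cell can be exercised in an ordinary notebook session before it is ever served through the gateway.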


See also: What’s On the Horizon in the Jupyter Ecosystem?

Using Spreadsheets That Generate Textual Summaries of Data – HSCIC

Having a quick peek at a dataset released today by the HSCIC on Accident and Emergency Attendances in England – 2014-15, I noticed that the frontispiece worksheet allowed you to compare the performance of two trusts with each other as well as against a national average. What particularly caught my eye was that the data for each was presented in textual form:


In particular, a cell formula is used to construct a templated sentence, using the selected item as a key for a lookup across tables in the other sheets:

=IF(AND($H$63="*",$H$66="*"),"• Attendances by gender have been suppressed for this provider.",IF($H$63="*","• Males attendance data has been suppressed. Females accounted for "&TEXT(Output!$H$67,"0.0%")&" (or "&TEXT($H$66,"#,##0")&") of all attendances.",IF($H$66="*","• Males accounted for "&TEXT(Output!$H$64,"0.0%")& " (or "&TEXT($H$63,"#,##0")&") of all attendances. Female attendance data has been suppressed.","• Males accounted for "&TEXT(Output!$H$64,"0.0%")& " (or "&TEXT($H$63,"#,##0")&") of all attendances, while "&TEXT(Output!$H$67,"0.0%")&" (or "&TEXT($H$66,"#,##0")&") were female.")))

For each worksheet, it’s easy enough to imagine a textual generator that maps a particular row (that is, the data for a particular NHS trust, for example) to a sentence or two (as per Writing Each Row of a Spreadsheet as a Press Release?).

Having written a simple sentence generator for one row, more complex generators can also be created that compare the values across two rows directly, giving constructions of the form The w in x for y was z, compared to r in p for q, for example.
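As a rough sketch of the idea, the following Python mirrors the branching in the spreadsheet formula above (with "*" standing for a suppressed value) and adds a simple two-row comparison sentence. The function names, the dictionary keys and the example figures are all invented for illustration; they are not taken from the HSCIC workbook.

```python
def pct(x):
    """Format a proportion like the spreadsheet's "0.0%" pattern."""
    return f"{x:.1%}"


def count(n):
    """Format a whole number like the spreadsheet's "#,##0" pattern."""
    return f"{n:,}"


def gender_sentence(males, females, male_share, female_share):
    """Mirror the worksheet formula; "*" marks a suppressed count."""
    if males == "*" and females == "*":
        return "• Attendances by gender have been suppressed for this provider."
    if males == "*":
        return ("• Males attendance data has been suppressed. Females accounted for "
                f"{pct(female_share)} (or {count(females)}) of all attendances.")
    if females == "*":
        return (f"• Males accounted for {pct(male_share)} (or {count(males)}) "
                "of all attendances. Female attendance data has been suppressed.")
    return (f"• Males accounted for {pct(male_share)} (or {count(males)}) "
            f"of all attendances, while {pct(female_share)} "
            f"(or {count(females)}) were female.")


def compare_sentence(measure, a, b):
    """The w in x for y was z, compared to r in p for q."""
    return (f"The {measure} in {a['period']} for {a['name']} was {a['value']:,}, "
            f"compared to {b['value']:,} in {b['period']} for {b['name']}.")


# Made-up rows, purely to exercise the templates
print(gender_sentence(1200, 1300, 0.48, 0.52))
print(compare_sentence(
    "number of attendances",
    {"name": "Trust A", "period": "2014-15", "value": 120000},
    {"name": "Trust B", "period": "2013-14", "value": 98500},
))
```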

So I wonder, has HSCIC been doing this for some time, and I just haven’t noticed? How about ONS? And are they also running data powered conversational Slack bots too?

Going Round in Circles… or Iterating?

Listening to F1 technical pundit Gary Anderson on a 2014 panel (via Joe Saward) about lessons from F1 for business, I was struck by his comment that “motor racing is about going round in circles… drivers go round in circles all day long”, trying to improve lap on lap:

Each time round is another chance to improve, not just for the driver but for the teams, particularly during practice sessions, where real time telemetry allows the team to offer suggested changes as the car is on track, and pit stops allow physical (and computational?) changes to be made to the car.

Each lap is another iteration. Each stint is another iteration. Each session is another iteration. (If you only get 20 laps in a session, that could still give you twenty useful iterations, twenty chances to change something to see if it makes a useful difference.) Each race weekend is another iteration. Each season is another iteration.

Each iteration gives you a chance to try something new and compare it with what you’ve done before.

Who else iterates? Google does. Google (apparently) runs experiments all the time. Potentially, every page impression is another iteration to test the efficacy of their search engine results in terms of converting searchers into revenue generating clickers.

But the thing about iteration is that changes might have negative effects too, which is one reason why you need to iterate fast and often.

But business processes often appear to act as a brake on such opportunities.

Which is why I’ve learned to be very careful writing anything down… because organisations that have had time to build up an administration and a bureaucracy seem tempted to treat things that are written down as somehow fixed (even if those things are written down in socially editable documents (woe betide anyone who changes what you added to the document…)); things that are written down become STOPs in the iteration process. Things that are written down become cast in stone… become things that force you to go round in circles, rather than iterating…

Running Docker Container Compositions in the Cloud on Digital Ocean

With TM351 about to start, I thought I’d revisit the docker container approach to delivering the services required by the course to see if I could get something akin to the course software running in the cloud.

Copies of the dockerfiles used to create the images can be found on Github, with prebuilt images available on dockerhub (*).

A couple of handy ones out of the can are:

That said, at the current time, the images are not intended for use as part of the course.

The following docker-compose.yml file will create a set of linked containers that resemble (ish!) the monolithic VM we distributed to students as a Virtualbox box.

dockerui:
    container_name: tm351-dockerui
    image: dockerui/dockerui
    ports:
        - "35183:9000"
    volumes:
        - /var/run/docker.sock:/var/run/docker.sock
    privileged: true

devmongodata:
    container_name: tm351-devmongodata
    command: echo mongodb_created
    #Share same layers as the container we want to link to?
    image: mongo:3.0.7
    volumes:
        - /data/db

mongodb:
    container_name: tm351-mongodb
    image: mongo:3.0.7
    ports:
        - "27017:27017"
    volumes_from:
        - devmongodata
    command: --smallfiles

mongodb-seed:
    container_name: tm351-mongodb-seed
    image: psychemedia/ou-tm351-mongodb-simple-seed
    links:
        - mongodb

devpostgresdata:
    container_name: tm351-devpostgresdata
    command: echo created
    image: busybox
    volumes:
        - /var/lib/postgresql/data

postgres:
    container_name: tm351-postgres
    image: psychemedia/ou-tm351-postgres
    ports:
        - "5432:5432"

openrefine:
    container_name: tm351-openrefine
    image: psychemedia/tm351-openrefine
    ports:
        - "35181:3333"
    privileged: true

notebook:
    container_name: tm351-notebook
    #build: ./tm351_scipystacknserver
    image: psychemedia/ou-tm351-pystack
    ports:
        - "35180:8888"
    links:
        - postgres:postgres
        - mongodb:mongodb
        - openrefine:openrefine
    privileged: true

Place a copy of the docker-compose.yml file somewhere convenient. Then, from Kitematic, open the command line, cd into the directory containing the YAML file, and enter the command docker-compose up -d – the images are on dockerhub and should be downloaded automatically.

Refer back to Kitematic and you should see running containers – the settings panel for the notebooks container shows the address you can find the notebook server at.


The notebooks and OpenRefine containers should also be linked to shared folders in the directory you ran the Docker Compose script from.

Running the Containers in the Cloud – Docker-Machine and Digital Ocean

As well as running the linked containers on my own machine, my real intention was to see how easy it would be to get them running in the cloud and using just the browser on my own computer to access them.

And it turns out to be really easy. The following example uses cloud host Digital Ocean.

To start with, you’ll need a Digital Ocean account with some credit in it and a Digital Ocean API token:


(You may be able to get some Digital Ocean credit for free as part of the Github Education Student Developer Pack.)

Then it’s just a case of a few command line instructions to get things running using Docker Machine:

docker-machine ls
#kitematic uses: default

#Create a droplet on Digital Ocean
docker-machine create -d digitalocean --digitalocean-access-token YOUR_ACCESS_TOKEN --digitalocean-region lon1 --digitalocean-size 4gb ou-tm351-test 

#Check the IP address of the machine
docker-machine ip ou-tm351-test

#Display information about the machine
docker-machine env ou-tm351-test
#This returns necessary config details
#For example:
##export DOCKER_TLS_VERIFY="1"
##export DOCKER_HOST="tcp://IP_ADDRESS:2376"
##export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"
##export DOCKER_MACHINE_NAME="ou-tm351-test"
# Run this command to configure your shell: 
# eval $(docker-machine env ou-tm351-test)

#Set the environment variables as recommended
export DOCKER_HOST="tcp://IP_ADDRESS:2376"
export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"

#Run command to set current docker-machine
eval "$(docker-machine env ou-tm351-test)"

#If the docker-compose.yml file is in .
docker-compose up -d
#This will launch the linked containers on Digital Ocean

#The notebooks should now be viewable at:

#OpenRefine should now be viewable at:

#To stop the machine
docker-machine stop ou-tm351-test
#To remove the Digital Ocean droplet (so you stop paying for it...)
docker-machine rm ou-tm351-test

#Reset the current docker machine to the Kitematic machine
eval "$(docker-machine env default)"
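The docker-machine env output shown in the comments above is just a series of shell export lines, which is why eval can apply it directly. As a small sketch of what that step is doing, the following Python picks the settings out of that text; the parse_env_exports helper is hypothetical, and the sample text simply reuses the placeholder values from the listing above.

```python
import re

# Matches lines of the form: export NAME="value"
EXPORT_LINE = re.compile(r'^export\s+(\w+)="([^"]*)"\s*$')


def parse_env_exports(text):
    """Collect NAME -> value pairs from export lines, ignoring comments."""
    settings = {}
    for line in text.splitlines():
        match = EXPORT_LINE.match(line.strip())
        if match:
            settings[match.group(1)] = match.group(2)
    return settings


# Placeholder output copied from the listing above, not real credentials
sample = '''export DOCKER_TLS_VERIFY="1"
export DOCKER_HOST="tcp://IP_ADDRESS:2376"
export DOCKER_CERT_PATH="/Users/YOUR-USER/.docker/machine/machines/ou-tm351-test"
export DOCKER_MACHINE_NAME="ou-tm351-test"
# Run this command to configure your shell:
# eval $(docker-machine env ou-tm351-test)'''

settings = parse_env_exports(sample)
print(settings["DOCKER_HOST"])
```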

So that’s a start. Issues still arise in terms of persisting state, such as the database contents, notebook files* and OpenRefine projects: if you leave the containers running on Digital Ocean to persist the state, the meter will keep running.

(* I should probably also build a container that demos how to bake a few example notebooks into a container running the notebook server and TM351 python distribution.)

Programmer Employability – Would HE Prepare You for a Guardian Developer Job Interview?

A post on the Guardian developer blog – The Guardian’s new pairing exercises – describes part of the recruitment process used by the Guardian when appointing new developers: candidates are paired with a Guardian developer and set an exercise that assesses “how they approach solving a problem and structure code”.

Originally, all candidates were set the same exercise, but to try to reduce fatigue (in the sense of a loss of interest, enthusiasm or engagement) with the same task by the Guardian developers engaged in the pairing activity, a wider range of exercises was developed that the paired developer can choose from when it’s their turn to work with a candidate.

The exercises used in the process are publicly available – github: guardian/pairing-tests.

Candidates can prepare for the test if they wish but as there are so many tests it is possible to be given an exercise not seen before. It also gives an idea of the skills the Guardian is looking for, the problems do not test knowledge on method names of a language of choice but instead focus on solving a problem in manageable parts over an hour.

One example is to implement a set of rules as per Conway’s Game of Life:

Requirement Cards:

  • A cell can be made “alive”
  • A cell can be “killed”
  • A cell with fewer than two live neighbours dies of under-population
  • A cell with 2 or 3 live neighbours lives on to the next generation
  • A cell with more than 3 live neighbours dies of overcrowding
  • An empty cell with exactly 3 live neighbours “comes to life”
  • The board should wrap
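By way of illustration, here is one way the requirement cards above might be sketched in Python. It is just a first pass of my own, not a model answer from the Guardian repository: the board is a list of lists of 0s and 1s, and the function names are my own invention.

```python
def make_alive(board, row, col):
    """A cell can be made "alive"."""
    board[row][col] = 1


def kill(board, row, col):
    """A cell can be "killed"."""
    board[row][col] = 0


def live_neighbours(board, row, col):
    """Count the eight neighbours, wrapping round the board edges."""
    rows, cols = len(board), len(board[0])
    return sum(
        board[(row + dr) % rows][(col + dc) % cols]
        for dr in (-1, 0, 1)
        for dc in (-1, 0, 1)
        if (dr, dc) != (0, 0)
    )


def step(board):
    """Return the next generation as a new board."""
    new_board = []
    for r, row in enumerate(board):
        new_row = []
        for c, cell in enumerate(row):
            n = live_neighbours(board, r, c)
            if cell:
                # Fewer than 2 dies (under-population), more than 3 dies (overcrowding)
                new_row.append(1 if n in (2, 3) else 0)
            else:
                # An empty cell with exactly 3 live neighbours comes to life
                new_row.append(1 if n == 3 else 0)
        new_board.append(new_row)
    return new_board


# A horizontal "blinker" on a 5x5 board
board = [[0] * 5 for _ in range(5)]
for col in (1, 2, 3):
    make_alive(board, 2, col)

for row in step(board):
    print("".join("#" if cell else "." for cell in row))
```

The modulo arithmetic in live_neighbours is what makes the board wrap; it assumes a board of at least 3 by 3 cells, since on anything smaller the same cell would be counted as a neighbour more than once.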

Others include the parsing and preliminary analysis of an election results dataset, or a set of controls for a simple robot.

So this got me wondering about a few things…:

  • how does this sort of activity design compare with the sort of assessment activity we give to students as part of a course?
  • how does the assessment of the exercise – in terms of what the recruiter learns about the candidate’s problem solving, programming/coding and interpersonal skills – compare with the assessment design and marking guides we use in HE, and by extension, the sort of assessment we train students up for?
  • how comfortable would a recent graduate be taking part in a paired exercise?

It also made me think again how unemployable I am!;-)

In passing, I also note that if you take a peek behind the Guardian homepage, they’re still running their developer ad there:


But how many graduates would think to look? (I seem to remember that sort of ad made me laugh out loud the first time I saw one whilst rooting through someone’s web page trying to figure out how they’d done something or other…)

(Hmmm… thinks: could that make the basis of an exercise – generate an ASCII-art text banner for an arbitrary phrase? Or emoji?! How would *I* go about doing that (other than just installing a python package that already does it)?! And if I did come up with an exercise and put in a pull request, or made a contribution to one of their other projects on github, would it land me an interview?!;-)

The Rise of Transparent Data Journalism – The BuzzFeed Tennis Match Fixing Data Analysis Notebook

The news today was led in part by a story broken by the BBC and BuzzFeed News – The Tennis Racket – about match fixing in Grand Slam tennis tournaments. (The BBC contribution seems to have been done under the ever listenable File on Four: Tennis: Game, Set and Fix?)

One interesting feature of this story was that “BuzzFeed News began its investigation after devising an algorithm to analyse gambling on professional tennis matches over the past seven years”, backing up evidence from leaked documents with “an original analysis of the betting activity on 26,000 matches”. (See also: How BuzzFeed News Used Betting Data To Investigate Match-Fixing In Tennis, and an open access academic paper that inspired it: Rodenberg, R. & Feustel, E.D. (2014), Forensic Sports Analytics: Detecting and Predicting Match-Fixing in Tennis, The Journal of Prediction Markets, 8(1).)

Feature detecting algorithms such as this (where the feature is an unusual betting pattern) are likely to play an increasing role in the discovery of stories from data, step 2 in the model described in this recent Tow Center for Digital Journalism Guide to Automated Journalism:


See also: Notes on Robot Churnalism, Part I – Robot Writers

Another interesting aspect of the story behind the story was the way in which BuzzFeed News opened up the analysis they had applied to the data. You can find it described on Github – Methodology and Code: Detecting Match-Fixing Patterns In Tennis – along with the data and a Jupyter notebook that includes the code used to perform the analysis: Data and Analysis: Detecting Match-Fixing Patterns In Tennis.


You can even run the notebook to replicate the analysis yourself, either by downloading it and running it using your own Jupyter notebook server, or by using the online mybinder service: run the tennis analysis yourself on

(I’m not sure if the BuzzFeed or BBC folk tried to do any deeper analysis, for example poking into point summary data as captured by the Tennis Match Charting Project? See also this Tennis Visuals project that makes use of the MCP data. Tennis betting data is also collected here: If you’re into the idea of analysing tennis stats, this book is one way in: Analyzing Wimbledon: The Power Of Statistics.)

So what are these notebooks anyway? They’re magic, that’s what!:-)

The Jupyter project is an evolution of an earlier IPython (interactive Python) project that included a browser based notebook style interface for allowing users to write and execute code, as well as seeing the result of executing the code, a line at a time, all in the context of a “narrative” text document. The Jupyter project funding proposal describes it thus:

[T]he core problem we are trying to solve is the collaborative creation of reproducible computational narratives that can be used across a wide range of audiences and contexts.

[C]omputation in science is ultimately in service of a result that needs to be woven into the bigger narrative of the questions under study: that result will be part of a paper, will support or contest a theory, will advance our understanding of a domain. And those insights are communicated in papers, books and lectures: narratives of various formats.

The problem the Jupyter project tackles is precisely this intersection: creating tools to support in the best possible ways the computational workflow of scientific inquiry, and providing the environment to create the proper narrative around that central act of computation. We refer to this as Literate Computing, in contrast to Knuth’s concept of Literate Programming, where the emphasis is on narrating algorithms and programs. In a Literate Computing environment, the author weaves human language with live code and the results of the code, and it is the combination of all that produces a computational narrative.

At the heart of the entire Jupyter architecture lies the idea of interactive computing: humans executing small pieces of code in various programming languages, and immediately seeing the results of their computation. Interactive computing is central to data science because scientific problems benefit from an exploratory process where the results of each computation inform the next step and guide the formation of insights about the problem at hand. In this Interactive Computing focus area, we will create new tools and abstractions that improve the reproducibility of interactive computations and widen their usage in different contexts and audiences.

The Jupyter notebooks include two types of interactive cell – editable text cells into which you can write simple markdown and HTML text that will be rendered as text; and code cells into which you can write executable code. Once executed, the results of that execution are displayed as cell output. Note that the output from a cell may be text, a datatable, a chart, or even an interactive map.
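Under the hood a notebook file is just a JSON document listing those cells in order. As a rough sketch, the following Python builds a two-cell notebook by hand using the field names from version 4 of the notebook format as I understand it; the cell contents are invented, and in practice you would more likely let the notebook server or the nbformat package write the file for you.

```python
import json

# A hand-rolled notebook: one markdown (text) cell and one code cell
notebook = {
    "nbformat": 4,
    "nbformat_minor": 4,
    "metadata": {},
    "cells": [
        {
            "cell_type": "markdown",
            "metadata": {},
            "source": ["# A tiny computational narrative\n",
                       "Some *markdown* text explaining the next step."],
        },
        {
            "cell_type": "code",
            "metadata": {},
            "execution_count": None,  # not yet executed
            "outputs": [],            # filled in once the cell has been run
            "source": ["total = sum(range(10))\n", "total"],
        },
    ],
}


def cell_types(nb):
    """List the type of each cell, in document order."""
    return [cell["cell_type"] for cell in nb["cells"]]


print(cell_types(notebook))
print(json.dumps(notebook, indent=1)[:60])
```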

One of the nice things about the Jupyter notebook project is that the executable cells are connected via the Jupyter server to a programming kernel that executes the code. An increasing number of kernels are supported (e.g. for R, Javascript and Java as well as Python) so once you hook in to the Jupyter ecosystem you can use the same interface for a wide variety of computing tasks.

There are multiple ways of running Jupyter notebooks, including the mybinder approach described above – I describe several of them in the post Seven Ways of Running IPython Notebooks.

As well as having an important role to play in reproducible data journalism and reproducible (scientific) research, notebooks are also a powerful, and expressive, medium for teaching and learning. For example, we’re just about to start using Jupyter notebooks, delivered via a virtual machine, for the new OU course Data management and analysis.

We also used them in the FutureLearn course Learn to Code for Data Analysis, showing how code could be used a line at a time to analyse a variety of open datasets from sources such as the World Bank Indicators database and the UN Comtrade (import/export data) database.

PS for sports data fans, here’s a list of data sources I started to compile a year or so ago: Sports Data and R – Scope for a Thematic (Rather than Task) View? (Living Post).