New Version of “Wrangling F1 Data With R” Just Released…

So I finally got round to pushing a revised (and typo corrected!) version of Wrangling F1 Data With R: A Data Junkie’s Guide, that also includes a handful of new sections and chapters, including descriptions of how to detect undercuts, the new style race history chart that shows the on-track position of each driver for each lap of a race relative to the lap leader, and a range of qualifying session analysis charts that show the evolution of session cut off times and drivers’ personal best times.

Code is described for every data manipulation and chart that appears in the book, along with directions for how to get hold of (most of) the data required to generate the charts. (Future updates will cover some of the scraping techniques required to get hold of the rest of it!)

As well as the simple book, there’s also a version bundled with the R code libraries that are loaded in as a short-cut in many of the chapters.

The book is published on Leanpub, which means you can access several different electronic versions of the book, and once you’ve bought a copy, you get access to any future updates for no further cost…

There is a charge on the book, with a set minimum price, but you also have the freedom to pay more! Any monies received for this book go to cover costs (I’ve started trying to pay for the webservices I use, rather than just keep using their free plan). If the monthly receipts bump up a little, I’ll try to get some services that generate some of the charts interactively hosted somewhere…

Using Vagrant to Launch OpenRefine Running in a Linux VM on Linode

A few days ago I saw Jim Groom having fun getting Sandstorm.io running on VM host Linode. I haven’t tried running full VMs on remote servers yet, so I thought I’d have a quick look to see if I could get some chunks of the TM351 VM running on a Linode box using Vagrant.

I chickened out of just running the whole set of course VM build puppet scripts (they really need tidying up and the commented out files clearing away), but instead thought I’d start down a path of trying to reuse the simpler bash files I used for the latest docker attempt.

Cribbing from the Linode docs on Using Vagrant to Manage Linode Environments, I logged into Linode and got myself a shortlived API key, and then from a desktop terminal installed the vagrant-linode plugin – vagrant plugin install vagrant-linode – and got some keys (accepting the defaults): ssh-keygen -b 4096.

Here’s a simple Vagrantfile that launches an Ubuntu Trusty server on Linode and pops a copy of OpenRefine inside it…

Vagrant.configure(2) do |config|
  ## SSH Configuration
  config.ssh.username = 'user'
  config.ssh.private_key_path = '~/.ssh/id_rsa'

  #Server config
  config.vm.provider :linode do |provider, override|
    override.vm.box = 'linode'
    #NB the box URL below is incomplete; use the full URL from the vagrant-linode docs
    override.vm.box_url = " linode/raw/master/box/"

    #Linode Settings
    provider.token = 'MY_API_TOKEN'
    provider.distribution = 'Ubuntu 14.04 LTS'
    provider.datacenter = 'london'
    provider.plan = '2048'
    provider.label = 'vagrant-ubuntu-lts'
  end

  #Install Java and OpenRefine
  config.vm.provision "shell", inline: <<-SHELL
    apt-get clean -y && apt-get -y update && apt-get -y upgrade && apt-get install -y wget unzip openjdk-7-jre-headless && apt-get clean -y
    cd /opt
    #Download OpenRefine (NB the download URL is missing from this line)
    wget --progress=bar:force -q --no-check-certificate
    #Unpack OpenRefine and tidy up
    tar -xzf openrefine-linux-2.6-rc1.tar.gz  && rm openrefine-linux-2.6-rc1.tar.gz
    mkdir /mnt/refine
  SHELL

  #Start OpenRefine, listening on all interfaces so it can be reached from outside the node
  config.vm.provision "shell", inline: <<-SH
    /opt/openrefine-2.6-rc1/refine -i 0.0.0.0 -d /mnt/refine
  SH
end

Running vagrant up creates a Linode node, builds the VM and gets OpenRefine running. vagrant destroy deletes the node; keeping the node running, or popping it into a suspended mode, leaves the meter running.

Using Jupyter Notebooks to Define Literate APIs

Part of the vision behind the Jupyter notebook ecosystem seems to be the desire to create a literate computing infrastructure that supports “the weaving of a narrative directly into a live computation, interleaving text with code and results to construct a complete piece that relies equally on the textual explanations and the computational components” (Fernando Perez, “Literate computing” and computational reproducibility: IPython in the age of data-driven journalism, 19/4/13).

The notebook approach complements other live document approaches such as the use of Rmd in applications such as RStudio, providing an interactive, editable rendered view of the live document, including inlined outputs, rather than just the source code view.

Notebooks don’t just have to be used for analysis though. A few months ago, I spotted a notebook being used to configure a database system, db-introspection-notebook – my gut reaction to which was to ponder Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?. (A problem with that approach, of course, is that it requires notebook machinery to get started, whereas you might typically want to run configuration scripts in as bare bones a system as possible.)

Another post that caught my eye last week was Jupyter Notebooks as RESTful Microservices, which uses notebooks to define an API using a new Jupyter Kernel Gateway:

[a] web server that supports different mechanisms for spawning and communicating with Jupyter kernels, such as:

  • A Jupyter Notebook server-compatible HTTP API for requesting kernels and talking the Jupyter kernel protocol with them over Websockets
  • A[n] HTTP API defined by annotated notebook cells that maps HTTP verbs and resources to code to execute on a kernel

Tooling to support the creation of a literate API then, that fully respects Fernando Perez’ description of literate computing?!
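As a rough sketch of what one of those annotated cells might look like in Python: as I understand the kernel gateway's notebook-http mode, a comment on the first line of a cell declares the HTTP verb and resource, the gateway injects a JSON string called REQUEST before running the cell, and whatever the cell prints becomes the response body. The /hello/:person route, the handle function and the stand-in REQUEST value are all made up here so that the cell can also be run outside the gateway.

```python
# GET /hello/:person
import json

# Outside the gateway nothing injects REQUEST, so fall back to a fake one.
# (Hypothetical stand-in; the gateway would supply this JSON string itself.)
try:
    REQUEST
except NameError:
    REQUEST = json.dumps({"path": {"person": "world"}, "args": {}, "body": ""})


def handle(request_json):
    """Build the JSON response body for a GET /hello/:person request."""
    req = json.loads(request_json)
    person = req["path"]["person"]
    return json.dumps({"message": "Hello, " + person})


# Whatever the cell prints is what the gateway would hand back to the caller.
print(handle(REQUEST))
```

The narrative around a cell like this (what the call is for, what it returns, an example of calling it) would then live in the markdown cells either side of it, which is where the "literate" part comes in.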

At first glance it looks like all the API functions need to be defined within a single notebook – the notebook run by the kernel gateway. But another Jupyter project in incubation allows notebooks to be imported into other notebooks, as this demo shows: Notebooks as Reusable Modules and Cookbooks. Which means that a parent API defining notebook could pull in dependent child notebooks that each define a separate API call.

And because the Jupyter server can talk to a wide range of language kernels, this means the API can be implemented using an increasing range of languages (though I think that all the calls will need to be implemented using the same language kernel?). Indeed, the demo code has notebooks showing how to define notebook powered APIs in Python and R.


See also: What’s On the Horizon in the Jupyter Ecosystem?

Wondering if Life Would be Easier With an OU – or FutureLearn – Compute Stick…?

A few days ago I came across a project that has been looking at digital preservation, and in particular the long term archiving of “functional” digital objects, such as software applications: bwFLA — Emulation as a Service [EaaS]. (I wonder how long that site will remain there…?!)

The Emulation-as-a-Service architecture simplifies access to preserved digital assets allowing end users to interact with the original environments running on different emulators.

I’d come across the project in part via a search for examples of Docker containers being used via portable “compute sticks”. It seems that the bwFLA folk have been exploring two ways of making emulated services available: EaaS using Docker and a boot to emulation route from machine images on bootable USBs, although they don’t seem (yet) to have described a delivery system that includes a compute stick. (See a presentation on their work here.)


One of the things that struck me about the digital preservation process was the way in which things need to be preserved so that they can be run in an arbitrary future, or at least, in an arbitrary computing environment. In the OU context, where we have just started a course that shipped a set of interlinked applications to students via a virtual machine that could be run across different platforms, we’re already finding issues arising from flaky combinations of VirtualBox and Windows; what we really need to do is be shipping something that is completely self-bootable (but then, that may in turn, turn up problems?). So this got me thinking that when we design, and distribute, software to students it might make sense to think of the distribution process as an exercise in distributing preserved digital objects? (This actually has implications in a couple of senses: firstly, in terms of simply running the software: how can we distribute it so that students can run it; secondly, in terms of how we contextualise the versioning of the software – OU courses can be a couple of years in planning and five years in preservation, which means that the software we ship may be several versions behind the latest release, if the software has continued to be updated).

So if students have problems running software in virtual machines because of problems running the virtual machine container, what other solutions are there?

One way is to host the software and make it available as a service accessed via a web browser or other universal client, although that introduces two complications: firstly, the need for network access; secondly, ensuring that the browser (which is to say, browser-O/S combination?) or universal client can properly service the service…

A second way is to ship the students something bootable. This could be something like a live USB, or it could be a compute stick that comes with preinstalled software on it. In essence, we distribute the service rather than the software. (On this note, things like unikernels look interesting: just enough O/S to run the service or application you’re interested in.) There are cost implications here, of course, although the costs might scale differently depending on who pays: does the OU cover the cost of distribution (“free” to students); does the student pay at-cost and buy from the OU; does the student pay a commercial rate (eg covering their own hosting fees on a cloud service); and so on?

The means students have at their disposal for running software is also an issue. The OU used to publish different computing specification guidelines for individual courses, but now I think a universal policy applies. From discussions I’ve had with various folk, I seem to be in a minority of one suggesting that students may in the future not have access to general purpose computers onto which they can install software applications, but instead may be using netbooks or even tablet computers to do their studies. (I increasingly use a cloud host to run services I want to make use of…)

I can see there is an argument for students needing access to a keyboard to make life easier when it comes to typing up assessment returns or hacking code, and also the need for access to screen real estate to make life easier reading course materials, but I also note that increasing numbers of students seem to have access to Kindles which provide a handy second screen way of accessing materials.

(The debate about whether we issue print materials or not continues… Some courses are delivered wholly online, others still use print materials. When discussions are held about how we deliver materials, the salient points for me are: 1) where the display surface is (print is a “second screen” display surface that can be mimicked by a Kindle – and hence an electronically distributed text; separate windows/tabs on a computer screen are display surfaces within a display surface); 2) whether the display surface supports annotations (print does, beautifully); 3) the search, navigation and memory affordances of the display surface (books open to where you were last reading them, page corners can be folded, you have a sense of place/where you are in the text, and (spatial) memory of where you read things (in the book as well as on the page)); 4) where you can access the display surface (eg in the bath?); 5) whether you can arrange the spatial location of the display surface to place it in proximity to another display surface).

Print material doesn’t come without its own support issues though…

“But (computing) students need a proper computer”, goes the cry, although never really unpacked…

From my netbook browser (keyboard, touchpad, screen, internet connection, but not the ability to install and run “traditional” applications), I can, with a network connection, fire up an arbitrary number of servers in London, or Amsterdam, or Dublin, or the US, and run a wide variety of services. (We require students to have access to the internet so they can access the VLE…)

From my browser, I could connect to a Raspberry Pi, or presumably a compute stick (retailing at about £100), that could be running EaaS applications for me.

So I can easily imagine an “OU Compute Stick” – or “FutureLearn Compute Stick” – that I can connect to over wifi, that runs a Kitematic like UI that can install applications from an OU/FutureLearn container/image repository or from an inserted (micro)SD card. (For students with a “proper” computer, they’d be able to grab the containers off the card and run them on their own computer.)

At the start of their degree, students would get the compute stick; when they need to run OU/FutureLearn provided apps, they grab them from the OU-hub, or receive them in the post on an SD card (in the new course, we’ve noticed in some situations, some problems in downloading large files reliably). The compute stick would have enough computational power to run the applications, which could be accessed over wifi via a browser on a “real” computer, or a netbook (which has a keyboard), or a tablet computer, or even a mobile device. The compute stick would essentially be a completely OU managed environment, bootable, and with its own compute power. The installation problems would be reduced to finding a way for the stick to connect to the internet (and even that may not be necessary), and the student to connect to the stick.

Developing such a solution might also be of interest to the digital preservation folk…even better if the compute stick had a small screen so you could get a glimpse at least of what the application looked like. Hmm..thinks… rather than a compute stick, would shipping students a smartphone rooted to run EaaS work?! Or do smartphones have the wrong sort of processor?

Using Spreadsheets That Generate Textual Summaries of Data – HSCIC

Having a quick peek at a dataset released today by the HSCIC on Accident and Emergency Attendances in England – 2014-15, I noticed that the frontispiece worksheet allowed you to compare the performance of two trusts with each other as well as against a national average. What particularly caught my eye was that the data for each was presented in textual form:


In particular, a cell formula is used to construct a templated sentence, using the selected item as a key for a lookup across tables in the other sheets:

=IF(AND($H$63="*",$H$66="*"),"• Attendances by gender have been suppressed for this provider.",IF($H$63="*","• Males attendance data has been suppressed. Females accounted for "&TEXT(Output!$H$67,"0.0%")&" (or "&TEXT($H$66,"#,##0")&") of all attendances.",IF($H$66="*","• Males accounted for "&TEXT(Output!$H$64,"0.0%")& " (or "&TEXT($H$63,"#,##0")&") of all attendances. Female attendance data has been suppressed.","• Males accounted for "&TEXT(Output!$H$64,"0.0%")& " (or "&TEXT($H$63,"#,##0")&") of all attendances, while "&TEXT(Output!$H$67,"0.0%")&" (or "&TEXT($H$66,"#,##0")&") were female.")))
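The same branching logic is easy enough to sketch outside the spreadsheet. The following Python is a rough equivalent of the formula above, assuming the counts and shares have already been looked up for the selected provider; the function name, argument names and example numbers are mine rather than anything in the HSCIC workbook.

```python
SUPPRESSED = "*"  # marker the spreadsheet uses for suppressed values


def gender_sentence(male_count, male_share, female_count, female_share):
    """Return a bullet sentence describing attendances by gender.

    Counts may be the suppression marker "*"; shares are fractions,
    so 0.482 is rendered as 48.2%.
    """
    def fmt(count, share):
        # Mirrors TEXT(share, "0.0%") and TEXT(count, "#,##0") in the formula
        return "{:.1%} (or {:,})".format(share, count)

    if male_count == SUPPRESSED and female_count == SUPPRESSED:
        return "• Attendances by gender have been suppressed for this provider."
    if male_count == SUPPRESSED:
        return ("• Males attendance data has been suppressed. Females accounted for "
                + fmt(female_count, female_share) + " of all attendances.")
    if female_count == SUPPRESSED:
        return ("• Males accounted for " + fmt(male_count, male_share)
                + " of all attendances. Female attendance data has been suppressed.")
    return ("• Males accounted for " + fmt(male_count, male_share)
            + " of all attendances, while " + fmt(female_count, female_share)
            + " were female.")


# Made-up numbers, purely for illustration
print(gender_sentence(48200, 0.482, 51800, 0.518))
```

Each branch corresponds to one arm of the nested IF, which makes the suppression handling rather easier to read than it is in the cell formula.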

For each worksheet, it’s easy enough to imagine a textual generator that maps a particular row (that is, the data for a particular NHS trust, for example) to a sentence or two (as per Writing Each Row of a Spreadsheet as a Press Release?).

Having written a simple sentence generator for one row, more complex generators can also be created that compare the values across two rows directly, giving constructions of the form The w in x for y was z, compared to r in p for q, for example.
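A minimal sketch of that pattern in Python, assuming each row has been pulled out as a dictionary; the field names (name, period, measure, value) and the two example rows are invented for illustration and don't come from the HSCIC data.

```python
def describe(row):
    """One-row sentence of the form 'The w in x for y was z'."""
    return "The {measure} in {period} for {name} was {value:,}".format(**row)


def compare(row_a, row_b):
    """Two-row sentence: the first row in full, then '..., compared to r in p for q.'"""
    return "{}, compared to {value:,} in {period} for {name}.".format(
        describe(row_a), **row_b)


# Invented rows standing in for two lines of a worksheet
trust_a = {"name": "Trust A", "period": "2014-15",
           "measure": "number of attendances", "value": 120500}
trust_b = {"name": "Trust B", "period": "2014-15",
           "measure": "number of attendances", "value": 98250}

print(compare(trust_a, trust_b))
```

Because the comparison generator is built on top of the single-row one, changing the wording of the basic sentence changes it everywhere it is reused.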

So I wonder, has HSCIC been doing this for some time, and I just haven’t noticed? How about ONS? And are they also running data powered conversational Slack bots too?

Using Google to Look Up Where You Live via the Physical Location of Your Wifi Router

During a course team meeting today, I idly mentioned that we should be able to run a simple browser based activity involving the geolocation of a student’s computer based on Google knowing the location of their wifi router. I was challenged about the possibility of this, so I did a quick bit of searching to see if there was an easy way of looking up the MAC addresses (BSSID) of wifi access points that were in range, but not connected to:


which turned up:

The airport command with '-s' or '-I' options is useful: /System/Library/PrivateFrameworks/Apple80211.framework/Resources/airport


(On Windows, the equivalent is maybe something like netsh wlan show networks mode=bssid ???)

The second part of the jigsaw was to try to find a way of looking up a location from a wifi access point MAC address – it seems that the Google geolocation API does that out of the can:

[Screenshot: The Google Maps Geolocation API documentation page on Google Developers]

An example of how to make a call is also provided, as long as you have an API key… So I got a key and gave it a go:



Looking at the structure of the example Google calls, you can enter several wifi MAC addresses, along with signal strength, and the API will presumably triangulate based on that information to give a more precise location.
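For what it's worth, here is a rough Python sketch of how such a call might be put together, based on my reading of the API docs: a JSON body listing the access points is POSTed to the geolocate endpoint, with the API key passed as a query parameter. The MAC addresses and signal strengths are made up, the helper function names are my own, and the geolocate function is only defined here, not called, because it needs a real key.

```python
import json
import urllib.request

GEOLOCATE_URL = "https://www.googleapis.com/geolocation/v1/geolocate"


def build_payload(access_points):
    """Turn (mac_address, signal_strength_dBm) pairs into the request body."""
    return {
        # Don't fall back to IP based lookup; we only want the wifi result
        "considerIp": False,
        "wifiAccessPoints": [
            {"macAddress": mac, "signalStrength": strength}
            for mac, strength in access_points
        ],
    }


def geolocate(access_points, api_key):
    """POST the access point list and return the parsed JSON reply."""
    data = json.dumps(build_payload(access_points)).encode("utf-8")
    req = urllib.request.Request(
        GEOLOCATE_URL + "?key=" + api_key,
        data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read().decode("utf-8"))


# Made-up BSSIDs and signal strengths, just to show the shape of the request
payload = build_payload([("00:11:22:33:44:55", -43), ("66:77:88:99:aa:bb", -70)])
print(json.dumps(payload, indent=2))
```

As I understand it, a successful reply should contain a location (lat and lng) along with an accuracy radius in metres, which is all you need to drop a marker on a map.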

The geolocation API also finds locations from cell tower IDs.

So back to the idea of a simple student activity to sniff out the MAC addresses of wifi routers their computer can see from the workplace or home, and then look up the location using the Google geolocation API and pop it on a map.

Which is actually the sort of thing your browser will do when you turn on geolocation services:


But maybe when you run the commands yourself, it feels a little bit more creepy?

PS sort of very loosely related, eg in terms of trying to map spaces from signals in the surrounding aether, a technique for trying to map the insides of a room based on its audio signature in response to a click of the fingers:

Going Round in Circles… or Iterating?

Listening to F1 technical pundit Gary Anderson on a 2014 panel (via Joe Saward) about lessons from F1 for business, I was struck by his comment that in motor racing “drivers go round in circles all day long”, trying to improve lap on lap:

Each time round is another chance to improve, not just for the driver but for the teams, particularly during practice sessions, where real time telemetry allows the team to offer suggested changes as the car is on track, and pit stops allow physical (and computational?) changes to be made to the car.

Each lap is another iteration. Each stint is another iteration. Each session is another iteration. (If you only get 20 laps in a session, that could still give you fifty useful iterations, fifty chances to change something to see if it makes a useful difference.) Each race weekend is another iteration. Each season is another iteration.

Each iteration gives you a chance to try something new and compare it with what you’ve done before.

Who else iterates? Google does. Google (apparently) runs experiments all the time. Potentially, every page impression is another iteration to test the efficacy of their search engine results in terms of converting searchers into revenue generating clickers.

But the thing about iteration is that changes might have negative effects too, which is one reason why you need to iterate fast and often.

But business processes often appear to act as a brake on such opportunities.

Which is why I’ve learned to be very careful writing anything down… because organisations that have had time to build up an administration and a bureaucracy seem tempted to treat things that are written down as somehow fixed (even if those things are written down in socially editable documents (woe betide anyone who changes what you added to the document…)); things that are written down become STOPs in the iteration process. Things that are written down become cast in stone… become things that force you to go round in circles, rather than iterating…