Detecting Undercuts in F1 Races Using R

One of the things that’s been on my to-do list for some time is the automatic detection of tactical or strategic events within a race. One such event is the undercut, described by F1 journalist James Allen in the following terms (The secret of undercut and offset):

An undercut is where Driver A leads Driver B, but Driver B turns into the pits before Driver A and changes to new tyres. As Driver A is ahead, he’s unaware that this move is coming until it’s too late to react and he has passed the pit lane entry.
On fresh tyres, Driver B then drives a very fast “Out” lap from the pits. Driver A will react to the stop and pit on the next lap, but his “In” lap time will have been set on old tyres, so will be slower. As he emerges from the pit lane after his stop, Driver B is often narrowly ahead of him into the first corner.

In logical terms, we might characterise this as follows:

    • two drivers, d1 and d2: d1 !=d2;
    • d1 pits on lap X, and drives an outlap on lap X+1;
    • d1’s position on their pitlap (lap X) is greater than d2’s position on the same lap X;
    • d2 pits on lap X+1, with an outlap on lap X+2;
    • d2’s position on their outlap (lap X+2) is greater than d1’s position on the same lap X+2.

We can generalise this formulation, and try to make it more robust, by comparing positions on the lap prior to d1’s stop (lap A) with the positions on d2’s outlap (lap B):

        • two drivers, d1 and d2: d1 !=d2;
        • d1 pits on lap A+1;
        • d1’s position on their “prelap” (lap A), the lap prior to their pitlap (lap A+1), is greater than d2’s position on lap A; this condition tries to ensure that d1 is behind d2 as they enter the pit stop phase, but it misses the effect of any first lap stops (unless we add a lap 0 containing the grid positions);
        • d1’s outlap is on lap A+2;
        • d2 pits on lap B-1, within the inclusive range [lap A+2, lap A+1+N], N>=1 (that is, within N laps of d1’s stop), with an outlap on lap B; the parameter, N, allows us to test for changes of position within a pit stop window, rather than requiring that d2 pits on the lap immediately following d1’s stop;
        • d2’s position on their outlap (lap B, in the inclusive range [lap A+3, lap A+2+N]) is greater than d1’s position on the same lap B.

One way of implementing these constraints is to write a declarative style query that specifies the conditions we want the solution to meet, rather than writing a procedural program to find the answer. Using the sqldf package, we can use a SQL query to achieve just this result.

One way of writing the query is to create two situations, a and b, where situation a corresponds to a lap on which d1 stops, and situation b corresponds to the driver d2’s stop. We then capture the data for each driver in each situation, to give four data states: d1a, d1b, d2a, d2b. These states are then subjected to the conditions specified above (using N=5).

#First get laptime data from the ergast API
lapTimes=lapsData.df(2015,9)

#Now find pit times
p=pitsData.df(2015,9)

#merge pitdata with lapsdata
lapTimesp=merge(lapTimes, p, by = c('lap','driverId'), all.x=T)

#flag pit laps
lapTimesp$ps = ifelse(is.na(lapTimesp$milliseconds), F, T)

#Ensure laps for each driver are sorted
library(plyr)
lapTimesp=arrange(lapTimesp, driverId, lap)

#do an offset on the laps that are pitstops for each driver
#to set outlap flags for each driver
lapTimesp=ddply(lapTimesp, .(driverId), transform, outlap=c(FALSE, head(ps,-1)))
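#(An aside on how the offset trick above works - the values here are made up, for illustration only.
# If ps = c(FALSE, FALSE, TRUE, FALSE, FALSE) for one driver's laps in order, then
# c(FALSE, head(ps, -1)) gives FALSE FALSE FALSE TRUE FALSE,
# i.e. the lap *after* the pit stop is the one flagged as the outlap.)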

#identify lap before pit lap by reverse sorting
lapTimesp=arrange(lapTimesp, driverId, -lap)
#So we can do an offset going the other way
lapTimesp=ddply(lapTimesp, .(driverId), transform, prelap=c(FALSE, head(ps,-1)))

#tidy up - re-sort back into race order (by accumulated time)
lapTimesp=arrange(lapTimesp,acctime)

#Now we can run the SQL query
library(sqldf)
ss=sqldf('SELECT d1a.driverId AS d1, d2a.driverId AS d2,
            d1a.lap AS A, d1a.position AS d1posA, d1b.position AS d1posB,
            d2b.lap AS B, d2a.position AS d2posA, d2b.position AS d2posB
          FROM lapTimesp d1a, lapTimesp d1b, lapTimesp d2a, lapTimesp d2b
          WHERE d1a.driverId=d1b.driverId AND d2a.driverId=d2b.driverId
            AND d1a.driverId!=d2a.driverId
            AND d1a.prelap AND d1a.lap=d2a.lap AND d2b.outlap AND d2b.lap=d1b.lap
            AND (d1a.lap+3<=d1b.lap AND d1b.lap<=d1a.lap+2+5)
            AND d1a.position>d2a.position AND d1b.position < d2b.position')

For the 2015 British Grand Prix, here’s what we get:

          d1         d2  A d1posA d2posA  B d1posB d2posB
1  ricciardo      sainz 10     11     10 13     12     13
2     vettel      kvyat 13      8      7 19      8     10
3     vettel hulkenberg 13      8      6 20      7     10
4      kvyat hulkenberg 17      6      5 20      9     10
5   hamilton      massa 18      3      1 21      2      3
6   hamilton     bottas 18      3      2 22      1      3
7     alonso   ericsson 36     11     10 42     10     11
8     alonso   ericsson 36     11     10 43     10     11
9     vettel     bottas 42      5      4 45      3      5
10    vettel      massa 42      5      3 45      3      4
11     merhi    stevens 43     13     12 46     12     13

With a five lap window we have evidence that supports successful undercuts in several cases, including VET taking KVY and HUL with his early stop at lap 13+1 (KVY pitting on lap 19-1 and HUL on lap 20-1), and MAS and BOT both being taken first by HAM’s stop at lap 18+1 and then by VET’s stop at lap 42+1.

To make things easier to read, we may instead define d1a.lap+1 AS d1Pitlap and d2b.lap-1 AS d2Pitlap.
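For example, keeping the FROM and WHERE clauses exactly as they are in the query above, the SELECT clause might be reworked along the following lines (a sketch only; ss2 is just an arbitrary name):

ss2=sqldf('SELECT d1a.driverId AS d1, d2a.driverId AS d2,
            d1a.lap+1 AS d1Pitlap, d1a.position AS d1posA, d1b.position AS d1posB,
            d2b.lap-1 AS d2Pitlap, d2a.position AS d2posA, d2b.position AS d2posB
          FROM lapTimesp d1a, lapTimesp d1b, lapTimesp d2a, lapTimesp d2b
          WHERE d1a.driverId=d1b.driverId AND d2a.driverId=d2b.driverId
            AND d1a.driverId!=d2a.driverId
            AND d1a.prelap AND d1a.lap=d2a.lap AND d2b.outlap AND d2b.lap=d1b.lap
            AND (d1a.lap+3<=d1b.lap AND d1b.lap<=d1a.lap+2+5)
            AND d1a.position>d2a.position AND d1b.position < d2b.position')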

The query doesn’t guarantee that the pit stop was responsible for the change in order, but it does at least give us some prompts as to where we might look.

Tata F1 Connectivity Innovation Prize, 2015 – Mood Board Notes

It’s probably the wrong way round to do this – I’ve already done an original quick sketch, after all – but I thought I’d put together a mood board collection of images and design ideas relating to the 2015 Tata F1 Connectivity Innovation Prize to see what else is current in the world of motorsport telemetry display design just in case I do get round to entering the competition and need to bulk up the entry a bit with additional qualification…

First up, some imagery from last year’s challenge brief – an F1 timing screen; note the black background and the use of a particular colour palette:

In the following images, click through on the image to see the original link.

How about some context – what sort of setting might the displays be used in?

[Image: 20140320-img1rsr_HotSpot]

[Images: Aston-Martin-Racing-Bahrain-25, image1.img.640.medium]

From a flatvision case study of the Lotus F1 pit wall, basic requirements include:

  • sunlight readable displays to be integrated within a mobile pit wall;
  • a display bright enough to be viewed in all light conditions.

The solution included ‘9 x 24” Transflective sunlight readable monitors, featuring a 1920×1200 resolution’. So that gives some idea of the real estate available per screen.

So how about some example displays…

The following seems to come from a telemetry dash product:

[Image: telemetry_main_screen]

There’s a lot of text on that display, and what also looks like timing screen info about other cars. The rev counter uses bar segments that increase in size (something I’ve seen on a lot of other car dashboards). The numbers are big and bold, with units identifying what the value relates to.

The following chart (the engineer’s view from something called The Pitwall Project) provides an indication of tyre status in the left hand column, with steering and pedal indicators (i.e. driver actions) in the right hand column.

[Image: engineerinaction]

Here’s a view (from an unknown source) that also displays separate tyre data:

[Image: index]

Another take on displaying the wheel layout and a partial gauge view in the right hand column:

[Image: GAUGE track]

Or how about relating tyre indicator values even more closely to the host vehicle?

[Image: file-1544-005609]

This Race Technology Monitor screen uses segmented bars for the majority of the numerical displays. These give a quantised display, compared to a continuously varying numerical indicator. The display also shows historical traces, presumably of the corresponding quantities?

[Image: telem1]

The following dashes show a dial rich view compared to a more numerical display:

[Images: download (1), download]

The following sample dash seems to be a recreation for a simulation game? Note the spectrum coloured bar that has a full range outline, and the use of colour in the block colour background indicators. Note also the combined bar and label views (the label sits at the mid-point of the bar, which means it may have to straddle two differently coloured backgrounds).

[Images: RealTimeTelemetry2.3, maxresdefault]

The following Sim Racing Data Analysis display uses markers on the bars to identify notable values – max and min, perhaps? Or max and optimal?

[Image: data-analysis]

It also seems like there are a few mobile apps out there doing dashboard style displays – this one looks quite clean to me and demonstrates a range of colour and border styles:

[Images: unnamed, unnamed (2), screen480x480, unnamed (1)]

Here’s another app – and a dial heavy display style:

[Image: unnamed (3)]

Finally, some views on how to capture the history of the time series. The first one is health monitoring data – as you’d expect from health-monitoring related displays, it’s really clean looking:

[Image: FIA-seeking-real-time-human-telemetry-for-F1]

I’m guessing the time-based trace goes left to right, but for our purposes, streaming the history from right to left, with the numerical indicator essentially bleeding into the line chart display, could work?

This view shows a crude way of putting several scales onto one line chart area:

[Image: telemetory]

This curiosity is from one of my earlier experiments – “driver DNA”. For each of the four bands, lap count is on the vertical axis, distance round lap on the horizontal axis. The colour is the indicator value. The advantage of this is that you see patterns lap on lap, but the resolution of the most current value in a realtime trace might be hard to see?

[Image: china_race_data_gears]

Some time ago, The Still design agency posted a concept pitwall for McLaren Mercedes. The images are still lurking in the Google Image search cache, and include example widgets:

[Image: widgets]

and an example tyre health monitor display:

[Image: tyre-health]

To my eye, those numbers are too far apart (the display is too wide and likely occluded by the animated line charts), and the semantics of the streaming are unclear (if the stream flows from the number, new numbers will come in at the left for the left hand tyres and from the right for the right hand ones?).

And finally, an example of a typical post hoc race data capture analysis display/replay.

[Image: maxresdefault (1)]

Where do you start to look?!

PS in terms of implementation, a widget display seems sensible. Something like freeboard looks like it could provide a handy prototyping tool, or something like an R/Shiny dashboard backed up by JSON streaming support from jsonlite and HTML widgets wrapped by htmlwidgets.
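As a very minimal sketch of the widget idea (using plain shiny rather than any particular dashboarding package, and with a dummy random value standing in for a real telemetry feed):

#Minimal sketch only - a single auto-refreshing numeric indicator of the sort
#that might sit in one dashboard widget; the values are dummy data, not telemetry.
library(shiny)

ui = fluidPage(
  h4("Speed (km/h)"),
  textOutput("speed")
)

server = function(input, output, session) {
  output$speed = renderText({
    invalidateLater(250, session)   #refresh the value four times a second
    round(runif(1, 250, 320))       #dummy value standing in for a telemetry feed
  })
}

shinyApp(ui, server)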

Sports Data and R – Scope for a Thematic (Rather than Task) View? (Living Post)

Via my feeds, I noticed a package announcement today for cricketR!, a new package for analysing cricket performance data.

This got me wondering (again!) about what other sports related packages there might be out there, either in terms of functional thematic packages (to do with sport in general, or one sport in particular), or particular data packages, that either bundle up sports related data sets, or provide an API (that is, a wrapper for an official API, or a wrapper for a scraper that extracts data from one or more websites in a slightly scruffier way!).

This is just a first quick attempt, an unstructured listing that may also include data sets that are more generic than R-specific (eg CSV datafiles, or SQL database exports). I’ll try to keep this post updated as I find/hear about more packages, and also work a bit more on structuring it a little better. I really should post this as a wiki somewhere – or perhaps curate something on Github?

  • generic:
    • SportsAnalytics [CRAN]: “infrastructure for sports analysis. Anyway, currently it is a selection of data sets, functions to fetch sports data, examples, and demos”.
    • PlayerRatings [CRAN]: “schemes for estimating player or team skill based on dynamic updating. Implemented methods include Elo, Glicko and Stephenson” (via Twitter: @UTVilla)
  • athletics:
    • olympic {ade4} [Inside-R packages]: “performances of 33 men’s decathlon at the Olympic Games (1988)”.
    • decathlon {GDAdata} [CRAN]: “Top performances in the Decathlon from 1985 to 2006.” (via comments: Antony Unwin)
    • MexLJ {GDAdata} [CRAN]: “Data from the longjump final in the 1968 Mexico Olympics.” (via comments: Antony Unwin)
  • baseball:
  • basketball:
  • cricket:
  • darts:
    • darts [CRAN]: “Statistical Tools to Analyze Your Darts Game” (via comments: @MarchiMax)
  • football (American football):
  • football (soccer):
    • engsoccerdata [Github]: “a repository for complete soccer datasets, along with some built-in functions for analyzing parts of the data. Currently includes English League data, FA Cup data, Playoff data, some European leagues (Spain, Germany, Italy, Holland).”. Citation: James P. Curley (2015). engsoccerdata: English Soccer Data 1871-2015. R package version 0.1.4
    • UKSoccer {vcd} [Inside-R packages]: data “on the goals scored by Home and Away teams in the Premier Football League, 1995/6 season.”.
    • Soccer {PASWR} [Inside-R packages]: “how many goals were scored in the regulation 90 minute periods of World Cup soccer matches from 1990 to 2002”.
    • fbRanks [CRAN]: “Association Football (Soccer) Ranking via Poisson Regression: time dependent Poisson regression and a record of goals scored in matches to rank teams via estimated attack and defense strengths” (via comments: @MarchiMax)
  • golf:
  • gymnastics:
  • horse racing:
    • RcappeR [Github]: “tools to aid the analysis and handicapping of Thoroughbred Horse Racing” (via Twitter: @UTVilla)
    • rBloodstock [Github]: “datasets from Thoroughbred Bloodstock Sales, Tattersalls sales from 2010 to 2015 (incomplete)” (via Twitter: @UTVilla)
  • ice hockey:
    • nhlscrapr [CRAN]: “routines for extracting play-by-play game data for regular-season and playoff NHL games, particularly for analyses that depend on which players are on the ice”. [via comments – Triplethink]
    • hockey {gamlr} [Inside-R packages]: “information about play configuration and the players on ice (including goalies) for every goal from 2002-03 to 2012-13 NHL seasons” [via comments – Triplethink]
    • nhl-pbp [Github]: “code to parse and analyze NHL PBP data using R”.
  • motor sport:
  • skiing:
    • SpeedSki {GDAdata} [CRAN]: “World Speed Skiing Competition, Verbier 21st April, 2011.” (via comments: Antony Unwin)
  • snooker:
  • swimming:
  • tennis:
    • tennis_MatchChartingProject: “The goal of the Match Charting Project (MCP) is to amass detailed records of professional matches.”.
    • servevolleyR [Github]: “R package for simulating tennis points:games:tiebreaks:sets:matches” (via Twitter: @UTVilla)

It would perhaps make more sense to try to collect rather more structured (meta)data for each package. For example: homepage, sport/discipline; analysis, data (package or API), or analysis and data; if data: year-range, source, data coverage (e.g. table column headings); if analysis, brief synopsis of tools available (e.g. chart generators).
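As a crude sketch of what that might look like in R (the column choices and example rows here are just illustrative, lifted from the listing above, not a definitive schema):

#A minimal sketch of how the package (meta)data might be structured as a data frame;
#the columns and the example rows are illustrative only.
sportsPackages = data.frame(
  package = c("SportsAnalytics", "engsoccerdata", "nhlscrapr"),
  sport   = c("generic", "football (soccer)", "ice hockey"),
  type    = c("analysis and data", "data", "analysis and data"),
  source  = c("CRAN", "Github", "CRAN"),
  stringsAsFactors = FALSE
)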

If you know of any others, please let me know via the comments and I’ll try to keep this page updated with a reasonably current list.

As well as packages, here are some links to blog posts that look at sports data analysis using R:

Again, if you can recommend further posts, please let me know via the comments.

Tata F1 Connectivity Innovation Prize, 2015 – Telemetry Dash

I’ve run out of time trying to put together an entry for this year’s first round of the Tata F1 Connectivity Innovation Prize (brief [PDF]), which this year sets the challenge of designing a data dashboard that displays the following information:

[Image: prize_tatacommunications_com_Challenge_1_Brief_for_the_F1_Connectivity_Innovation_Prize_pdf]

Just taking the data as presented turns it into something of an infographic design exercise (I wonder if anyone will submit entries using infogr.am?!), but the reality is much more that it needs to be a real-time telemetry dashboard.

My original sketch is just a riff on the data as given:

[Image: dash1]

(From the steering angle range, I guess straight ahead must be 90 degrees?! Which would put 12 degrees as a stupidly left turn? Whatever… you get the idea!)

Had I had time, I think I’d have extended this to include historical traces of some of the data, eg using something like the highcharts dynamic demo that could stream a recent history of particular values, perhaps taking inspiration too from Making Sense of Squiggly Lines. One thing I did think might be handy in this context was “sample and hold” colour digit or background alerts which would retain a transient change for a second or two – for example, recording that the steering wheel had been given a quick left-right – that could direct attention to the historical trace if the original incident was missed or needed clarification.

The positioning of RPM then throttle is based on the idea that the throttle is a request for revs. Putting throttle (racing green for go) and brake (red for stop) next to each other allows control commands to be compared, and putting brake and speed (Mercedes silver/grey – these machines are built for speed) next to each other is based on the idea that you brake to retard the vehicle (i.e. slow it down). (I also considered the idea of the speed indicator as a vertical bar close to the gear indicator, but felt that if the bars are rapidly changing, which they are likely to do, it could be quite jarring if vertical bars are going up and down at right angles to each other? What I hope the current view would do is show more of a ratchet effect across all the bars?) The gear indicator helps group these four indicators together. (I think that white background should actually be yellow?) In the event of a gear being lost, the colour of that gear number could fade further in grey towards black. A dot to the right of the scale could also indicate the ideal gear to be in at any particular point.

The tyre display groups the tyres and indicates steering angle as well as tyre temperature, colour coded according to a spectrum colour scale. (The rev counter is also colour coded.) The temperature values are also displayed in a grid to allow for easy comparison, and again are match-colour coded. The steering angle is also displayed as a numerical indicator value, and further emphasised by the Mercedes logo (Mercedes are co-sponsoring the competition, I think? That said, I suspect their brand police, if they are anything like the OU’s, may have something to say about tilting the logo though?!) The battery indicator (CC: “Battery icon” by Aldendino) is grouped with the tyres on the grounds that battery and tyres are both resources that need to be managed.

In additional material, I’d possibly also have tried to demo some alerts, such as an overcooked tyre (note the additional annotation, which should have been in the original, showing the degrees C unit):

[Image: Presentation1]

and perhaps also included a note about possible additional channels – hinting at tyre pressure based on the size of each tyre, perhaps, or showing where another grid showing individual tyre pressures might go, or (more likely), assuming a user-interactive display, a push-to-toggle view, or even a toggling display, that shows tyre temperature or pressure in the same location at different times. There probably also needs to be some sort of indication of brake balance in there too – perhaps a dot that moves around the central grid cross, perhaps connected by a line to the origin of the cross?

The brief also asks for some info about refresh rates – Tata are in the comms business after all… I guess things to take into account are the bandwidth of the telemetry from the car (2 megabits per second looks reasonable?), the width of the data from each sensor, along with their sampling rates (info from ECU specs), and perhaps a bit of psychology: what sorts of refresh rate can folk cope with when numerical digit displays update, for example (e.g. when watching a time code on a movie)? Maybe also check out some bits and pieces on realtime dashboard design and example race dashboard designs to see what sorts of metaphor or common design styles are likely to be familiar to team staff (and hence not need as much decoding). Looking back at last year’s challenge might also be useful. E.g. the timing screen whose data feed was a focus there used a black background and a palette of white, yellow, green, purple, cyan and red. There are conventions associated with those colours that could perhaps be drawn on here. (Also, using those colours perhaps makes sense in that race teams are likely to be familiar with distinguishing those colours and associating meaning with them.)
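As a crude back-of-the-envelope sketch of the sort of sum involved (every number here is an assumption for illustration, not a figure from the brief or from any ECU spec):

#All of these numbers are assumptions, for illustration only
link_bps = 2e6          #assumed telemetry link bandwidth: 2 megabits per second
channels = 100          #assumed number of sensor channels being streamed
bits_per_sample = 16    #assumed data width of each sample
sample_rate = 100       #assumed samples per second per channel
required_bps = channels * bits_per_sample * sample_rate
required_bps            #160,000 bits per second under these assumptions
required_bps / link_bps #i.e. about 8% of the assumed 2Mbit/s link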

I’ve never really tried to put a dashboard together… There’s lots to consider, isn’t there?!

Literate DevOps? Could We Use IPython Notebooks To Build Custom Virtual Machines?

A week or so ago I came across a couple of IPython notebooks produced by Catherine Devlin covering the maintenance and tuning of a PostgreSQL server: DB Introspection Notebook (example 1: introspection, example 2: tuning, example 3: performance checklist). One of the things we have been discussing in the TM351 course team meetings is the extent to which we “show our working” to students in terms of how the virtual machine and the various databases used in the course were put together, even if we don’t actually teach that stuff.

Notebooks make an ideal way of documenting the steps taken to set up a particular system, blending commentary with executable cells containing command line instructions as well as code.

The various approaches I’ve explored to build the VM have arguably been over-complex – vagrant, puppet, docker and docker-compose – but I’ve always seen the OU as a place where we explore the technologies we’re teaching – or might teach – in the context of both course production and course delivery (that is, we can often use a reflexive approach whereby the content of the teaching also informs the development and delivery of the teaching).

In contrast, in A DevOps Approach to Common Environment Educational Software Provisioning and Deployment I referred to a couple of examples of a far simpler approach, in which common research, or research and teaching, VM environments were put together using simple scripts. This approach is perhaps more straightforwardly composable, in that if someone has additional requirements of the machine, they can just add a simple configuration script to bring in those elements.

In our current course example, where the multiple authors have a range of skill and interest levels when it comes to installing software and exploring different approaches to machine building, I’m starting to wonder whether I should have started with a simple base machine running just an IPython notebook server and no additional libraries or packages, and then created a series of notebooks, one for each part of the course (which broadly breaks down to one part per author), containing instructions for installing all the bits and pieces required for just that part of the course. If there’s duplication across parts, trying to install the same thing for each part, that’s fine – the various package managers should be able to cope with that. (The only issue would arise if different authors needed different versions of the same package, for some reason, and I’m not sure what we’d do in that case?)

The notebooks would then include explanatory documentation and code cells to install Linux packages and python packages. Authors could take over the control of setup notebooks, or just make basic requests. At some level, we might identify a core offering (for example, in our course, this might be the inclusion of the pandas package) that might be pushed up into a core configuration installation notebook executed prior to the installation notebook for each part.

Configuring the machine would then be a case of running the separate configuration notebooks for each part (perhaps preceded by a core configuration notebook), perhaps by automated means. For example, ipython nbconvert --to=html --ExecutePreprocessor.enabled=True configNotebook_1.ipynb will do the job [via StackOverflow]. This generates an output HTML report from running the code cells in the notebook (which can include command line commands) in a headless IPython process (I think!).

The following switch may also be useful (it clears the output cells): ipython nbconvert --to=pdf --ExecutePreprocessor.enabled=True --ClearOutputPreprocessor.enabled=True RunMe.ipynb (note in this case we generate a PDF report).

To build the customised VM box, the following route should work:

– set up a simple Vagrant file to import a base box
– install IPython into the box
– copy the configuration notebooks into the box
– run the configuration notebooks
– export the customised box

This approach has the benefits of using simple, literate configuration scripts described within a notebook. This makes them perhaps a little less “hostile” than shell scripts, and perhaps makes it easier to build in tests inline, and report on them nicely. (If running a cell results in an error, I think the execution of the notebook will stop at that point?) The downside is that to run the notebooks, we also need to have IPython installed first.

A DevOps Approach to Common Environment Educational Software Provisioning and Deployment

In Distributing Software to Students in a BYOD Environment, I briefly reviewed a paper that reported on the use of Debian metapackages to support the configuration of Linux VMs for particular courses (each course having its own Debian metapackage that could install all the packages required for that course).

This idea of automating the build of machines comes under the wider banner of DevOps (development and operations). In a university setting, we might view this in several ways:

  • the development of course related software environments during course production, the operational distribution and deployment of software to students, updating and support of the software in use, and maintenance and updating of software between presentations of a course;
  • the development of software environments for use in research, the operation of those environments during the lifetime of a research project, and the archiving of those environments;
  • the development and operation of institutional IT services.

In an Educause review from 2014 (Cloud Strategy for Higher Education: Building a Common Solution, EDUCAUSE Center for Analysis and Research (ECAR) Research Bulletin, November, 2014 [link]), a pitch for universities making greater use of cloud services, the authors make the observation that:

In order to make effective use of IaaS [Infrastructure as a Service], an organization has to adopt an automate-first mindset. Instead of approaching servers as individual works of art with artisan application configurations, we must think in terms of service-level automation. From operating system through application deployment, organizations need to realize the ability to automatically instantiate services entirely from source code control repositories.

This is the approach I took from the start when thinking about the TM351 virtual machine, focussing more on trying to identify production, delivery, support and maintenance models that might make sense in a factory production model that should work in a scaleable way, not only across presentations of the same course but also across different courses and across different platforms (students’ own devices, OU managed cloud hosts, student launched commercial hosts), rather than just building a bespoke, boutique VM for a single course. (I suspect the module team would have preferred my focussing on the latter – getting something that works reliably, has been rigorously tested, and can be delivered to students, rather than pfaffing around with increasingly exotic and still-not-really-invented-yet tooling that I don’t really understand to automate production of machines from scratch that still might be a bit flaky!;-)

Anyway, it seems that the folk at Berkeley have been putting together a “Common Scientific Compute Environment for Research and Education” [Clark, D., Culich, A., Hamlin, B., & Lovett, R. (2014). BCE: Berkeley’s Common Scientific Compute Environment for Research and Education, Proceedings of the 13th Python in Science Conference (SciPy 2014).]


The BCE – Berkeley Common Environment – is “a standard reference end-user environment” consisting of a simply skinned Linux desktop running in a virtual machine delivered as a Virtualbox appliance that “allows for targeted instructions that can assume all features of BCE are present. BCE also aims to be stable, reliable, and reduce complexity more than it increases it”. The development team adopted a DevOps style approach customised for the purposes of supporting end-user scientific computing, arising from the recognition that they “can’t control the base environment that users will have on their laptop or workstation, nor do we wish to! A useful environment should provide consistency and not depend on or interfere with users’ existing setup”, further “restrict[ing] ourselves to focusing on those tools that we’ve found useful to automate the steps that come before you start doing science”. Three main frames of reference were identified:

  • instructional: students could come from all backgrounds and are often unlikely to have sys admin skills over and above the ability to use a simple GUI approach to software installation: “The most accessible instructions will only require skills possessed by the broadest number of people. In particular, many potential students are not yet fluent with notions of package management, scripting, or even the basic idea of commandline interfaces. … [W]e wish to set up an isolated, uniform environment in its totality where instructions can provide essentially pixel-identical guides to what the student will see on their own screen.”
  • scientific collaboration: that is, the research context: “It is unreasonable to expect any researcher to develop code along with instructions on how to run that code on any potential environment.” In addition, “[i]t is common to encounter a researcher with three or more Python distributions installed on their machine, and this user will have no idea how to manage their command-line path, or which packages are installed where. … These nascent scientific coders will have at various points had a working system for a particular task, and often arrive at a state in which nothing seems to work.”
  • Central support: “The more broadly a standard environment is adopted across campus, the more familiar it will be to all students”, with obvious benefits when providing training or support based on the common environment.

Whilst it was recognised that personal laptop computers are perhaps the most widely used platform, the team argued that the “environment should not be restricted to personal computers”. Some scientific computing operations are likely to stretch the resources of a personal laptop, so the environment should also be capable of running on other platforms such as hefty workstations or on a scientific computing cloud.

The first consideration was to standardise on an O/S: Linux. Since the majority of users don’t run Linux machines, this required the use of a virtual machine (VM) to host the Linux system, whilst still recognising that “one should assume that any VM solution will not work for some individuals and provide a fallback solution (particularly for instructional environments) on a remote server”.

Another issue that can arise is dealing with mappings between host and guest OS, which vary from system to system – arguing for the utility of an abstraction layer for VM configuration like Vagrant or Packer … . This includes things like portmapping, shared files, enabling control of the display for a GUI vs. enabling network routing for remote operation. These settings may also interact with the way the guest OS is configured.

Reflecting on the “traditional” way of building a computing environment, the authors argued for a more automated approach:

Creating an image or environment is often called provisioning. The way this was done in traditional systems operation was interactively, perhaps using a hybrid of GUI, networked, and command-line tools. The DevOps philosophy encourages that we accomplish as much as possible with scripts (ideally checked into version control!).

The tools explored included Ansible, packer, vagrant and docker:

  • Ansible: to declare what gets put into the machine (alternatives include shell scripts, puppet etc.). (For the TM351 monolithic VM, I used puppet.) End-users don’t need to know anything about Ansible, unless they want to develop a new, reproducible, custom environment.
  • packer: used to run the provisioners and construct and package up a base box. Again, end-users don’t need to know anything about this. (For the TM351 monolithic VM, I used vagrant to build a basebox in Virtualbox, and then package it; the power of Packer is that it lets you generate builds from a common source for a variety of platforms (AWS, Virtualbox, etc etc).)
  • vagrant: their description is quite a handy one: “a wrapper around virtualization software that automates the process of configuring and starting a VM from a special Vagrant box image … . It is an alternative to configuring the virtualization software using the GUI interface or the system-specific command line tools provided by systems like VirtualBox or Amazon. Instead, Vagrant looks for a Vagrantfile which defines the configuration, and also establishes the directory under which the vagrant command will connect to the relevant VM. This directory is, by default, synced to the guest VM, allowing the developer to edit the files with tools on their host OS. From the command-line (under this directory), the user can start, stop, or ssh into the Vagrant-managed VM. It should be noted that (again, like Packer) Vagrant does no work directly, but rather calls out to those other platform-specific command-line tools.” However, “while Vagrant is conceptually very elegant (and cool), we are not currently using it for BCE. In our evaluation, it introduced another piece of software, requiring command-line usage before students were comfortable with it”. This is one issue we are facing with the TM351 VM – currently the requirement to use vagrant to manage the VM from the commandline (albeit this only really requires a couple of commands – we can probably get away with just: vagrant up && vagrant provision and vagrant suspend – but also has a couple of benefits, like being able to trivially vagrant ssh in to the VM if absolutely necessary…).
  • docker: was perceived as adding complexity, both computationally and conceptually: “Docker requires a Linux environment to host the Docker server. As such, it clearly adds additional complexity on top of the requirement to support a virtual machine. … the default method of deploying Docker (at the time of evaluation) on personal computers was with Vagrant. This approach would then also add the complexity of using Vagrant. However, recent advances with boot2docker provide something akin to a VirtualBox-only, Docker-specific replacement for Vagrant that eliminates some of this complexity, though one still needs to grapple with the cognitive load of nested virtual environments and tooling.” The recent development of Kitematic addresses some of the use-case complexity, and also provides GUI based tools for managing some of the issues described above associated with portmapping, file sharing etc. Support for linked container compositions (using Docker Compose) is still currently lacking though…

At the end of the day, Packer seems to rule them all – coping as it does with simple installation scripts and being able to then target the build for any required platform. The project homepage is here: Berkeley Common Environment and the github repo here: Berkeley Common Environment (Github).

The paper also reviewed another common environment – OSGeo. Once again built on top of a common Linux base, well documented shell scripts are used to define package installations: “[n]otably, the project uses none of the recent DevOps tools. OSGeo-Live is instead configured using simple and modular combinations of Python, Perl and shell scripts, along with clear install conventions and examples. Documentation is given high priority. … Scripts may call package managers, and generally have few constraints (apart from conventions like keeping recipes contained to a particular directory)”. In addition, “[s]ome concerns, like port number usage, have to be explicitly managed at a global level”. This approach contrasts with the approach reviewed in Distributing Software to Students in a BYOD Environment where Debian metapackages were used to create a common environment installation route.


The idea of a common environment is a good one, and that would work particularly well in a curriculum such as Computing, I think? One main difference between the BCE approach and the TM351 approach is that BCE is self-contained and runs a desktop environment within the VM, whereas the TM351 environment uses a headless VM and follows more of a microservice approach that publishes HTML based service UIs via http ports that can be viewed in a browser. One disadvantage of the latter approach is that you need to keep a more careful eye on port assignments (in the event of collisions) when running the VM locally.

A Couple of Interesting Interactive Data Storytelling Devices

A couple of interesting devices for trying to engage folk in a data mediated story. First up, a chart that is reminiscent in feel of Hans Rosling’s ignorance test, in which (if you aren’t familiar with it) audiences are asked a range of data-style questions and then demonstrate their ignorance based on their preconceived ideas about what they think the answer to the question is – and which is invariably the wrong answer (with the result that audiences perform worse than random – or as Hans Rosling often puts it, worse than a chimpanzee; by the by, Rosling recently gave a talk at the World Bank, which included a rendition of the ignorance test. Rosling’s dressing down of the audience – who make stats based policy and help spend billions in the areas covered by Rosling’s questions, yet still demonstrated their complete lack of grasp about what the numbers say – is worth watching alone…).

Anyway – the chart comes from the New York Times, in a post entitled You Draw It: How Family Income Predicts Children’s College Chances. A question is posed and the reader is invited to draw the shape of the curve they think describes the relationship between family income and college chances:

[Image: nty-youDrawIt]

Once you’ve drawn your line and submitted it, you’re told how close your answer is to the actual result:

[Image: You_Draw_It__How_Family_Income_Predicts_Children’s_College_Chances_-_The_New_York_Times]

Another display demonstrates the general understanding calculated from across all the submissions.

[Image: You_Draw_It__How_Family_Income_Predicts_Children’s_College_Chances_-_The_New_York_Times2]

Textual explanations also describe the actual relationship, putting it into context and trying to explain the basis of the relationship. As ever, a lovely piece of work, and once again with Amanda Cox in the credits…

The second example comes from Bloomberg, and riffs on the idea of immersive stories to produce a chart that gets updated as you scroll through (I saw this described as “scrollytelling” by @arnicas):

[Image: What_s_Really_Warming_the_World__Climate_deniers_blame_natural_factors__NASA_data_proves_otherwise]

The piece is about global warming and shows the effect of various causal factors on temperature change, at first separately, and then in additive composite form to show how they explain the observed increase. It’s a nice device for advocacy, methinks…

It also reminds me that I never got round to trying out the Knight Lab Storymap.js with a zoomified/tiled chart image as the basis for a storymap (or should that be, storychart? For other storymappers, see Seven Ways to Create a Storymap). I just paid out the $19 or so for a copy of zoomify to convert large images to tilesets to work with that app, though I guess I really should have tried to hack a solution out with something like Imagemagick (I think that can export tiles?) or Inkscape (which would let me convert a vector image to tiles, I think?). Anyway, I just need a big image to try out now, which I guess I could generate from some F1 data using ggplot?
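By way of a sketch only (assuming the ggplot2 package, and reusing the lapTimes data frame pulled from the ergast API in the undercut post above; the filename and dimensions are arbitrary):

#Sketch: render a large lap chart image that could then be tiled for the storymap
library(ggplot2)
g = ggplot(lapTimes, aes(x=lap, y=position, group=driverId, colour=driverId)) +
      geom_line()
#Write the chart out as a large image file for the tiling tool to chew on
ggsave("lapchart_big.png", g, width=40, height=20, dpi=300)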