Jupyter is Not Just Notebooks

Last week, I filled an hour in a department seminar showing ways in which we could use Jupyter notebooks to support the creation and use of interactive educational materials.

I’ve no idea if it converted anyone to the cause.

I could have done any number of other talks: about the architecture of the Jupyter ecosystem more widely (at least, insofar as I understand it), or the way in which Jupyter makes sense for reproducible research and how it fits into a containerised / virtualised way of working.

Because Jupyter is not just about notebooks.

It’s also about string and glue.

Here’s something I suddenly grokked the other day whilst chatting to somebody about different ways of accessing applications that have a graphical UI… (on a desktop; on a desktop in a VM; via X11 (“what’s that?” they asked… sigh…); via a browser if it has an HTML UI; via novnc in a browser window if it doesn’t (albeit w/ borked audio support); note to self – try out this novnc Jupyter extension): if you wrap an application that has a command line interface using metakernel, you can access it in a notebook, or JupyterLab.

Obvious, right? But that means I can also access it via a web page using something like ThebeLab (or Juniper, or nbinteract), run via a container launched using Binderhub.

This is all tied up with a couple of the Big Ideas that underlie Jupyter: firstly, that it supports the read/write web; secondly, that it supports remote code execution (and as such enables the read/write/execute web).

So for example, one of the many metakernel based kernels is the gnuplot_kernel that lets you run Gnuplot commands from a notebook code cell and display the generated figure in a notebook. Here’s a forked version with the repo tweaked so it runs on MyBinder.

Using a gnuplot_kernel enabled Binder repo, we can now run Gnuplot commands via a web-browser using the ThebeLab Javascript package, for example, and display the result in the same web page. The container on the back end is fired up in response to the first command issued from the page, which may take up to a minute or two, and will be used for future commands issued from the page in the same session.

Here’s what it looks like:

(The Gnuplot code is ripped from an example in the Gnuplot docs / gallery.)

The code seems to be repeated in the output, but I guess a tweak to the ThebeLab settings, or code, may fix that. Or maybe the kernel needs a tweak. But the proof of concept is there…

Here’s the code for the web page (image file, sorry… WordPress-com editor’n’sourcecode support sucks and I get fed up faffing around with tag brackets each time I re-edit the page):

That source code image does make a second point, though… Look closely, and compare the URLs in the two images above: I can edit an HTML file via the Jupyter notebook text file editor, and also render the page as a served HTML file.

So that’s a couple more things for my colleagues to say “ah, but it won’t work for my course because…”

Bring it on…

PS the code as a gist:

Plotting Tabular (CSV) Data and Algebraic Expressions On the Same Graph Using Gnuplot

A couple of the reasons why I’ve been making so much use of Formula 1 data for visualisations lately are that: a) the data is authentic, representing someone’s legitimate data but presented in a way that is outside my control (so I have to wrangle with scraping, representation and modeling issues); b) if I get anything wrong whilst I’m playing, it doesn’t matter… (Whereas if I plotted some university ranking tables and oopsed to show that all the students at X were unhappy, its research record and teaching quality were lousy, and it was going to be charging 9k fees on courses with a 70% likelihood of closing most of them, when in fact it was doing really well, I might get into trouble…;-)

I’ve also been using the data as a foil for finding tools and applications that I can use to create data visualisations that other people might be interested in trying out too. There is a work-related rationale here too: in October, I hope to run a “MOOC”, (all you need to know…) on visual analysis as a live development/production exercise for a new short course that will hopefully be released next year, and part of that will involve the use of various third party tools and applications for the hands-on activities.

One of the issues I’ve recently faced is how to plot a chart that combines tabulated data imported from a CSV file with a line graph plotted from an equation. My ideal tool would be a graphical environment that lets me import data and plot it, and then overlay a plot generated from a formula or equation of my own. Being able to apply a function to the tabulated data (for example, to subtract a value y = sin(x) from the tabular data) would be ideal, but not essential.

In this post, I’ll describe one tool – Gnuplot – that meets at least the first requirement, and show how it can be used to plot some time series data from a CSV file overlaid with a decreasing linear function. (Which is to say, how to plot F1 laptime data against the fuel weight time penalty, that is, the amount of time that the weight of the fuel in the car slows the car down by… For more on this, see F1 2011 Turkey Race – Fuel Corrected Laptimes.)

I’ve been using Gnuplot off and on for a couple of decades(?!), though I’ve really fallen out of practice with it over the last ten years… (not doing research!;-)

The easiest way of using the tool is to launch it in the directory where your data files are stored. So for example, if the data file resides in the directory \User\tony\f1\data, I would launch my terminal, enter cd \User\tony\f1\data or the equivalent to move to that directory, and then start gnuplot there using the command gnuplot:

gnuplot> set term x11
Terminal type set to 'x11'
gnuplot> set xrange [1:58]
gnuplot> set xlabel "Lap"
gnuplot> set ylabel "Fuel Weight Time Penalty"
gnuplot> set datafile separator ","

For some reason, my version of Gnuplot (on a Mac) wouldn’t display any graphs till I set the output to use x11… The set xrange [1:58] command sets the range of the x-axis (there are 58 laps in the race, hence those settings). The xlabel and ylabel settings are hopefully self-explanatory (they define axis labels). The set datafile separator "," command prepares Gnuplot to load in a file formatted as tabular data, one row per line, with commas separating the columns. (I assume that if you pass in a row such as "this, that", the other then the quoted "this, that" string is detected as a single column/cell value, and not as two columns with cell values "this" and "that"? I forget…)

The data file I have is not as clean as it might be. (If you want to play along, the data file is here). It’s arranged as follows:

Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime

That is, there is one row per driver lap. Each driver’s data is on consecutive lines, in increasing lap number, so driver 25 is on lines 1 to 56, driver 20’s data starts on line 26 and so on…
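Here’s a minimal Python sketch of what that layout implies: if each driver’s rows sit in one consecutive block, we can work out which row range belongs to which driver. The function name and the driver list are made up for illustration; they aren’t taken from the real data file.

```python
def driver_row_ranges(drivers):
    """Map each driver number to the (first, last) 0-based data rows it occupies.

    Assumes, as in the layout described above, that each driver's rows
    appear as one consecutive block.
    """
    ranges = {}
    for i, driver in enumerate(drivers):
        first, _ = ranges.get(driver, (i, i))
        ranges[driver] = (first, i)
    return ranges


# Hypothetical column 1 values: three laps for driver 25, then two for driver 20.
print(driver_row_ranges([25, 25, 25, 20, 20]))
```

Knowing those ranges is what lets us pick out one driver’s block of rows by row number.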

To plot from a data file, we use the command plot 'turlapTimeFuel.csv' (that is, plot 'filename'). To pull data from column 3, we add the subcommand using 3 (x counts up incrementally, so we get a plot of column 3 against increasing row number):
gnuplot> plot 'turlapTimeFuel.csv' using 3

To plot just a range of rows (e.g. rows 0 to 57; the header row is ignored) against row number, we add an every subcommand of the form every::start::end alongside a subcommand of the form using y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 3

To specify the x value (e.g. to plot column 3 as y against column 2 as x), we use a subcommand of the form using x:y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 2:3

(The first column is column 1; I think the first row is row 0…)
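To make the indexing concrete, here’s a rough Python analogue of every::start::end using x:y, assuming (as the note above suggests) that columns count from 1 and rows count from 0. The select_rows function and the sample rows are hypothetical, purely for illustration.

```python
# Made-up rows in the post's column order: Driver, Lap, Lap Time.
ROWS = [
    ["25", "1", "110.5"],
    ["25", "2", "104.1"],
    ["25", "3", "103.8"],
    ["20", "1", "109.9"],
]


def select_rows(rows, start, end, xcol, ycol):
    """Return (x, y) pairs for rows start..end inclusive.

    Rows are 0-based and columns are 1-based, mirroring the way
    every::start::end and using xcol:ycol are read in the post.
    """
    return [
        (float(row[xcol - 1]), float(row[ycol - 1]))
        for row in rows[start:end + 1]
    ]


# Roughly: plot ... every::0::2 using 2:3
print(select_rows(ROWS, 0, 2, 2, 3))
```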

So for example, we can plot Laptime against driver using plot 'turlapTimeFuel.csv' using 1:3, or Fuel Adjusted Laptime against driver using plot 'turlapTimeFuel.csv' using 1:4. The command plot 'turlapTimeFuel.csv' using 2:3 gives us a plot of each driver’s laptime against lap number.

But how do we plot the data for just a single driver? We saw how to plot against a consecutive range of row values (e.g. every::12::23 for rows 12 to 23), but plotting the laptime data for each driver this way is really painful (we have to know which range of row numbers the data we want to plot are on). Instead we can filter out the data according to driver number (the column 1 values):
gnuplot> plot 'turlapTimeFuel.csv' using ($1==4 ? $2:1/0):3 with lines

How do we read this? The command is actually of the form using x:y, but we do a bit of work to choose a valid value of x. ($1==4 ? $2:1/0):3 says “if the value of column 1 ($1) equals 4, then (?) select column 2, otherwise/else (the first “:”), forget it” (1/0 is one divided by zero, a number intensely disliked by gnuplot that says, in this context, do nothing more with this row…). If the value of column 1 does equal 4, then we create a valid statement using 2:3, otherwise we ignore the row and the data in columns 2 and 3. The whole statement thus just plots the data for driver 4.
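If it helps, here’s the same filtering idea sketched in Python. This is an analogy rather than a description of how gnuplot works internally: NaN stands in for the undefined 1/0 value, and the sample data and function names are made up for illustration.

```python
import csv
import io
import math

# Hypothetical rows in the post's column order (Driver, Lap, Lap Time).
SAMPLE = """Driver,Lap,Lap Time
4,1,101.2
4,2,95.4
5,1,102.0
4,3,94.9
"""


def x_for_driver(row, driver):
    """Mimic ($1==driver ? $2 : 1/0): the lap number if the row matches, else NaN."""
    return float(row[1]) if int(row[0]) == driver else math.nan


def points_for_driver(text, driver):
    """Collect (lap, laptime) pairs for one driver, skipping undefined x values."""
    reader = csv.reader(io.StringIO(text))
    next(reader)  # skip the header row
    points = []
    for row in reader:
        x = x_for_driver(row, driver)
        if math.isnan(x):
            continue  # an undefined x means this row is not plotted
        points.append((x, float(row[2])))
    return points


print(points_for_driver(SAMPLE, 4))
```

The point is that no rows are removed from the file; each row is simply tested as it is read, and rows that fail the test never produce a point.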

Rather than plot points, the with lines command will join consecutive points using a line, to produce a line chart:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines

We can add an informative label using the title subcommand:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)"

We can also plot two drivers’ times on the same chart using different lines:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==2 ? $2:1/0):3 with lines title "WEB "

We can also plot functions. In the following case, I plot the time penalty applied to a car for each lap on the basis of how much more fuel it is carrying at the start of the race compared to the end:
gnuplot> plot 90+(58-x)*2.7*0.03
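As a quick sanity check on that expression, here’s a small Python sketch that evaluates it at a few laps. The constant names are my own labels for the numbers in the plotted expression; reading 2.7*0.03 as a per-lap time penalty is an assumption based on how the expression is used here.

```python
TOTAL_LAPS = 58               # race length used for the x range
BASE_TIME = 90.0              # constant offset in the plotted expression
PENALTY_PER_LAP = 2.7 * 0.03  # product of the two constants in the expression


def fuel_penalty_line(lap, base=BASE_TIME):
    """Evaluate base + (58 - lap) * 2.7 * 0.03 for a given lap number."""
    return base + (TOTAL_LAPS - lap) * PENALTY_PER_LAP


for lap in (1, 29, 58):
    print(lap, round(fuel_penalty_line(lap), 3))
```

The line falls steadily from a little over 94.6 on lap 1 to exactly 90 on lap 58, when there are no further laps’ worth of fuel on board.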

We can now overlay the drivers’ times and the fuel penalty on the same chart:
gnuplot> set yrange [85:120]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "VET Su | Su(11) Su(25) Hn(40) Hn(47)",90+(58-x)*2.7*0.03

It’s also possible to do sums on the data. If you read $4 as “use the value in column 4 of the current row”, you can start to guess at creating things like the following, which in the first part plots cells in the laptime column (column 3) modified by the fuel penalty. (I also plot the pre-calculated fuel adjusted laptime data from column 5 as a comparison. The only difference in values is the offset…)
gnuplot> set yrange [-1:20]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines,'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):5 with lines

Doh! I should have changed the y-axis label…:-(

That is, plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines says: for rows where the value in column 1 ($1) equals (==) the driver number (1), that is, $1==1, use the value in column 2 ($2) as the x-value, and for the y-value use the value in column 3 ($3) minus (88+(58-$2)*2.7*0.03). Note that the (58-$2) fragment subtracts the lap number (as contained in the column 2 ($2) cell) from the lap count to work out how many more laps’ worth of fuel the car is carrying in the current lap than at the end of the race.
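The y-value calculation in that plot command can be sketched in Python as follows. The function name and the example laptimes are hypothetical; the constants (88, 58, 2.7, 0.03) are the ones used in the plot command.

```python
def fuel_corrected(lap, laptime, base=88.0, total_laps=58):
    """Mimic $3-(88+(58-$2)*2.7*0.03): laptime minus the fuel penalty line."""
    return laptime - (base + (total_laps - lap) * 2.7 * 0.03)


# Made-up laptimes, just to show the arithmetic at the start and end of the race.
print(round(fuel_corrected(1, 95.0), 3))
print(round(fuel_corrected(58, 90.0), 3))
```

On the final lap the (58-$2) term is zero, so the corrected value is just the laptime minus the 88 offset.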

So – that’s a quick tour of gnuplot, showing how it can be used to plot CSV data and an algebraic expression on the same graph, how to filter the data plotted from the CSV file using particular values in a specified column, and how to perform a mathematical operation on the data pulled in from the CSV file before plotting it (and without changing it in the original file).

Just in passing, if you need an online formula plotter, Wolfram Alpha is rather handy… it can do calculus for you too…

PS In a forthcoming post, I’ll describe another tool for creating similar sorts of plot – GeoGebra. If you know of other free, cross-platform, ideally open source, hosted or desktop applications similar to Gnuplot or GeoGebra, please post a link in the comments:-)

PPS I quite like this not-so-frequently-asked questions cribsheet for Gnuplot

PPPS for another worked through example, see F1DataJunkie: F1 2011 Turkey Race – Race Lap History