A couple of the reasons why I’ve been making so much use of Formula 1 data for visualisations lately are that: a) the data is *authentic*, representing someone’s legitimate data but presented in a way that is outside my control (so I have to wrangle with scraping, representation and modeling issues); b) if I get anything wrong whilst I’m playing, *it doesn’t matter*… (Whereas if I plotted some university ranking tables and oopsed to show that all the students at X were unhappy, its research record and teaching quality were lousy, and it was going to be charging 9k fees on courses with a 70% likelihood of closing most of them, when in fact it was doing really well, I might get into trouble…;-)

I’ve also been using the data as a foil for finding tools and applications that I can use to create data visualisations that other people might be interested in trying out too. There is a work-related rationale here too: in October, I hope to run a “MOOC”, (all you need to know…) on *visual analysis* as a live development/production exercise for a new short course that will hopefully be released next year, and part of that will involve the use of various third party tools and applications for the hands-on activities.

One of the issues I’ve recently faced is how to plot a chart that combines tabulated data imported from a CSV file with a line graph plotted from an equation. My ideal tool would be a graphical environment that lets me import data and plot it, and then overlay a plot generated from a formula or equation of my own. Being able to apply a function to the tabulated data (for example, remove a value *y = sin(x)* from the tabular data would be ideal, but not essential.

In this post, I’ll describe one tool – Gnuplot – that meets at least the first requirement, and show how it can be used to plot some time series data from a CSV file overlaid with a decreasing linear function. (Which is to say, how to plot F1 laptime data against the fuel weight time penalty, (the amount of time that the weight of the fuel in the car slows the car down by… For more on this, see F1 2011 Turkey Race – Fuel Corrected Laptimes.)

I’ve been using Gnuplot off and on for a couple of decades(?!), though I’ve really fallen out of practice with it over the last ten years… (not doing research!;-)

The easiest way of using the tool is to launch it in the directory where your data files are stored. So for example, if the data file resides in the directory *\User\tony\f1\data*, I would launch my terminal, enter `cd \User\tony\f1\data` or the equivalent to move to that directory, and then start gnuplot there using the command `gnuplot`):

`gnuplot> set term x11
Terminal type set to 'x11'
gnuplot> set xrange [1:58]
gnuplot> set xlabel "Lap"
gnuplot> set ylabel "Fuel Weight Time Penalty"
gnuplot> set datafile separator ","`

For some reason, my version of Gnuplot (on a Mac), wouldn’t display any graphs till i set the output to use x11… The `set xrange [1:58]` command sets the range of the axis (there are 58 laps in a race, hence those settings.) The `xlabel` and `ylabel` settings are hopefully self-explanatory (they define axis labels). The `set datafile separator ","` command prepares Gnuplot to load in a file formatted as tabular data, one row per line, with commas separating the columns (I assume if you pass in something like *this, “this, that”, the other*, the *“this,that”* string is detected as a single column/cell value, and *not* as two columns with cell values *“this* and *that”*? I forget…)

The data file I have is not as clean as it might be. (If you want to play along, the data file is here). It’s arranged as follows:

Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime 25,1,106.951,102.334,9.73 25,2,99.264,94.728,2.124 ... 25,55,94.979,94.574,1.97 25,56,95.083,94.759,2.155 20,1,103.531,98.914,6.959 20,2,97.370,92.834,0.879 ...

That is, there is one row per driver lap. Each driver’s data is on a consecutive line, in increasing lap number, so driver 25 is on lines 1 to 56, driver 20’s data starts on line 26 and so on…

To plot from a data file, we use the command `plot 'turlapTimeFuel.csv'` (that is, *plot ‘ filename‘*). To pull data from columns 3 and 5, we use the subcommand

`using 3`(x would count incrementally, so we get a plot of column 3 against increasing row number) from the command:

`gnuplot> plot 'turlapTimeFuel.csv' using 3`

To plot from just a range of numbers (e.g. rows 0 to 57 (the header row is ignored), against row number, we can use a subcommand if the form *using y*:

`gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 3`

To specify the x value (e.g. to plot column 3 as y against column 2 as x), we use a subcommand of the form *using x:y*:

`gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 2:3`

(The first column is column 1; I think the first row is row 0…)

So for example, we can plot Laptime against driver using *plot ‘turlapTimeFuel.csv’ using 1:3*, or Fuel Adjusted Laptime against driver using *plot ‘turlapTimeFuel.csv’ using 1:4*. The command *plot ‘turlapTimeFuel.csv’ using 2:3* gives us a plot of each driver’s laptime against lap number.

But how do we plot the data for just a single driver? We saw how to plot against a consecutive range of row values (e.g. *every::12:23* for rows 12 to 23), but

plotting the laptime data for each driver this way is really painful (we have to know which range of row numbers the data we want to plot are on). Instead we can filter out the data according to driver number (the column 1 values):

`gnuplot> plot 'turlapTimeFuel.csv' using ($1==4 ? $2:1/0):3 with lines`

How do we read this? The command is actually of the form *using x:y*, but we do a bit of work to choose a valid value of x. *($1==4 ? 2:1/0):3* says “if the value of column 1 ($1) equals 4, then (?) select column 2, otherwise/else (the first “:”), forget it (1/0 is one divided by zero, a number intensely disliked by gnuplot that says in this context, do nothing more with this row…). If the value of column 1 does equal 4, then we create a valid statement *using 2:3*, otherwise we ignore the row and the data in columns 2 and 3. The whole statement thus just plots the data for driver 4.

Rather than plot points, the *with lines* command will join consecutive points using a line, to produce a line chart:

` plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines`

We can add an informative label using the *title* subcommand:

`plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)"`

We can also plot two drivers’ times on the same chart using different lines:

`plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)", 'turlapTimeFuel.csv' using ($1==2 ? $2:1/0):3 with lines title "WEB "`

We can also plot functions. In the following case, I plot the time penalty applied to a car for each lap on the basis of how much more fuel it is carrying at the start of the race compared to the end:

`gnuplot> plot 90+(58-x)*2.7*0.03`

We can now overlay the drivers’ times and the fuel penalty on the same chart:

`gnuplot> set yrange [85:120]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "VET Su | Su(11) Su(25) Hn(40) Hn(47)",90+(58-x)*2.7*0.03`

It’s also possible to do sums on the data. If you read $4 as “use the value in column 4 of the current row”, you can start to guess at creating things like the following, which in the first part plots cells in the laptime column 3 modified by the fuel penalty. (I also plot the pre-calculated fuel adjusted laptime data from column 5 as a comparison. The only difference in values is the offset…

`gnuplot> set yrange [-1:20]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines,'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):5 with lines`

*Doh! I should have changed the y-axis label…:-(*

That is, *plot ‘turlapTimeFuel.csv’ using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines* says: for rows applying to driver 1 rows where the value in column 1, (*$1*), equals (==) the driver number (1), *$1==1*), use the value in column 2 (*$2*) as the x-value and for the y-value use the value in column 3 (*$3*) minus 95+(58-$2)*2.7*0.03). Note that the *(58-$2)* fragment subtracts the lap number (as contained in value in the column 2 ($2) cell) from the lap count to work out how many more laps worth of fuel the car is carrying in the current lap than at the end of the race.

So – that’s a quick tour of gnuplot, showing how it can be used to plot CSV data and an algebraic expression on the same graph, how to filter the data plotted from the CSV file using particular values in a specified column, and how to perform a mathematical operation on the data pulled in from the CSV file before plotting it (and without changing it in the original file).

Just in passing, if you need an online formula plotter, Wolfram Alpha is rather handy… it can do calculus for you too…

PS In a forthcoming post, I’ll describe another tool for creating similar sorts of plot – GeoGebra. If you know of other free, cross-platform, ideally open source, hosted or desktop applications similar to Gnuplot or GeoGebra, please post a link in the comments:-)

PPS I quite like this not-so-frequently-asked questions cribsheet for Gnuplot

PPPS for another worked through example, see F1DataJunkie: F1 2011 Turkey Race – Race Lap History