Plotting Tabular (CSV) Data and Algebraic Expressions On the Same Graph Using Gnuplot

A couple of the reasons why I’ve been making so much use of Formula 1 data for visualisations lately are that: a) the data is authentic, representing someone’s legitimate data but presented in a way that is outside my control (so I have to wrangle with scraping, representation and modelling issues); b) if I get anything wrong whilst I’m playing, it doesn’t matter… (Whereas if I plotted some university ranking tables and oopsed to show that all the students at X were unhappy, its research record and teaching quality were lousy, and it was going to be charging £9k fees on courses, with a 70% likelihood of closing most of them, when in fact it was doing really well, I might get into trouble…;-)

I’ve also been using the data as a foil for finding tools and applications that I can use to create data visualisations that other people might be interested in trying out too. There is a work-related rationale here too: in October, I hope to run a “MOOC” (all you need to know…) on visual analysis as a live development/production exercise for a new short course that will hopefully be released next year, and part of that will involve the use of various third party tools and applications for the hands-on activities.

One of the issues I’ve recently faced is how to plot a chart that combines tabulated data imported from a CSV file with a line graph plotted from an equation. My ideal tool would be a graphical environment that lets me import data and plot it, and then overlay a plot generated from a formula or equation of my own. Being able to apply a function to the tabulated data (for example, removing a value y = sin(x) from the tabular data) would be ideal, but not essential.

In this post, I’ll describe one tool – Gnuplot – that meets at least the first requirement, and show how it can be used to plot some time series data from a CSV file overlaid with a decreasing linear function. (Which is to say, how to plot F1 laptime data against the fuel weight time penalty – the amount of time that the weight of the fuel in the car slows the car down by. For more on this, see F1 2011 Turkey Race – Fuel Corrected Laptimes.)

I’ve been using Gnuplot off and on for a couple of decades(?!), though I’ve really fallen out of practice with it over the last ten years… (not doing research!;-)

The easiest way of using the tool is to launch it in the directory where your data files are stored. So for example, if the data file resides in the directory /Users/tony/f1/data, I would launch my terminal, enter cd /Users/tony/f1/data (or the equivalent) to move to that directory, and then start Gnuplot there using the command gnuplot:
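That is, something along these lines (the path is just illustrative – use wherever your CSV file actually lives):

$ cd /Users/tony/f1/data
$ gnuplot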

gnuplot> set term x11
Terminal type set to 'x11'
gnuplot> set xrange [1:58]
gnuplot> set xlabel "Lap"
gnuplot> set ylabel "Fuel Weight Time Penalty"
gnuplot> set datafile separator ","

For some reason, my version of Gnuplot (on a Mac) wouldn’t display any graphs till I set the output to use x11… The set xrange [1:58] command sets the range of the x-axis (there are 58 laps in the race, hence those settings). The xlabel and ylabel settings are hopefully self-explanatory (they define the axis labels). The set datafile separator "," command prepares Gnuplot to load in a file formatted as tabular data, one row per line, with commas separating the columns. (I assume that if you pass in something like this, “this, that”, the other, the quoted “this, that” string is detected as a single column/cell value, and not as two columns with cell values “this” and “that”? I forget…)
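(As an aside, if you want to save a chart to a file rather than just view it in a window, Gnuplot can write to a file-based terminal instead. A minimal sketch – the filename is just an example:

gnuplot> set terminal png
gnuplot> set output 'laptimes.png'
gnuplot> replot

Running set term x11 again switches back to on-screen plotting.)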

The data file I have is not as clean as it might be. (If you want to play along, the data file is here). It’s arranged as follows:

Driver,Lap,Lap Time,Fuel Adjusted Laptime,Fuel and fastest lap adjusted laptime
25,1,106.951,102.334,9.73
25,2,99.264,94.728,2.124
...
25,55,94.979,94.574,1.97
25,56,95.083,94.759,2.155
20,1,103.531,98.914,6.959
20,2,97.370,92.834,0.879
...

That is, there is one row per driver lap. Each driver’s data is on consecutive lines, in increasing lap number order, so driver 25’s data is on lines 1 to 56, driver 20’s data starts on line 57, and so on…

To plot from a data file, we use the command plot 'turlapTimeFuel.csv' (that is, plot 'filename'). To pull data from a particular column (for example, column 3), we use the subcommand using 3 (the x value counts up incrementally, so we get a plot of column 3 against increasing row number), as in the command:
gnuplot> plot 'turlapTimeFuel.csv' using 3

To plot just a range of rows (e.g. rows 0 to 57; the header row is ignored) against row number, we can use a subcommand of the form every::firstrow::lastrow:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 3

To specify the x value (e.g. to plot column 3 as y against column 2 as x), we use a subcommand of the form using x:y:
gnuplot> plot 'turlapTimeFuel.csv' every::0::57 using 2:3

(The first column is column 1; I think the first row is row 0…)

So for example, we can plot Laptime against driver using plot 'turlapTimeFuel.csv' using 1:3, or Fuel Adjusted Laptime against driver using plot 'turlapTimeFuel.csv' using 1:4. The command plot 'turlapTimeFuel.csv' using 2:3 gives us a plot of each driver’s laptime against lap number.
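That is, written out at the gnuplot prompt:

gnuplot> plot 'turlapTimeFuel.csv' using 1:3
gnuplot> plot 'turlapTimeFuel.csv' using 1:4
gnuplot> plot 'turlapTimeFuel.csv' using 2:3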

But how do we plot the data for just a single driver? We saw how to plot against a consecutive range of row values (e.g. every::12::23 for rows 12 to 23), but plotting the laptime data for each driver this way is really painful (we have to know which range of row numbers the data we want to plot is on). Instead, we can filter the data according to driver number (the column 1 values):
gnuplot> plot 'turlapTimeFuel.csv' using ($1==4 ? $2:1/0):3 with lines

How do we read this? The command is actually of the form using x:y, but we do a bit of work to choose a valid value of x. ($1==4 ? $2:1/0):3 says: “if the value of column 1 ($1) equals 4, then (?) use the value of column 2 ($2); otherwise/else (the “:”) use 1/0 (one divided by zero, a number intensely disliked by gnuplot, which in this context says: do nothing more with this row…)”. If the value of column 1 does equal 4, we end up with a valid statement of the form using 2:3; otherwise we ignore the row and the data in columns 2 and 3. The whole statement thus just plots the data for driver 4.
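So for example, to pick out driver 20 (the second driver in the data sample above) we just change the test value:

gnuplot> plot 'turlapTimeFuel.csv' using ($1==20 ? $2:1/0):3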

Rather than plot points, the with lines command will join consecutive points using a line, to produce a line chart:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines

We can add an informative label using the title subcommand:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)"

We can also plot two drivers’ times on the same chart using different lines:
plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==2 ? $2:1/0):3 with lines title "WEB "

We can also plot functions. In the following case, I plot a nominal 90 second laptime plus the time penalty applied to the car on each lap, based on how much more fuel it is carrying on that lap compared to the end of the race:
gnuplot> plot 90+(58-x)*2.7*0.03
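(As a rough sanity check on the numbers, which assume something like 2.7 kg of fuel burned per lap and a 0.03 s per kg weight penalty: on lap 1 the car is carrying 57 laps’ worth of fuel more than on the final lap, so the plotted value is 90 + 57 × 2.7 × 0.03 ≈ 94.6 s, dropping back to the 90 s baseline by lap 58.)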

We can now overlay the drivers’ times and the fuel penalty on the same chart:
gnuplot> set yrange [85:120]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):3 with lines title "(unadjusted) VET Su | Su(11) Su(25) Hn(40) Hn(47)",'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):4 with lines title "VET Su | Su(11) Su(25) Hn(40) Hn(47)",90+(58-x)*2.7*0.03

It’s also possible to do sums on the data. If you read $4 as “use the value in column 4 of the current row”, you can start to guess at creating things like the following, which in the first part plots the laptime values from column 3 modified by the fuel penalty. (I also plot the pre-calculated fuel and fastest lap adjusted laptime data from column 5 as a comparison. The only difference in the values is an offset…)
gnuplot> set yrange [-1:20]
gnuplot> plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines,'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):5 with lines


Doh! I should have changed the y-axis label…:-(

That is, plot 'turlapTimeFuel.csv' using ($1==1 ? $2:1/0):($3-(88+(58-$2)*2.7*0.03)) with lines says: for rows applying to driver 1 (rows where the value in column 1 ($1) equals (==) the driver number, 1), use the value in column 2 ($2) as the x-value, and for the y-value use the value in column 3 ($3) minus (88+(58-$2)*2.7*0.03). Note that the (58-$2) fragment subtracts the lap number (the value in the column 2 ($2) cell) from the lap count to work out how many more laps’ worth of fuel the car is carrying on the current lap than at the end of the race.
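(As a quick check against the sample rows above: driver 25’s lap 2 laptime is 99.264 s; with 56 laps of fuel still to burn, the penalty term is 56 × 2.7 × 0.03 = 4.536 s, and 99.264 − 4.536 = 94.728, which is exactly the Fuel Adjusted Laptime value recorded in column 4 of that row. The column 5 values, and the series plotted above, are just shifted from this by a fixed offset.)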

So – that’s a quick tour of gnuplot, showing how it can be used to plot CSV data and an algebraic expression on the same graph, how to filter the data plotted from the CSV file using particular values in a specified column, and how to perform a mathematical operation on the data pulled in from the CSV file before plotting it (and without changing it in the original file).

Just in passing, if you need an online formula plotter, Wolfram Alpha is rather handy… it can do calculus for you too…

PS In a forthcoming post, I’ll describe another tool for creating similar sorts of plot – GeoGebra. If you know of other free, cross-platform, ideally open source, hosted or desktop applications similar to Gnuplot or GeoGebra, please post a link in the comments:-)

PPS I quite like this not-so-frequently-asked questions cribsheet for Gnuplot

PPPS for another worked through example, see F1DataJunkie: F1 2011 Turkey Race – Race Lap History

Idle Thoughts – A Few More Approaches to Making CSV Files Queryable

How much more useful would CSV data be if it was queryable?

In a previous post (Using CSV Docs As a Database) I described a recipe for importing a CSV file into a Google spreadsheet so that the data it contained could be queried using the Google visualisation API. Whilst there isn’t a lot of technical knowledge required to republish the data in this way, there is still the overhead of requiring the user to log in to a Google account, create a spreadsheet, import the CSV document and then discover the spreadsheet URL. And bearing in mind the rule of thumb that two clicks is at least one click too many, this route to making documents republishable is likely to be seen as too complicated for many people.

So here are a few partial ideas about other ways in which we might be able to (re)publish CSV documents so that they become queryable. (That is, lightweight ways of providing a query interface to a CSV doc.)

Firstly, I wonder whether or not it would be possible to tweak the example Google Chart Tools data source implementation so that a user could just upload, or link to, an already online CSV document as an external data source? The example provides a tutorial for using a CSV document as an external datasource by wrapping it with a (provided) Java library that implements the Google data source API. So I wonder: could the example be tweaked so that any CSV file uploaded to, or placed in, a specified directory/folder could be selected as the external datasource, meaning that a council officer could expose a queryable interface to a CSV document simply by pasting a copy of it into a particular folder? Or maybe the example could be modified to become a service such that, if it was passed a link to an online CSV doc, it would treat that document as the external datasource and provide the query interface to it?

Secondly, it seems to me that YQL also offers a query interface to arbitrary online CSV documents. The post Analyzing World Cup Data with YQL on the Yahoo Developer blog gives an example of how to use YQL to write queries over a CSV document (although rather perversely they use a Google spreadsheet as the CSV source!), but we can presumably do a similar thing with data listed on data.gov.uk, for example. Trying out the:
select * from csv where url='http://example.com/whatever.csv'
approach on several URLs threw an error (“Unable to parse data using default charset utf-8”), though apparently it is possible to force the select … from csv handler to accept a specified charset. (There are two problems that then arise, for me at least: i) I’m not sure how to specify the charset; ii) I don’t know how to detect the charset of the files that are apparently not being recognised by YQL as utf-8.) It does make me think, though, that guidance about setting charsets (maybe as the server-specified MIME type for CSV files? Or am I talking b****ks? If not, maybe this Google Apps Script URLFetch function might help?!) may be required as part of the gotcha guidelines for publishing online open data.

Anyway, here’s an example of a CSV file indexed on data.gov.uk that I could query – Bournemouth Libraries:

[Image: YQL querying a CSV file]

That is:
select * from csv where url='http://www.bournemouth.gov.uk/library/data/libraries.csv' and col1 like '%BH8%'

(A list of the other query filters supported within the YQL SELECT … WHERE statement can be found here: Filtering Query Results (WHERE).)

Although I forget how at the moment, I seem to remember that it is possible to create parameterised short URL queries over YQL, so I could presumably now create a short URL query along the lines of http://example.com/yql/bournemouthLibraries?postcode=???? that would take in the first half of a postcode, for example, and then query the Bournemouth libraries CSV document as a database for libraries whose addresses fall in that postcode area?

Thirdly, I almost feel obliged to demonstrate a Yahoo Pipes implementation… so without any real explanation, how about the following?

[Image: Querying a CSV doc in Yahoo Pipes]

(I suspect that for large CSV files, the only solution that would work would be the Google Viz API external datasource example. I know Yahoo Pipes borks on large HTML files, and I’ve had YQL time out trying to query a large XML file before now…)

[Note that this is a scheduled post and that I am on holiday at the moment – away from the net and without a computer to hand. Which is why I haven’t tried out any of the above, and don’t intend to for at least a week or two…;-)]

PS see also this bit of @codepo8 magic http://isithackday.com/csv-to-webservice/ (reminded of this via @OSandCMS)