*Note that these notes represent a description of the things I learned trying to draw on ideas contained within the paper and apply it to data I had available. There may be errors… if you spot any, please let me know via the comments.*

The paper uses race classification data from the 2009 season, comparing F1 and NASCAR championships and claiming to explore the extent to which positions in practice and qualifying relate to race classification. I won’t be looking at the NASCAR data, but I did try to replicate the F1 stats which I’ll start to describe later on in this post. I’ll also try to build up a series of interactive apps around the analyses, maybe along with some more traditional print format style reports.

(There are so many things I want to learn about, from the stats themselves, to possible workflows for analysis and reporting, to interactive analysis tools that I’m not sure what order any of it will pan out into, or even the extent to which I should try to write separate posts about the separate learnings…)

As a source of data, I used my f1com megascraper that grabs classification data (along with sector times and fastest laps) since 2006. (The ergast API doesn’t have the practice results, though it does have race and qualifying results going back a long way, so it could be used to do a partial analysis over many more seasons). I downloaded the whole Scraperwiki SQLite database which I could then load into R and play with offline at my leisure.

The first result of note in the paper is a chart that claims to demonstrate the Spearman rank correlation between practise and race results, qualification and race results, and championship points and race results, for each race in the season. The caption to the corresponding NASCAR graphs explains the shaded region: “In general, the values in the grey area are not statistically significant and the values in the white area are statistically significant.” A few practical uses we might put the chart to come to mind (possibly!): is qualifying position or p3 position a good indicator of race classification (that is, is the order folk finish at the end of p3 a good indicator of the rank order in which they’ll finish the race?)?; if the different rank orders are not correlated, (people finish the race in a different order to the gird position), does this say anything about how exciting the race might have been? Does the “statistical significance” of the correlation value add anything meaningful?

So what is this chart trying to show and what, if anything, might it tell us of interest about the race weekend?

First up, the statistic that’s being plotted is *Spearman’s rank correlation coefficient*. There are four things to note there:

- it’s a
**coefficient**: a coefficent is a single number that tends to come in two flavours (often both at the same time). In a mathematical equation, a coefficient is typically a constant number that is used as multiplier of a variable. So for example, in the equation`t = 2 x`, the`x`variable has the coefficient`2`. Note that the coefficient may also be a parameter, as for example in the equation`y= a.x`(where the . means ‘multiply’, and we naturally read`x`as a dependent variable that is used to determine the value of`y`having been multiplied by the value of`a`). However, a coefficient may also be a particular number that characterises a particular relationship between two things. In this case, it characterises the degree of*correlation*between two things… - The refrain “correlation is not causation” is one heard widely around the web that mocks the fact that just because two things may be correlated – that is, when one thing changes, another changes in a similar way – it doesn’t necessarily mean that the way one thing changed
*caused*the other to change in a similar way as a result. (Of course, it*might*mean that…;-). (If you want to find purely incidental correlations between things, have a look at Google correlate, that identifies different search terms whose search trends over time are similar. You can even draw your own trend over time to find terms that have trended in a similar way.)**Correlation**, then, describes the extent to which two things tend to change over time in a similar way: when one goes up, the other goes up; when one goes down, the other goes down. (If they behave in opposite ways – if one goes up steeply the other goes down steeply; if one goes up gently, the other goes down gently – then they are*negatively*or*anti*-correlated).

Correlation measures require that you have*paired*data. You know you have paired data if you can plot your data as a two dimensional scatterplot and label each point with the name of the person or thing that was measured to get the two co-ordinate values. So for example, on a plot of qualifying position versus race classification, I can label each point with the name of the driver. The qualification position and race classification is paired data around the set of drivers. - A
is used to describe the extent to which two**rank**correlation coefficient*rankings*are correlated. Ranked orders do away with the extent to which different elements in a series differ from each other and just concentrate on their rank order. That is, we don’t care how much change there is in each of the data series, we’re just interested in rank positions within each series. In the case of an F1 race, the*distribution*of laptimes during qualifying may see the first few cars separated by a few thousandths of a second, but the time between the best laptimes of consecutively placed cars at the back of the grid might be of the order of tenths of a second. (*Distributions*take into account, for example, the variety and range of values in a dataset.) However, in a*rank ordered*chart, all we are interested in is the integer position: first, second, third, …, nineteenth, twentieth. There is no information about the magnitude of the actual time difference between the laptimes, that is, how far close to or far apart from each other the laptimes of consecutively placed cars were, we just know the rank order of fastest to slowest cars. The*distribution*of the rank values is not really very interesting, or subject to change, at all.

One thing that I learned that’s possibly handy to know when decoding the jargon: rank based stats are also often referred to as*non-parametric statistics*because no assumptions are made about how the numbers are*distributed*(presumably, there are no*parameters*of note relating to how the values are distributed, such as the mean and standard deviation of a “normal” distribution). If we think about the evolution of laptimes in a race, then most of them will be within a few tenths of the fastest lap each lap, with perhaps two bunches of slower lap times (in-lap and out-lap around a pitstop). The distribution of these lap times may be interesting (for example, the distribution of laptimes on a lap when everyone is racing will be different to the distribution of lap times on a lap when several cars are pitting). On the other hand, for each lap, the distribution of the rank order of laptimes during that lap will always be the same (first fastest, second fastest, third fastest, etc.). That is not to say, however, that the rank order of the drivers’ lap times does not change lap on lap, which of course it might do (Webber might be tenth fastest on lap 1, fastest on lap 2, eight fastest on lap three, and so on).

Of course, this being stats, “non-parametric” probably means lots of other things as well, but for now my rule of thumb will be:*the distribution doesn’t matter*(that is, the statistic does not make any assumptions about the distribution of the data in order for the statistic to work; which is to say, that’s one thing you don’t have to check in order to use the statistic (erm, I think…?!) - The statistic chosen was
. Three differently calculated correlation coefficients appear to be widely used, (and also appear as possible methods in the R**Spearman’s**rank correlation coefficient`corr()`function that calculates correlations between lists of numbers): i) Pearson’s product moment correlation coefficient (how well does a straight line through an x-y scatterplot of the data describe the relationship between the x and y values, and what’s the sign of its gradient); ii) Spearman’s rank correlation coefficient (also known as*Spearman’s rho*or r_{s}); [this interactive is rather nice and shows how Pearson and Spearman correlations can differ]; iii) Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on*concordance*, which describes how the sign of the difference in rank between pairs of numbers in one data series is the same as the sign of the difference in rank between a corresponding pair in the other data series.). Other flavours of correlation coefficient are also available (for example,*Lin’s concordance correlation coefficient*, as demonstrated in this example of identifying a signature collapse in a political party’s electoral vote when the Pearson coefficient suggested it had held up, which I think is used to test for how close to a 45 degree line the x-y association between paired data points is…).

The statistical significance test is based around the “null hypothesis” that the two sets of results are not correlated; the result is significant if they are more correlated than you might expect if both are randomly ordered. this made me a little twitchy: wouldn’t it be equally valid to argue that F1 is a procession and we would expect the race position and grid position to be perfectly correlated, for example, and then define our test for significance on the extent to which they are not?

This comparison of Pearson’s product moment and Spearman’s rank correlation coefficients helped me get a slightly clearer view of the notion of “test” and how both these coefficients act as tests for particular sorts of relationship. The Pearson product moment coefficent has a high value if a strong linear relationship holds across the data pairs. The Spearman rank correlation is weaker, in that it is simply looking to see whether or not the relationship is monotonic (that is, things all go up together, or they all go down together, but the extent to which they do so need not define a linear relationship, which is an assumption of the Pearson test.). In addition, when defining the statistical significance of the test, this is dependent on particular assumptions about the distribution of the data values, at least in the case of the Pearson test. The statistical significance relates to how likely the correlation value was assuming a normal distribution in the values within the paired data series (that is, each series is assumed to represent a normally distributed set of values).

If I understand this right, it means we separate the two things out: on the one hand, we have the statistic (the correlation coefficient); on the other, we have the significance test, which tells you how likely that result us *given a set of assumptions about how the data is distributed*. A question that then comes to mind is this: is the definition of the statistic dependent on a particular distribution of the data in order for the statistic to have something interesting to say, or is it just the significance that relies on that distribution. To twist this slightly, if we can’t do a significance test, is the statistic then essentially meaningless (because we don’t know whether those values are likely to occur whatever the relationships (or even no relationship) between the data sets?). Hmm.. maybe a statistic is actually a measurement in the context of its significance given some sort of assumption about how likely it is to occur by chance?

As far as Spearman’s rank correlation coefficient goes, I was little bit confused by the greyed “not significant” boundary on the diagram shown above. The claim is that any correlation value in that grey area could be accounted for in many cases by random sorting. Take two sets of driver numbers, sort them both randomly, and much of the time the correlation value will fall in that region. (I suspect this is not only arbitrary but misleading? If you have random orderings, is the likelihood that the correlation is in the range -0.1 to 0.1 the same as the likelihood that it will be in the range 0.2 to 0.4? Is the probability distribution of correlations “uniform” across the +/- 1 range?) Also, my hazy vague memory is that the population size affects the confidence interval (see also Explorations in statistics: confidence intervals) – isn’t this the principle on which funnel plots are built? The caption to the figure suggests that the population size (the “degrees of freedom”) was different for different races (different numbers of drivers). *So why isn’t the shaded interval differently sized for those races?*

Something else confused me about the interval values used to denote the significance of the Spearman rho values – where do they come from? A little digging suggested that they come *from a table* (i.e. someone worked them out numerically, presumably by generating looking at the distribution of different random rank orderings, rather than algorithmically – I couldn’t find a formula to calculate them? I did find this on Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations by D Bonett (Psychometrika Vol. 65, No. 1, 23-28, March 2000) though). A little more digging suggested *Significance Testing of the Spearman Rank Correlation Coefficient* by J Zar (Journal of the American Statistical Association, Vol. 67, No. 339 (Sep., 1972), pp. 578- 580) as a key work on this, with some later qualification in *Testing the Significance of Kendall’s τ and Spearman’s r _{s}* by M. Nijsse (Psychological Bulletin, 1988, Vol. 103, No. 2,235-237).

Hmmm.. this post was supposed to be about running the some of the stats used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past SuccessRelate to Finish Position in NASCAR and Formula One Racing on some more recent data. But I’m well over a couple of thousand words into this post and still not started that bit… So maybe I’ll finish now, and hold the actual number crunching over to the next post…

PS I find myself: happier that I (think I) understand a little bit more about the rationale of significance tests; just as sceptical as ever I was about the actual practice;-)

]]>So this year I’m going to try to do some stats, and use some stats, and see if I can find out from personal need and personal interest whether they lead me to any insights about, or stories hidden within, various data sets I keep playing with. So things like: looking for patterns or trends, looking for outliers, and comparing one thing with another. If I can find any statistics that appear to suggest particular betting odds look particularly favourable, that might be interesting too. (As Nate Silver suggests, betting, even fantasy betting, is a great way of keeping score…)

Note that what I hope will turn into a series of posts should not be viewed as tutorial notes – they’re far more likely to be akin to student notes on a problem set the student is trying to work through, without having attended any of the required courses, and without having taken the time to read through a proper tutorial on the subject. Nor do I intend to to set out with a view to learning particular statistical techniques. Instead, I’ll be dipping into the world of stats looking for useful tools to see if they help me explore particular questions that come to mind and then try to apply them cut-and-past fashion, which is how I approach most of my coding!

Bare naked learning, in other words.

So if you thought I had any great understanding about stats – in fact, any understanding at all – I’m afraid I’m going to disabuse you of that notion. As to my use of the R statistical programming language, that’s been pretty much focussed on using it for generating graphics in a hacky way. (I’ve also found it hard, in the past, plotting pixels on screen and page in a programmable way, but R graphics libraries such as ggplot2 make it really easy at a high level of abstraction…:-)

That’s the setting then… Now: #f1stats. What’s that all about?

Notwithstanding the above (that this isn’t about learning a particular set of stats methods defined in advance) I did do a quick trawl looking for “F1 stats tutorials” to see if there were any that I could crib from directly; but my search didn’t turn up much that was directly and immediately useful (if you know of anything that might be, please post a link in the comments). There were a few things that looked like they might be interesting, so here’s a quick dump of the relevant…

- First up, I’ve been reading Nate Silver’s The Signal and the Noise, which mentions the aging stats and aging models for baseball players. I found a paper on The Age Productivity Gradient: Evidence from a sample of F1 Drivers, which hasn’t got too many scary equations in, so I may try to replicate that and then bring the models up to date (the paper is dated 2009). It would have been so nice if the authors had published code equivalents in R that I could have played with directly, but I haven’t been able to find it if they did. I also found a paper on Estimated Age Effects in Baseball, again with equations but no code, but it may provide additional clues. From a quick skim, I think there may be some mileage in trying to get my head round different ways of comparing
*rankings*. - A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing is perhaps an easier thing to try to copy for starters, though>
- The article The wisdom of ignorant crowds: predicting sport outcomes by mere recognition explores a simple tournament winner predicting strategy based on how recognisable the names of competitors are. (I guess social media metrics might be a proxy for recognition? Hmm.. could test that I suppose with reference to this paper?) One thing that caught my eye were a couple of simple schemes for benchmarking different prediction models, which might be something I could pull on if I end up exploring prediction models?
- NASCAR results have featured in several papers (I think there’s also a NASCAR dataset available in R?) so I’ll probably try dipping in to them at some point to see if I can do similar things with F1 data. For example, an analysis of NASCAR Winston Cup Race Results for 1975-2003; a couple of papers on hierarchical modelling of auto-racing results; and some Bayesian inference stuff that I guess is really beyond me for now and that I really really could do with a pre-built R libraries for;
- an MSc thesis I’ve referred to before on Prediction of Formula One Race Results Using Driver Characteristics has some handy ideas that I might be able to draw on if I have a look at laptime data;
- One of the the things I’ve been pondering is ways of ranking drivers based on fast lap times (eg during qualifying, vs. during the race). Although not about motor sport, or any sort of racing, A New Method for Ranking Total Driving Performance on the PGA Tour has a metric I may be able to bastardise in a Formula One context. The same periodical also has an article on Do Reliable Predictors Exist for the Outcomes of NASCAR Races?, the techniques of which might be applicable to F1? Predicting The Outcome Of NASCAR Races: The Role Of Driver Experience looks to be in a similar vein too…
- A paper on Outcome Uncertainty in NASCAR looks at how attendance and TV audience figures are influenced by race expectations, which might be something that could also be explored in context of UK F1 TV audience figures. That said, the notion of “outcome uncertainty” itself, and related measures, might also be worth exploring in their own right?

If you know of any other relevant looking papers or articles, please post a link in the comments.

[MORE LINKS…

– Who is the Best Formula 1 Driver? An Econometric Analysis

]

I was hoping to finish this post with a couple of quick R hacks around some F1 datasets, but I’ve just noticed that today, as in yesterday, has become tomorrow, as in today, and this post is probably already long enough… So it’ll have to wait for another day…

PS FWIW, I also note the arrival of the Sports Analytics Innovation Summit in London in March… I doubt I have the impact required to make it as a media partner though… Although maybe OpenLearn does…?!

]]>