F1Stats – A Prequel to Getting Started With Rank Correlations
I finally, finally made time to get started on my statistics learning journey with a look at some of the results in the paper A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing.
Note that these notes represent a description of the things I learned trying to draw on ideas contained within the paper and apply them to data I had available. There may be errors… if you spot any, please let me know via the comments.
The paper uses race classification data from the 2009 season, comparing F1 and NASCAR championships and claiming to explore the extent to which positions in practice and qualifying relate to race classification. I won’t be looking at the NASCAR data, but I did try to replicate the F1 stats which I’ll start to describe later on in this post. I’ll also try to build up a series of interactive apps around the analyses, maybe along with some more traditional print format style reports.
(There are so many things I want to learn about, from the stats themselves, to possible workflows for analysis and reporting, to interactive analysis tools that I’m not sure what order any of it will pan out into, or even the extent to which I should try to write separate posts about the separate learnings…)
As a source of data, I used my f1com megascraper that grabs classification data (along with sector times and fastest laps) since 2006. (The ergast API doesn’t have the practice results, though it does have race and qualifying results going back a long way, so it could be used to do a partial analysis over many more seasons). I downloaded the whole Scraperwiki SQLite database which I could then load into R and play with offline at my leisure.
The first result of note in the paper is a chart that claims to demonstrate the Spearman rank correlation between practice and race results, qualification and race results, and championship points and race results, for each race in the season. The caption to the corresponding NASCAR graphs explains the shaded region: “In general, the values in the grey area are not statistically significant and the values in the white area are statistically significant.” A few practical uses we might put the chart to come to mind (possibly!). Is qualifying position or p3 position a good indicator of race classification (that is, is the order folk finish at the end of p3 a good indicator of the rank order in which they’ll finish the race)? If the different rank orders are not correlated (people finish the race in a different order to the grid position), does this say anything about how exciting the race might have been? Does the “statistical significance” of the correlation value add anything meaningful?
So what is this chart trying to show and what, if anything, might it tell us of interest about the race weekend?
First up, the statistic that’s being plotted is Spearman’s rank correlation coefficient. There are four things to note there:
- it’s a coefficient: a coefficient is a single number that tends to come in two flavours (often both at the same time). In a mathematical equation, a coefficient is typically a constant number that is used as a multiplier of a variable. So for example, in the equation t = 2x, the x variable has the coefficient 2. Note that the coefficient may also be a parameter, as for example in the equation y = a.x (where the . means ‘multiply’, and we naturally read x as the independent variable that determines the value of y, the dependent variable, once it has been multiplied by the value of a). However, a coefficient may also be a particular number that characterises a particular relationship between two things. In this case, it characterises the degree of correlation between two things…
- it’s about correlation: the refrain “correlation is not causation” is one heard widely around the web that mocks the fact that just because two things may be correlated – that is, when one thing changes, another changes in a similar way – it doesn’t necessarily mean that the way one thing changed caused the other to change in a similar way as a result. (Of course, it might mean that…;-) (If you want to find purely incidental correlations between things, have a look at Google Correlate, which identifies different search terms whose search trends over time are similar. You can even draw your own trend over time to find terms that have trended in a similar way.)
Correlation, then, describes the extent to which two things tend to change over time in a similar way: when one goes up, the other goes up; when one goes down, the other goes down. (If they behave in opposite ways – if one goes up steeply the other goes down steeply; if one goes up gently, the other goes down gently – then they are negatively or anti-correlated).
Correlation measures require that you have paired data. You know you have paired data if you can plot your data as a two dimensional scatterplot and label each point with the name of the person or thing that was measured to get the two co-ordinate values. So for example, on a plot of qualifying position versus race classification, I can label each point with the name of the driver. The qualification position and race classification are paired data around the set of drivers.
- A rank correlation coefficient is used to describe the extent to which two rankings are correlated. Ranked orders do away with the extent to which different elements in a series differ from each other and just concentrate on their rank order. That is, we don’t care how much change there is in each of the data series, we’re just interested in rank positions within each series. In the case of an F1 race, the distribution of laptimes during qualifying may see the first few cars separated by a few thousandths of a second, but the time between the best laptimes of consecutively placed cars at the back of the grid might be of the order of tenths of a second. (Distributions take into account, for example, the variety and range of values in a dataset.) However, in a rank ordered chart, all we are interested in is the integer position: first, second, third, …, nineteenth, twentieth. There is no information about the magnitude of the actual time difference between the laptimes, that is, how close to or far apart from each other the laptimes of consecutively placed cars were; we just know the rank order of fastest to slowest cars. The distribution of the rank values is not really very interesting, or subject to change, at all.
One thing that I learned that’s possibly handy to know when decoding the jargon: rank based stats are also often referred to as non-parametric statistics because no assumptions are made about how the numbers are distributed (presumably, there are no parameters of note relating to how the values are distributed, such as the mean and standard deviation of a “normal” distribution). If we think about the evolution of laptimes in a race, then most of them will be within a few tenths of the fastest lap each lap, with perhaps two bunches of slower lap times (in-lap and out-lap around a pitstop). The distribution of these lap times may be interesting (for example, the distribution of laptimes on a lap when everyone is racing will be different to the distribution of lap times on a lap when several cars are pitting). On the other hand, for each lap, the distribution of the rank order of laptimes during that lap will always be the same (first fastest, second fastest, third fastest, etc.). That is not to say, however, that the rank order of the drivers’ lap times does not change lap on lap, which of course it might do (Webber might be tenth fastest on lap 1, fastest on lap 2, eighth fastest on lap 3, and so on).
Of course, this being stats, “non-parametric” probably means lots of other things as well, but for now my rule of thumb will be: the distribution doesn’t matter. That is, the statistic does not make any assumptions about the distribution of the data in order for the statistic to work; which is to say, that’s one thing you don’t have to check in order to use the statistic (erm, I think…?!).
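To convince myself that ranking really does throw away the size of the gaps, here is a minimal Python sketch. It is not taken from the paper: the lap times are made up, `rank_positions` is a hypothetical helper name, and I am assuming there are no tied times.

```python
def rank_positions(times):
    """Return the 1-based rank of each time (1 = fastest), assuming no ties."""
    # Indices of the times, ordered from fastest to slowest.
    order = sorted(range(len(times)), key=lambda i: times[i])
    ranks = [0] * len(times)
    for position, index in enumerate(order, start=1):
        ranks[index] = position
    return ranks


# Made-up lap times in seconds (illustrative only, not real data).
tight_field = [80.101, 80.104, 80.109, 80.350, 80.900]   # thousandths apart at the front
spread_field = [80.1, 81.5, 83.0, 90.2, 99.9]            # seconds apart throughout

tight_ranks = rank_positions(tight_field)
spread_ranks = rank_positions(spread_field)

print(tight_ranks)
print(spread_ranks)
```

Both fields produce the same list of ranks, even though the spread of the underlying times is completely different, which is the sense in which the distribution of the original values stops mattering once we only look at rank order.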
- The statistic chosen was Spearman’s rank correlation coefficient. Three differently calculated correlation coefficients appear to be widely used (and also appear as possible methods in the R cor() function that calculates correlations between lists of numbers): i) Pearson’s product moment correlation coefficient (how well does a straight line through an x-y scatterplot of the data describe the relationship between the x and y values, and what’s the sign of its gradient); ii) Spearman’s rank correlation coefficient (also known as Spearman’s rho or rs) [this interactive is rather nice and shows how Pearson and Spearman correlations can differ]; iii) Kendall’s τ (that is, Kendall’s Tau; this coefficient is based on concordance, which describes whether the sign of the difference in rank between a pair of items in one data series is the same as the sign of the difference in rank between the corresponding pair in the other data series). Other flavours of correlation coefficient are also available (for example, Lin’s concordance correlation coefficient, as demonstrated in this example of identifying a signature collapse in a political party’s electoral vote when the Pearson coefficient suggested it had held up, which I think is used to test for how close to a 45 degree line the x-y association between paired data points is…).
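To make the three coefficients a bit more concrete, here is a small Python sketch of how each could be calculated by hand. It is my own illustration rather than anything from the paper: the grid and finishing orders are invented, the function names are hypothetical, and the Spearman and Kendall versions assume no tied ranks.

```python
from itertools import combinations
from math import sqrt


def pearson(xs, ys):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def spearman_no_ties(rank_a, rank_b):
    """Spearman's rho from rank differences: 1 - 6*sum(d^2) / (n*(n^2 - 1))."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def kendall_tau_no_ties(rank_a, rank_b):
    """Kendall's tau: (concordant - discordant) pairs over all pairs."""
    concordant = discordant = 0
    for i, j in combinations(range(len(rank_a)), 2):
        # Same sign of rank difference in both series means a concordant pair.
        product = (rank_a[i] - rank_a[j]) * (rank_b[i] - rank_b[j])
        if product > 0:
            concordant += 1
        elif product < 0:
            discordant += 1
    return (concordant - discordant) / (concordant + discordant)


# Invented six-car example: grid position and finishing position per driver.
grid = [1, 2, 3, 4, 5, 6]
finish = [2, 1, 3, 6, 4, 5]

rho = spearman_no_ties(grid, finish)
rho_via_pearson = pearson(grid, finish)  # Pearson applied to the ranks themselves
tau = kendall_tau_no_ties(grid, finish)

print(round(rho, 4), round(rho_via_pearson, 4), round(tau, 4))
```

With untied ranks, Spearman’s rho is just Pearson’s coefficient applied to the rank values, which is why the first two numbers agree. Kendall’s tau counts pairs rather than squared differences, so it generally gives a different value for the same data.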
The statistical significance test is based around the “null hypothesis” that the two sets of results are not correlated; the result is significant if they are more correlated than you might expect if both are randomly ordered. This made me a little twitchy: wouldn’t it be equally valid to argue that F1 is a procession and we would expect the race position and grid position to be perfectly correlated, for example, and then define our test for significance on the extent to which they are not?
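One way I can picture the “randomly ordered” null hypothesis is as a shuffle test: keep one ranking fixed, shuffle the other lots of times, and see how often the shuffled correlation is at least as strong as the one observed. The Python sketch below is my own illustration of that idea, not the paper’s method. The two finishing orders are invented, the function names are hypothetical, and no ties are assumed.

```python
import random


def spearman_no_ties(rank_a, rank_b):
    """Spearman's rho for two untied rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def shuffle_p_value(rank_a, rank_b, trials=5000, seed=1):
    """Estimate how often a random ordering correlates at least as strongly.

    Two-sided: compares the absolute value of rho, so strong negative
    correlations count as extreme too.
    """
    rng = random.Random(seed)  # fixed seed so the sketch is repeatable
    observed = abs(spearman_no_ties(rank_a, rank_b))
    shuffled = list(rank_b)
    hits = 0
    for _ in range(trials):
        rng.shuffle(shuffled)
        if abs(spearman_no_ties(rank_a, shuffled)) >= observed:
            hits += 1
    return hits / trials


grid = list(range(1, 21))

# Invented "procession": only a few neighbouring cars swap places.
procession = [2, 1, 3, 4, 6, 5, 7, 8, 9, 10, 11, 13, 12, 14, 15, 16, 17, 18, 20, 19]
# Invented "lottery": finishing order bears little relation to the grid.
lottery = [11, 3, 18, 7, 1, 15, 9, 20, 5, 13, 2, 17, 8, 12, 19, 4, 16, 6, 14, 10]

p_procession = shuffle_p_value(grid, procession)
p_lottery = shuffle_p_value(grid, lottery)

print(p_procession, p_lottery)
```

The procession gets a p-value close to zero (random shuffles almost never match it), while the lottery gets a large one. Note that this only ever tests against “no correlation”; it says nothing about departures from a perfect procession, which is the alternative framing I was twitchy about.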
This comparison of Pearson’s product moment and Spearman’s rank correlation coefficients helped me get a slightly clearer view of the notion of “test” and how both these coefficients act as tests for particular sorts of relationship. The Pearson product moment coefficient has a high value if a strong linear relationship holds across the data pairs. The Spearman rank correlation is weaker, in that it is simply looking to see whether or not the relationship is monotonic (that is, things all go up together, or they all go down together, but the extent to which they do so need not define a linear relationship, which is an assumption of the Pearson test). In addition, the statistical significance of the test depends on particular assumptions about the distribution of the data values, at least in the case of the Pearson test. The statistical significance relates to how likely the correlation value was assuming a normal distribution in the values within the paired data series (that is, each series is assumed to represent a normally distributed set of values).
If I understand this right, it means we separate the two things out: on the one hand, we have the statistic (the correlation coefficient); on the other, we have the significance test, which tells you how likely that result is given a set of assumptions about how the data is distributed. A question that then comes to mind is this: is the definition of the statistic dependent on a particular distribution of the data in order for the statistic to have something interesting to say, or is it just the significance that relies on that distribution? To twist this slightly, if we can’t do a significance test, is the statistic then essentially meaningless (because we don’t know whether those values are likely to occur whatever the relationships, or even no relationship, between the data sets)? Hmm.. maybe a statistic is actually a measurement in the context of its significance given some sort of assumption about how likely it is to occur by chance?
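The linear versus monotonic distinction is easy to see with a toy example. In the Python sketch below (my own illustration, with made-up data and hypothetical function names, assuming no ties), y always goes up when x goes up, but along a curve rather than a straight line.

```python
from math import sqrt


def pearson(xs, ys):
    """Pearson product moment correlation of two equal-length lists."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def ranks(values):
    """1-based ranks (1 = smallest value), assuming no ties."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    result = [0] * len(values)
    for position, index in enumerate(order, start=1):
        result[index] = position
    return result


# A perfectly monotonic but clearly non-linear relationship: y = x cubed.
xs = list(range(1, 11))
ys = [x ** 3 for x in xs]

pearson_r = pearson(xs, ys)
spearman_rho = pearson(ranks(xs), ranks(ys))  # Spearman = Pearson on the ranks

print(round(pearson_r, 3), round(spearman_rho, 3))
```

Spearman’s rho comes out as 1 because the rank orders agree exactly, while Pearson’s coefficient is noticeably below 1 because a straight line does not describe the curve perfectly.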
As far as Spearman’s rank correlation coefficient goes, I was a little bit confused by the greyed “not significant” boundary on the diagram shown above. The claim is that any correlation value in that grey area could be accounted for in many cases by random sorting. Take two sets of driver numbers, sort them both randomly, and much of the time the correlation value will fall in that region. (I suspect this is not only arbitrary but misleading? If you have random orderings, is the likelihood that the correlation is in the range -0.1 to 0.1 the same as the likelihood that it will be in the range 0.2 to 0.4? Is the probability distribution of correlations “uniform” across the +/- 1 range?) Also, my hazy vague memory is that the population size affects the confidence interval (see also Explorations in statistics: confidence intervals) – isn’t this the principle on which funnel plots are built? The caption to the figure suggests that the population size (the “degrees of freedom”) was different for different races (different numbers of drivers). So why isn’t the shaded interval differently sized for those races?
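Both of those questions can be poked at with a quick simulation. The Python sketch below is my own experiment, not something from the paper: it randomly shuffles rankings of 10, 20 and 24 made-up “drivers” (sizes I picked for illustration), collects the resulting Spearman values, and then looks at where they fall. The function names are hypothetical and the cutoff is a simple empirical 95% point, assuming a two-sided test and no ties.

```python
import random


def spearman_no_ties(rank_a, rank_b):
    """Spearman's rho for two untied rankings of the same items."""
    n = len(rank_a)
    d_squared = sum((a - b) ** 2 for a, b in zip(rank_a, rank_b))
    return 1 - 6 * d_squared / (n * (n * n - 1))


def null_rhos(n, trials, rng):
    """Spearman values obtained by correlating 1..n with random shuffles of 1..n."""
    base = list(range(1, n + 1))
    shuffled = base[:]
    values = []
    for _ in range(trials):
        rng.shuffle(shuffled)
        values.append(spearman_no_ties(base, shuffled))
    return values


def share_between(values, low, high):
    """Fraction of values falling in the closed interval [low, high]."""
    return sum(low <= v <= high for v in values) / len(values)


def two_sided_cutoff(values, alpha=0.05):
    """Empirical |rho| exceeded by roughly a fraction alpha of random orderings."""
    magnitudes = sorted(abs(v) for v in values)
    return magnitudes[int((1 - alpha) * len(magnitudes))]


rng = random.Random(2013)  # fixed seed so the sketch is repeatable
rhos_10 = null_rhos(10, 20000, rng)
rhos_20 = null_rhos(20, 20000, rng)
rhos_24 = null_rhos(24, 20000, rng)

# Two intervals of equal width: are they equally likely under random ordering?
near_zero = share_between(rhos_20, -0.1, 0.1)
further_out = share_between(rhos_20, 0.2, 0.4)

cutoff_10 = two_sided_cutoff(rhos_10)
cutoff_20 = two_sided_cutoff(rhos_20)
cutoff_24 = two_sided_cutoff(rhos_24)

print(near_zero, further_out)
print(cutoff_10, cutoff_20, cutoff_24)
```

In this simulation the random correlations bunch up around zero rather than being spread uniformly, and the 95% cutoff gets narrower as the number of drivers goes up. So, on this evidence at least, a shaded band of fixed width across races with different numbers of classified drivers can only be an approximation.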
Something else confused me about the interval values used to denote the significance of the Spearman rho values – where do they come from? A little digging suggested that they come from a table (i.e. someone worked them out numerically, presumably by generating and looking at the distribution of different random rank orderings, rather than algorithmically). I couldn’t find a formula to calculate them, though I did find this on Sample Size Requirements for Estimating Pearson, Kendall and Spearman Correlations by D Bonett (Psychometrika Vol. 65, No. 1, 23-28, March 2000). A little more digging suggested Significance Testing of the Spearman Rank Correlation Coefficient by J Zar (Journal of the American Statistical Association, Vol. 67, No. 339 (Sep., 1972), pp. 578-580) as a key work on this, with some later qualification in Testing the Significance of Kendall’s τ and Spearman’s rs by M. Nijsse (Psychological Bulletin, 1988, Vol. 103, No. 2, 235-237).
Hmmm.. this post was supposed to be about running some of the stats used in A Tale of Two Motorsports: A Graphical-Statistical Analysis of How Practice, Qualifying, and Past Success Relate to Finish Position in NASCAR and Formula One Racing on some more recent data. But I’m well over a couple of thousand words into this post and still not started that bit… So maybe I’ll finish now, and hold the actual number crunching over to the next post…
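For what it’s worth, there is a standard large-sample shortcut that gives a rough feel for where such cutoffs sit. Under the “random ordering” null hypothesis with no ties, Spearman’s rho has mean 0 and variance 1/(n - 1), so a normal approximation puts the two-sided cutoff at about z / sqrt(n - 1). The Python sketch below uses that approximation. It is my assumption that this is good enough for a rough band; it is not the method used in the papers cited above, whose tables for small n are worked out more carefully, and the function name is hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def approx_spearman_cutoff(n, alpha=0.05):
    """Approximate two-sided critical |rho| for n untied pairs.

    Large-sample normal approximation: under random ordering, rho has
    mean 0 and variance 1 / (n - 1). Rough for small n.
    """
    z = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    return z / sqrt(n - 1)


# Illustrative field sizes (my choice, not taken from the paper).
cutoffs = {n: approx_spearman_cutoff(n) for n in (10, 15, 20, 24)}

for n, cutoff in cutoffs.items():
    print(n, round(cutoff, 3))
```

The approximate cutoff shrinks as the number of classified drivers grows, which is the same funnel-like behaviour I half remembered about sample size and confidence intervals.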
PS I find myself: happier that I (think I) understand a little bit more about the rationale of significance tests; just as sceptical as ever I was about the actual practice;-)