Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

C_t =  \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}

where C_t is the churn in team standings for year t, \left|f_{i,t} - f_{i,t-1}\right| is the absolute value of the i-th team’s change in finishing position going from season t-1 to season t, and N is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, C_t, by the maximum churn, C_max. The value of the maximum churn depends on whether there is an even or odd number of competitors:

C_{max} = N/2 \text{, for even N}

C_{max} = (N^2 - 1) / 2N \text{, for odd N}

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, f_{i,t} is the position of driver i at the end of race t, f_{i,t-1} is the starting position of driver i at the beginning of that race (that is, race t) and N is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn function of the form:

is.even = function(x) x %% 2 == 0
churnmax=function(N)
  if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

and then used it to explore churn in a variety of contexts:

  • comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
  • churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
  • churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid, 
                position, positionText, positionOrder
          FROM results rs JOIN drivers d JOIN races r
          ON rs.driverId=d.driverId AND rs.raceId=r.raceId
          WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
            churn = churn(delta),
            adjchurn = adjchurn(delta)
            )

For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.