## Detecting Features in Data Using Symbolic Coding and Regular Expression Pattern Matching

One of the reasons I dive into motorsport results and timing data every so often is that it gives me a quite limited set of data to play with. In turn, this means I have to get creative when it comes to reshaping the data to see what visuals I can pull out of it, as well as generating derived datasets to see what other story forms and insights might be hidden in there.

One of the things I hope to do with the WRC data is push a bit more on automatically generating text-based race reports from the data. Part of the trick here is spotting patterns that can be mapped onto textual tropes, common sorts of phrase or sentence that you are likely to see in the more vanilla forms of sports reporting. (“X led the race from the start”, “Despite a poor start to the stage, Y went on to win it, N seconds ahead of Z in second place” and so on.)

So how can we spot the patterns? One way is to write a SQL query that detects a particular pattern in the data and uses that to flag a possible event (for example, Detecting Undercuts in F1 Races Using R). Another might be to cast the data as a graph and then detect features using graph based algorithms (eg Identifying Position Change Groupings in Rank Ordered Lists).

During the middle of last night, I woke up wondering whether or not it would be possible to cast simple feature components as symbols and then use a regular expression pattern matcher to identify a particular sort of pattern from a symbolic string. So here’s a quick proof of concept…

From the WRC Monte Carlo 2017 rally, stage 3, some split times and rank positions at each split.

Here’s a visual representation of the same (the number labels are rank position at each split; the y-axis is the delta to the fastest time recorded over that split, the “sector time”, if you will, derived from the original results data).

For each driver, you may be able to spot several shapes. For example, Ogier is way behind at the first split, but then gains over the rest of the stage, Meeke and Breen lose time at the second split, Hanninen loses it on the final part of the stage, and so on. Can we code for these different patterns, and then detect them?
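As a minimal sketch of the idea, the following encodes each split-to-split change in the gap to the fastest time as a single symbol and then runs regular expressions over the resulting strings. The gap values, the driver labels, the tolerance and the G/L/S symbol set are all made up for illustration; they are not taken from the rally data.

```r
# Hypothetical gaps (seconds) to the fastest time at each of four splits
gaps = list(
  driverA = c(5.1, 1.2, 0.4, 0.0),
  driverB = c(0.3, 2.8, 0.5, 0.5)
)

# Encode each split-to-split change as a symbol:
# G = gained (gap shrank), L = lost (gap grew), S = roughly the same
encode = function(x, tol = 0.1) {
  d = diff(x)
  paste(ifelse(d < -tol, 'G', ifelse(d > tol, 'L', 'S')), collapse = '')
}
codes = sapply(gaps, encode)
codes  # driverA: "GGG", driverB: "LGS"

# Pattern: gained time over every part of the stage
gained_throughout = grepl('^G+$', codes)

# Pattern: lost time early on, then recovered at least some of it
recovered = grepl('^L+G', codes)
```

Each pattern could then be mapped onto a stock phrase for the report. A richer alphabet (for example, separate symbols for big and small gains) would need nothing more than extra branches in `encode`.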

So that seems to work okay… Now all I need to do is come up with some suitable symbolic encodings and pattern matching strings…

Hmmm… Vague memories… I wonder if there are any symbolic dynamics algorithms or finite state machine grammar parsers I could make use of?

## Tata F1 Connectivity Innovation Prize, 2015 – Mood Board Notes

It’s probably the wrong way round to do this – I’ve already done an original quick sketch, after all – but I thought I’d put together a mood board collection of images and design ideas relating to the 2015 Tata F1 Connectivity Innovation Prize to see what else is current in the world of motorsport telemetry display design just in case I do get round to entering the competition and need to bulk up the entry a bit with additional qualification…

First up, some imagery from last year’s challenge brief – an F1 timing screen; note the black background and the use of a particular colour palette:

In the following images, click through on the image to see the original link.

How about some context – what sort of setting might the displays be used in?

From a flatvision case study of the Lotus F1 pit wall, basic requirements include:

• sunlight readable displays to be integrated within a mobile pit wall;
• a display bright enough to be viewed in all light conditions.

The solution included ‘9 x 24” Transflective sunlight readable monitors, featuring a 1920×1200 resolution’. So that gives some idea of the real estate available per screen.

So how about some example displays…

The following seems to come from a telemetry dash product:

There’s a lot of text on that display, and what also looks like timing screen info about other cars. The rev counter uses bar segments that increase in size (something I’ve seen on a lot of other car dashboards). The numbers are big and bold, with units identifying what the value relates to.

The following chart (the engineer’s view from something called The Pitwall Project) provides an indication of tyre status in the left hand column, with steering and pedal indicators (i.e. driver actions) in the right hand column.

Here’s a view (from an unknown source) that also displays separate tyre data:

Another take on displaying the wheel layout and a partial gauge view in the right hand column:

Or how about relating tyre indicator values even more closely to the host vehicle?

This Race Technology Monitor screen uses a segmented bar for the majority of numerical displays. These display types give a quantised display, compared to the continuously varying numerical indicator. The display also shows historical traces, presumably of the corresponding quantity?

The following dashes show a dial rich view compared to a more numerical display:

The following sample dash seems to be a recreation for a simulation game? Note the spectrum coloured bar that has a full range outline, and the use of colour in the block colour background indicators. Note also the combined bar and label views (the label is at the mid-point of the bar, which means it may have to straddle two differently coloured backgrounds).

The following Sim Racing Data Analysis display uses markers on the bars to identify notable values – max and min, perhaps? Or max and optimal?

It also seems like there are a few mobile apps out there doing dashboard style displays – this one looks quite clean to me and demonstrates a range of colour and border styles:

Here’s another app – and a dial heavy display style:

Finally, some views on how to capture the history of the time series. The first one is health monitoring data – as you’d expect from health-monitoring related displays, it’s really clean looking:

I’m guessing the time-based trace goes left to right, but for our purposes, streaming the history from right to left, with the numerical indicator essentially bleeding into the line chart display, could work?

This view shows a crude way of putting several scales onto one line chart area:

This curiosity is from one of my earlier experiments – “driver DNA”. For each of the four bands, lap count is on the vertical axis, distance round lap on the horizontal axis. The colour is the indicator value. The advantage of this is that you see patterns lap on lap, but the resolution of the most current value in a realtime trace might be hard to see?

Some time ago, The Still design agency posted a concept pitwall for McLaren Mercedes. The images are still lurking in the Google Image search cache, and include example widgets:

and an example tyre health monitor display:

To my eye, those numbers are too far apart (the display is too wide and likely occluded by the animated line charts), and the semantics of the streaming are unclear: if the stream flows from the number, will new numbers come in at the left for the left hand tyres and from the right for the right hand ones?

And finally, an example of a typical post hoc race data capture analysis display/replay.

Where do you start to look?!

PS in terms of implementation, a widget display seems sensible. Something like freeboard looks like it could provide a handy prototyping tool, or something like the RShiny dashboard backed up by JSON streaming support from jsonlite and HTML widgets wrapped by htmlwidgets.

## Keeping Track of an Evolving “Top N” Cutoff Threshold Value

In a previous post (Charts are for Reading), I noted how it was difficult to keep track of which times in an F1 qualifying session had made the cutoff time as a qualifying session evolved. The problem can be stated as follows: in the first session, with 20 drivers competing, the 15 drivers with the best ranked laptime will make it into the next session. Each driver can complete zero or more timed laps, with drivers completing laps in any order.

Finding the 15 drivers who will make the cutoff is therefore not simply a matter of ranking the best 15 laptimes at any point, because the same 5 drivers, say, may each record 3 fast laptimes, thus taking up the 15 slots that record the 15 fastest laptimes.
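To make the distinction concrete, here is a small sketch with made-up laptimes and a toy cut-off of three drivers rather than fifteen. The column names `code` and `stime` follow the ones used later in this post; the driver codes and times are hypothetical.

```r
# Hypothetical laptimes (seconds): driver AAA sets three quick laps
laps = data.frame(code  = c('AAA', 'AAA', 'AAA', 'BBB', 'CCC', 'DDD'),
                  stime = c(90.1, 90.2, 90.3, 91.0, 91.5, 92.0),
                  stringsAsFactors = FALSE)
n = 3  # toy cut-off: the top 3 drivers go through

# Naive approach: take the n fastest laptimes, whoever set them
naive = head(laps[order(laps$stime), ], n)
naive_drivers = unique(naive$code)  # just "AAA"

# Per-driver approach: best laptime for each driver, then rank the drivers
best = aggregate(stime ~ code, data = laps, FUN = min)
best = best[order(best$stime), ]
through = head(best$code, n)  # "AAA" "BBB" "CCC"
cutoff = best$stime[n]        # 91.5
```

The naive ranking fills all three slots with the same driver, whereas ranking each driver's best time gives three distinct drivers and a cut-off time set by the third of them.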

If we define a discrete time series with steps corresponding to each recorded laptime (from any driver), then at each time step we can find the best 15 drivers by finding each driver’s best laptime to date and ranking by those times. Conceptually, we need something like a lap chart which uses a ‘timed lap count’ rather than race lap index to keep track of the top 15 cars at any point.

At each index step, we can then find the laptime of the 15th ranked car to give the current session cut-off time.

In a dataframe that records laptimes in a session by driver code for each driver, along with a column that contains the current purple laptime, we can arrange the laptimes by cumulative session laptime (so the order of rows follows the order in which laptimes are recorded) and then iterate through those rows one at a time. At each step, we can summarise the best laptime recorded so far in the session for each driver.

library(plyr)

#Order the rows by the time at which each laptime was recorded
df=arrange(df,cuml)
dfc=data.frame()
for (r in 1:nrow(df)) {
  #summarise best laptime recorded so far in the session for each driver
  dfcc=ddply(df[1:r,],.(qsession,code),summarise,dbest=min(stime))
  #Keep track of which session we are in
  session=df[r,]$qsession
  #Rank the best laptimes for each driver to date in the current session
  #(Really should filter by session at the start of this loop?)
  dfcc=arrange(dfcc[dfcc['qsession']==session,],dbest)
  #The different sessions have different cutoffs: Q1, top 15; Q2, top 10
  #(cutoffvals is assumed to be defined elsewhere, indexed by session)
  n=cutoffvals[df[r,]$qsession]
  #if we have at least as many driver best times recorded as the cutoff number
  if (nrow(dfcc) >=n){
    #Grab a record of the current cut-off time
    #along with info about each recorded laptime
    dfc=rbind(dfc,data.frame(df[r,]['qsession'],df[r,]['code'],df[r,]['cuml'],dfcc[n,]['dbest']) )
  }
}

We can then plot the evolution of the cut-off time as the sessions proceed. The chart in its current form is still a bit hard to parse, but it’s a start…

In the above sketch, the lines connect the current purple time and the current cut-off time in each session (aside from the horizontal line which represents the cut-off time at the end of the session). This gives a false impression of the evolution of the cutoff time – really, the line should be a stepped line that traces the current cut-off time horizontally until it is improved, at which point it should step vertically down. (In actual fact, the line does track horizontally as laptimes are recorded that do not change the cutoff time, as indicated by the horizontal tracks in the Q1 panel as the grey times (laptime slower than driver’s best time in session so far) are completed.)

The driver labels are coloured according to: purple – current best in session time; green – driver best in session time to date (that wasn’t also purple); red – driver’s best time in session that was outside the final cut-off time. This colouring conflates two approaches to representing information – the purple/green colours represent online algorithmic processing (if we constructed the chart in real time from laptime data as laps were completed, that’s how we’d colour the points), whereas the red colouring represents the results of offline algorithmic processing (the colour can only be calculated at the end of the session when we know the final session cutoff time). I think these mixed semantics contribute to making the chart difficult to read…

In terms of what sort of stories we might be able to pull from the chart, we see that in Q2, Hulkenberg and Sainz were only fractions of a second apart, and Perez looked like he had squeezed in to the final top 10 slot until Sainz pushed him out. To make it easier to see which times contribute to the top 10 times, we could use font weight (eg bold font) to highlight each driver’s session best laptime.

To make the chart easier to read, I think each time improvement to the cutoff time should be represented by a faint horizontal line, with a slightly darker line tracing the evolution of the cutoff time as a stepped line. This would allow us to see which times were within the cutoff time at any point.

I also wonder whether it might be interesting to generate a table a bit like the race lap chart, using session timed lap count rather than race lap count, perhaps with additional colour fields to show which car recorded the time that increased the lap count index, and perhaps also where in the order that time fell if it didn’t change the order in the session so far. We could also generate online and offline differences between each laptime in the session and the current cutoff time (online algorithm) as well as the final overall session cutoff time (offline algorithm).
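The online and offline differences might be sketched along the following lines. This is a base R toy under stated assumptions: the laptimes are invented, a single session is assumed, the cut-off is the top 2 drivers rather than the top 15, and the column names `cutoff`, `online` and `offline` are just labels picked for the example.

```r
# Hypothetical session: laptimes (seconds) in the order they were recorded
laps = data.frame(code  = c('AAA', 'BBB', 'CCC', 'AAA', 'CCC'),
                  stime = c(95.0, 94.0, 96.0, 93.5, 93.8),
                  stringsAsFactors = FALSE)
n = 2  # toy cut-off: the top 2 drivers go through

# Cut-off after each recorded lap: the nth best of the drivers' best times so far
# (NA until at least n drivers have set a time)
cutoff_at = function(r) {
  best = sort(as.vector(tapply(laps$stime[1:r], laps$code[1:r], min)))
  if (length(best) >= n) best[n] else NA
}
laps$cutoff = sapply(seq_len(nrow(laps)), cutoff_at)

# Online: gap to the cut-off as it stood when the lap was recorded
laps$online = laps$stime - laps$cutoff
# Offline: gap to the final cut-off, only known once the session is over
laps$offline = laps$stime - laps$cutoff[nrow(laps)]
```

Negative values are laps inside the cut-off and positive values are laps outside it. Because the `cutoff` column can only stay level or fall as laps are added, it is also the series a stepped cut-off line would trace.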

[As and when I get this chart sorted, it will appear in an update to the Wrangling F1 Data With R lean book.]

## Charts are for Reading…

If charts are pictures, and every picture not only tells a story, but also saves a thousand words in doing so, how then are we to actually read them?

Take the following example, a quick #f1datajunkie sketch showing how the Bahrain 2015 qualifying session progressed. The chart is split into three, one for each part of qualifying (which we might refer to as fractional sessions), which already starts to set the scene for the story. The horizontal x-axis is the time in seconds into qualifying at which each laptime is recorded, indexed against the first laptime recorded in qualifying overall. The vertical y-axis records laptimes in seconds, limited to 107% of the fastest laptime recorded in a particular session. The green colour denotes a driver’s fastest laptime recorded so far in each fractional session, purple the overall fastest laptime recorded so far in a fractional session (purple trumps green). So again, the chart is starting to paint a picture.
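The colouring and the 107% limit can be expressed in a few lines of base R. What follows is a sketch on invented laptimes for a single fractional session, not the code used to draw the chart; the `colour` labels and the `grey` fallback for laps that improve nothing are assumptions of the example.

```r
# Hypothetical laptimes (seconds), already in the order they were recorded
laps = data.frame(code  = c('AAA', 'BBB', 'AAA', 'CCC', 'BBB', 'AAA', 'CCC'),
                  stime = c(96.0, 95.5, 95.8, 97.0, 95.2, 95.9, 120.0),
                  stringsAsFactors = FALSE)

# Running best for each driver, and running best overall
laps$dbest = ave(laps$stime, laps$code, FUN = cummin)
laps$sbest = cummin(laps$stime)

# Purple trumps green; anything that improves nothing is grey
laps$colour = ifelse(laps$stime == laps$sbest, 'purple',
              ifelse(laps$stime == laps$dbest, 'green', 'grey'))

# Only plot laps within 107% of the fastest laptime in the session
shown = laps[laps$stime <= 1.07 * min(laps$stime), ]
```

Note that a driver's first timed lap always counts as green (or purple) under this scheme, and that a lap exactly equalling an earlier best would also be flagged, which may or may not be what is wanted.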

An example of the sort of analysis that can be provided for a qualifying session can be found in a post by Justin Hynes, Lewis Hamilton seals his first Bahrain pole but Vettel poses the menace to Mercedes’ hopes, that appeared on the James Allen on F1 blog. In this post, I’ll try to match elements of that analysis with things we can directly see in the chart above…

[Hamilton] finish[ed] 0.411s clear of Ferrari’s Sebastian Vettel and more than half a second in front of his Mercedes team-mate Nico Rosberg

We don’t get the time gap exactly from the chart, but looking to the rightmost panel (Q3), finding the lowest vertical marks for HAM, VET and ROS, and imagining a horizontal line across to the y-axis, we get a feeling for the relative gaps.

Q1 got underway in slightly calmer conditions than blustery FP3 and Raikkonen was the first to take to the track, with Bottas joining the fray soon after. The Williams driver quickly took P1 but was then eclipsed by Rosberg, who set a time of 1:35.657 on the medium tyres.

Q1 is the leftmost panel, in which we see RAI setting the first representative laptime at least (within the 107% limit of the session best overall), followed by BOT and then ROS improving on the early purple times.

The Mercedes man was soon joined in the top five by soft-tyre runners Nico Hulkenberg and Felipe Nasr.

HUL and NAS appear around the 300 cuml (cumulative laptime) mark. We note that PER is there in the mix too, but is not mentioned explicitly in the report.

In the closing stages of the session those in the danger zone were Max Verstappen, Pastor Maldonado and Will Stevens and Roberto Merhi.

On the right hand side of the chart, we see laps at the end of the session from MAL and VES (and way off the pace, STE). One problem with the chart as styled above (showing cumulative best times in the session) is that it makes it hard to see what a driver’s best session time overall actually is. (We could address this by perhaps displaying a driver’s session best time using a bold font.) The chart is also very cluttered around the cutoff time which makes it hard to see clearly who got through and who didn’t. And we don’t really know where the danger zone is because we have no clear indication of what the best 15 drivers’ times are – and hence, where the evolving cut-off time is…

Verstappen found the required pace and scraped into Q2 with a time of 1:35.611. Maldonado, however, failed to make it through, his best lap of 1:35.677 only being good enough for P16.

Verstappen’s leap to safety also pushed out Daniil Kvyat, with the Russian putting in a disappointing final lap that netted him P17 behind the Lotus driver. Hulkenberg was the last man through to Q2, the Force India driver’s 1:35.653 seeing him safely through with just two hundredths of a second in hand over Maldonado…

With an evolution of the cutoff time, and a zoom around the final cutoff time, we should be able to see what went on rather more clearly.

At the top of the order, Hamilton was quickest, finishing a tenth in front of Bottas. Rosberg was third, though he finished the session close on half a second down on his team-mate.

Felipe Massa was fourth for Williams, ahead of Raikkonen, Red Bull’s Daniel Ricciardo and Sebastian Vettel, who completed just three laps in the opening session. All drivers set their best times on the soft tyre.

This information can be quite clearly seen on the chart – aside from the tyre data which is not made available by the FIA.

The following description of Q2 provides quite a straightforward reading of the second panel of the chart.

In the second session, Rosberg initially set the pace but Hamilton quickly worked his way back to the top of the order, his first run netting a time of 1:32.669. Rosberg was also again eclipsed by Massa who set a time three tenths of a second quicker than Rosberg’s.

The last to set an opening time were the Ferraris of Raikkonen and Vettel, though both rapidly staked a claim on a Q3 berth with the Finn in P2 and the German in P4.

Most of the front runners opted to rely on their first run to see them through and in the closing stages those in the drop zone were Hulkenberg, Force India team-mate Sergio Perez, Nasr, Sauber team-mate Ericsson and McLaren’s Fernando Alonso.

However, the chart does not clearly show how ROS’ early purple time was challenged by BOT, or how MAS’ early pace was challenged mid-way through the session by VET and RAI.

Hulkenberg was the man to make the big move, claiming ninth place in Q2 with a time of 1:34.613. Behind him Toro Rosso’s Carlos Sainz scraped through in P10, six hundredths of a second clear of 11th-placed Sergio Perez. The Mexican was followed by Nasr and Ericsson. Alonso claimed P14, while 15th place went to the unfortunate Verstappen, who early in the session had reported that he was down on power.

Again, this reading of the chart would be aided by an evolving cut-off time line.

Looking now to the third panel…

The first runs in Q3 saw Hamilton in charge again, with the champion setting a time of 1:33.552 on used softs to take P1 three tenths of a second ahead of Red Bull’s Ricciardo, who prior to Hamilton’s lap had claimed the fastest S3 time of the session using new soft tyres.

Rosberg, also on used softs, was third, four thousandths of a second down on the Australian’s time. Hulkenberg, with just one new set of softs at his disposal, opted to sit out the first run.

The chart clearly shows the early and late session runs, which are also reflected in the analysis:

In the final runs, Vettel was the first of the likely front-row men across the line and with purple times in S1 and S2, the German set a provisional pole time of 1:32.982. It was a superb lap but Hamilton was already running faster, stealing the S1 purple time from the German.

Ahead of the champion on track, Rosberg had similarly taken the best S2 time but he could not find more pace and when he crossed the line he slotted into third, four hundredths [??] of a second behind Vettel.

So what does Justin Hynes’ qualifying session commentary tell us about how we might be able to read the charted summary of the session? And how can we improve the chart to help draw out some of the stories? A couple of things jump out for me – firstly, the evolving purple and green times can be confusing, and are perhaps better replaced (for a summary reading of the session) by best in session purple/green times; secondly, the evolution of the cut-off times would help to work out where drivers were placed at different stages of qualifying and what they still had to do – or whether a best-time-so-far recorded by a driver earlier in the session was bumped by the cutoff evolution. Note that the purple time evolution is identified implicitly by the lower envelope of the laptimes in each session.

## Tools in Tandem – SQL and ggplot. But is it Really R?

Increasingly I find that I have fallen into using not-really-R whilst playing around with Formula One stats data. Instead, I seem to be using a hybrid of SQL to get data out of a small SQLite3 database and into an R dataframe, and then ggplot2 to visualise it.

So for example, I’ve recently been dabbling with laptime data from the ergast database, using it as the basis for counts of how many laps have been led by a particular driver. The recipe typically goes something like this – set up a database connection, and run a query:

#Set up a connection to a local copy of the ergast database
library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Run a query
q='SELECT code, grid, year, COUNT(l.lap) AS Laps
FROM (SELECT grid, raceId, driverId from results) rg,
lapTimes l, races r, drivers d
WHERE rg.raceId=l.raceId AND d.driverId=l.driverId
AND rg.driverId=l.driverId AND l.position=1 AND r.raceId=l.raceId
GROUP BY grid, driverRef, year
ORDER BY year'

driverlapsledfromgridposition=dbGetQuery(ergastdb,q)


In this case, the data is a table that shows for each year a count of laps led by each driver given their grid position in corresponding races (null values are not reported). The data grabbed from the database is passed into a dataframe in a relatively tidy format, from which we can easily generate a visualisation.

The chart I have opted for is a text plot faceted by year:

The count of lead laps for a given driver by grid position is given as a text label, sized by count, and rotated to minimise overlap. The horizontal axis actually uses a logarithmic scale, which “stretches out” the positions at the front of the grid (grid positions 1 and 2) compared to positions lower down the grid – where counts are likely to be lower anyway. To try to recapture some sense of where grid positions lie along the horizontal axis, a dashed vertical line at grid position 2.5 marks out the front row. The x-axis is further expanded to stop labels being obscured or overflowing off the left hand side of the plotting area. The clean black and white theme finishes off the chart.

library(ggplot2)

g = ggplot(driverlapsledfromgridposition)
g = g + geom_vline(xintercept = 2.5, colour='lightgrey', linetype='dashed')
#angle is a fixed setting rather than a data mapping, so it sits outside aes()
g = g + geom_text(aes(x=grid, y=code, label=Laps, size=log(Laps)), angle=45)
g = g + facet_wrap(~year) + xlab(NULL) + ylab(NULL) + guides(size='none')
g + scale_x_log10(expand=c(0,0.3)) + theme_bw()

There are still a few problems with this graphic, however. The order of labels on the y-axis is in alphabetical order, and would perhaps be more informative if ordered to show championship rankings, for example.
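One way to control the y-axis order is to turn the driver code into a factor whose levels follow the desired ranking, since ggplot2 lays out a discrete axis in level order. The sketch below uses placeholder driver codes and a made-up ranking vector; where the real ranking would come from (a standings query, say) is left open.

```r
# Hypothetical lead lap counts, using placeholder driver codes
lapsled = data.frame(code = c('AAA', 'BBB', 'CCC', 'AAA'),
                     grid = c(1, 2, 3, 2),
                     Laps = c(40, 12, 5, 8),
                     stringsAsFactors = FALSE)

# Hypothetical championship ranking, best-placed driver first
ranking = c('BBB', 'AAA', 'CCC')

# The first factor level is drawn at the bottom of a discrete y-axis,
# so reverse the ranking to put the best-placed driver at the top
lapsled$code = factor(lapsled$code, levels = rev(ranking))
levels(lapsled$code)  # "CCC" "AAA" "BBB"
```

Any driver missing from `ranking` would become `NA` here, so the ranking needs to cover every code that appears in the data.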

However, to return to the main theme of this post, whilst the R language and RStudio environment are being used as a medium within which this activity has taken place, the data wrangling and analysis (in the sense of counting) is being performed by the SQL query, and the visual representation and analysis (in the sense of faceting, for example, and generating visual cues based on data properties) is being performed by routines supplied as part of the ggplot library.

So if asked whether this is an example of using R for data analysis and visualisation, what would your response be? What does it take for something to be peculiarly or particularly an R based analysis?

For more details, see the “Laps Completed and Laps Led” draft chapter and the Wrangling F1 Data With R book.

## Calculating Churn in Seasonal Leagues

One of the things I wanted to explore in the production of the Wrangling F1 Data With R book was the extent to which I could draw on published academic papers for inspiration in exploring the various results and timing datasets.

In a chapter published earlier this week, I explored the notion of churn, as described in Mizak, D, Neral, J & Stair, A (2007) The adjusted churn: an index of competitive balance for sports leagues based on changes in team standings over time. Economics Bulletin, Vol. 26, No. 3 pp. 1-7, and further appropriated by Berkowitz, J. P., Depken, C. A., & Wilson, D. P. (2011). When going in circles is going backward: Outcome uncertainty in NASCAR. Journal of Sports Economics, 12(3), 253-283.

In a competitive league, churn is defined as:

$C_t = \frac{\sum_{i=1}^{N}\left|f_{i,t} - f_{i,t-1}\right|}{N}$

where $C_t$ is the churn in team standings for year $t$, $\left|f_{i,t} - f_{i,t-1}\right|$ is the absolute value of the $i$-th team’s change in finishing position going from season $t-1$ to season $t$, and $N$ is the number of teams.

The adjusted churn is defined as an indicator with the range 0..1 by dividing the churn, $C_t$, by the maximum churn, $C_{max}$. The value of the maximum churn depends on whether there is an even or odd number of competitors:

$C_{max} = N/2 \text{, for even N}$

$C_{max} = (N^2 - 1) / 2N \text{, for odd N}$

Berkowitz et al. reconsidered churn as applied to an individual NASCAR race (that is, at the event level). In this case, $f_{i,t}$ is the position of driver $i$ at the end of race $t$, $f_{i,t-1}$ is the starting position of driver $i$ at the beginning of that race (that is, race $t$) and $N$ is the number of drivers participating in the race. Once again, the authors recognise the utility of normalising the churn value to give an *adjusted churn* in the range 0..1 by dividing through by the maximum churn value.

Using these models, I created churn functions of the form:

is.even = function(x) x %% 2 == 0
churnmax=function(N)
if (is.even(N)) return(N/2) else return(((N*N)-1)/(2*N))

churn=function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

and then used it to explore churn in a variety of contexts:

• comparing grid positions vs race classifications across a season (cf. Berkowitz et al.)
• churn in Drivers’ Championship standings over several seasons (cf. Mizak et al.)
• churn in Constructors’ Championship standings over several seasons (cf. Mizak et al.)

For example, in the first case, we can process data from the ergast database as follows:

library(DBI)
ergastdb = dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

q=paste('SELECT round, name, driverRef, code, grid,
position, positionText, positionOrder
FROM results rs JOIN drivers d JOIN races r
ON rs.driverId=d.driverId AND rs.raceId=r.raceId
WHERE r.year=2013',sep='')
results=dbGetQuery(ergastdb,q)

library(plyr)
results['delta'] =  abs(results['grid']-results['positionOrder'])
churn.df = ddply(results[,c('round','name','delta')], .(round,name), summarise,
churn = churn(delta)
)
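As a quick sanity check on the definitions, here is a worked example on an invented five-car race. The churn helpers are repeated so that the block runs on its own; the grid and finishing orders are made up.

```r
# Churn helpers, repeated from above so that this block runs on its own
is.even = function(x) x %% 2 == 0
churnmax = function(N) if (is.even(N)) N/2 else ((N*N)-1)/(2*N)
churn = function(d) sum(d)/length(d)
adjchurn = function(d) churn(d)/churnmax(length(d))

# Hypothetical five-car race: grid position and finishing position per driver
grid   = 1:5
finish = c(2, 1, 3, 5, 4)

delta = abs(grid - finish)  # 1 1 0 1 1
race_churn = churn(delta)   # 4/5 = 0.8
race_adj = adjchurn(delta)  # 0.8 / 2.4, i.e. one third

# A fully reversed finishing order attains the maximum churn,
# so its adjusted churn is 1, for odd and even N alike
reversed_odd  = adjchurn(abs(1:5 - 5:1))
reversed_even = adjchurn(abs(1:4 - 4:1))
```

For N = 5 the reversed order gives deltas of 4, 2, 0, 2, 4, a churn of 12/5 = 2.4, which matches $(N^2 - 1)/2N$; for N = 4 the deltas 3, 1, 1, 3 give 8/4 = 2 = N/2.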


For more details, see this first release of the Keeping an Eye on Competitiveness – Tracking Churn chapter of the Wrangling F1 Data With R book.

## Identifying Position Change Groupings in Rank Ordered Lists

The title says it all, doesn’t it?!

Take the following example – it happens to show race positions by driver for each lap of a particular F1 grand prix, but it could be the evolution over time of any rank-based population.

The question I had in mind was – how can I identify positions that are being contested during a particular window of time, where by contested I mean that the particular position was held by more than one person in a particular window of time?

Let’s zoom in to look at a couple of particular steps.

We see distinct groups of individuals who swap positions with each other between those two consecutive steps, so how can we automatically detect the positions that these drivers are fighting over?

A solution given to a Stack Overflow question on how to get disjoint sets from a list in R gives what I thought was a really nice solution: treat it as a graph, and then grab the connected components.

Here’s my working of it. Start by getting a list of results that show a particular driver held different positions in the window selected – each row in the original dataframe identifies the position held by a particular driver at the end of a particular lap:

library(DBI)
ergastdb =dbConnect(RSQLite::SQLite(), './ergastdb13.sqlite')

#Get a race identifier for a specific race
raceId=dbGetQuery(ergastdb,
'SELECT raceId FROM races WHERE year="2012" AND round="1"')

q=paste('SELECT * FROM lapTimes WHERE raceId=',raceId[[1]])

lapTimes=dbGetQuery(ergastdb,q)
lapTimes$position=as.integer(lapTimes$position)

library(plyr)

#Sort by driver, then lap, just in case
lapTimes=arrange(lapTimes,driverId,lap)

#Create a couple of new columns
#pre is previous lap position held by a driver given their current lap
#ch is position change between the current and previous lap
tlx=ddply(lapTimes,.(driverId),transform,pre=(c(0,position[-length(position)])),ch=diff(c(0,position)))

#Find rows where there is a change between a given lap and its previous lap
#In particular, focus on lap 17
llx=tlx[tlx['ch']!=0 & tlx['lap']==17,c("position","pre")]

llx

This filters the complete set of data to just those rows where there is a difference between a driver’s current position and previous position (the first column in the result just shows row numbers and can be ignored).

##      position pre
## 17          2   1
## 191        17  18
## 390         9  10
## 448         1   2
## 506         6   4
## 719        10   9
## 834         4   5
## 892        18  19
## 950         5   6
## 1008       19  17

We can now create a graph in which nodes represent positions (position or pre values) and edges connect a current and previous position.

#install.packages("igraph")
#http://stackoverflow.com/a/25130575/454773
library(igraph)

posGraph = graph.data.frame(llx)

plot(posGraph)


The resulting graph is split into several components:

We can then identify the connected components:

posBattles=split(V(posGraph)$name, clusters(posGraph)$membership)
#Find the position change battles
for (i in 1:length(posBattles)) print(posBattles[[i]])

This gives the following clusters, and their corresponding members:

## [1] "2" "1"
## [1] "17" "18" "19"
## [1] "9"  "10"
## [1] "6" "4" "5"

To generalise this approach, I think we need to do a couple of things:

• allow a wider window within which to identify battles (so look over groups of three or more consecutive laps);
• simplify the way we detect position changes for a particular driver; for example, if we take the set of positions held by a driver within the desired window, if the cardinality of the set (that is, its size) is greater than one, then we have had at least one position change for that driver within that window. Each set of size > 1 of unique positions held by different drivers can be used to generate a set of distinct, unordered pairs that connect the positions (I think it only matters that they are connected, not that a driver specifically went from position x to position y going from one lap to the next?). If we generate the graph from the set of distinct unordered pairs taken across all drivers, we should then be able to identify the contested/driver change position clusters.
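A rough sketch of that generalisation follows, under some loud assumptions: the lap data is invented, the window is three laps wide, and instead of building an igraph graph from unordered pairs it merges overlapping sets of positions directly in base R (`merge_sets` is a helper written for this example). Since the positions held by one driver would all be pairwise connected anyway, merging sets that share a position should give the same groups as the connected components of the pair graph.

```r
# Hypothetical positions for six drivers over laps 16 to 18
laps = data.frame(
  driverId = rep(c('d1', 'd2', 'd3', 'd4', 'd5', 'd6'), each = 3),
  lap      = rep(16:18, times = 6),
  position = c(1,2,2,  2,1,1,  3,3,3,  4,5,6,  5,4,4,  6,6,5),
  stringsAsFactors = FALSE
)

# Set of distinct positions held by each driver within the window
win = laps[laps$lap >= 16 & laps$lap <= 18, ]
held = lapply(split(win$position, win$driverId), unique)

# Keep only drivers who changed position at least once (set size > 1)
contested = held[sapply(held, length) > 1]

# Merge any sets that share a position into a single group
merge_sets = function(sets) {
  groups = list()
  for (s in sets) {
    # Which existing groups overlap with this driver's set of positions?
    hit = vapply(groups, function(g) length(intersect(g, s)) > 0, logical(1))
    if (any(hit)) {
      s = unique(c(s, unlist(groups[hit])))
      groups = groups[!hit]
    }
    groups[[length(groups) + 1]] = sort(s)
  }
  groups
}

battles = merge_sets(contested)
battles  # positions 1-2 form one group, positions 4-5-6 another
```

Position 3 does not appear in any group because the driver holding it never moved within the window.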

Hmm… I need to try that out… And when I do, if and when it works(?!), I’ll add a complete demonstration of it – and how we might make use of it – to the Wrangling F1 Data With R book.