Detecting Features in Data Using Symbolic Coding and Regular Expression Pattern Matching

One of the reasons I dive into motorsport results and timing data every so often is that it gives me a quite limited set of data to play with. In turn, this means I have to get creative when it comes to reshaping the data to see what visuals I can pull out of it, as well as generating derived datasets to see what other story forms and insights might be hidden in there.

One of the things I hope to do with the WRC data is push a bit more on automatically generating text-based race reports from the data. Part of the trick here is spotting patterns that can be mapped onto textual tropes: the common sorts of phrase or sentence that you are likely to see in the more vanilla forms of sports reporting. (“X led the race from the start”, “Despite a poor start to the stage, Y went on to win it, N seconds ahead of Z in second place” and so on.)

So how can we spot the patterns? One way is to write a SQL query that detects a particular pattern in the data and uses that to flag a possible event (for example, Detecting Undercuts in F1 Races Using R). Another might be to cast the data as a graph and then detect features using graph-based algorithms (e.g. Identifying Position Change Groupings in Rank Ordered Lists).

During the middle of last night, I woke up wondering whether or not it would be possible to cast simple feature components as symbols and then use a regular expression pattern matcher to identify a particular sort of pattern from a symbolic string. So here’s a quick proof of concept…
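As a minimal sketch of what that might look like in Python (not the original proof of concept), the following codes each step between successive splits as a single symbol and then runs regular expressions over the resulting string. The G/L/H alphabet, the function name, the two patterns and the rank values are all made up for illustration.

```python
import re

def code_rank_changes(ranks):
    """Turn rank positions at successive splits into one symbol per step."""
    symbols = []
    for prev, curr in zip(ranks, ranks[1:]):
        if curr < prev:
            symbols.append("G")  # gained places (rank number went down)
        elif curr > prev:
            symbols.append("L")  # lost places (rank number went up)
        else:
            symbols.append("H")  # held position
    return "".join(symbols)

# Made-up rank positions at six successive splits (not real results).
ranks = [8, 8, 5, 3, 3, 1]
coded = code_rank_changes(ranks)

# Hypothetical feature patterns expressed over the symbol string.
never_lost_a_place = re.compile(r"^[GH]+$")
dropped_then_recovered = re.compile(r"L+H*G+")

print(coded)
print(bool(never_lost_a_place.search(coded)))
print(bool(dropped_then_recovered.search(coded)))
```

The appeal of this approach is that each new feature is just another pattern string over the same coded sequence, rather than a new query or a new algorithm.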

The data: some split times, and rank positions at each split, from stage 3 of the WRC Monte Carlo 2017 rally.


Here’s a visual representation of the same: the number labels are the rank position at each split, and the y-axis is the delta to the fastest time recorded over that split (the “sector time”, if you will, which is data derived from the original results data).


For each driver, you may be able to spot several shapes. For example, Ogier is way behind at the first split but then gains over the rest of the stage; Meeke and Breen lose time at the second split; Hanninen loses it on the final part of the stage; and so on. Can we code for these different patterns, and then detect them?
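One way of coding for shapes like these, sketched here under some loudly stated assumptions rather than as the code actually used, is to bin each sector delta into a symbol (F for fast, M for mid, S for slow) and then match named patterns against the coded string. The thresholds, the pattern labels, the four-sector layout and the delta values are all invented for illustration, and the drivers are deliberately anonymous.

```python
import re

def code_delta(delta, near=1.0, far=5.0):
    """Map a delta (seconds off the fastest sector time) to a symbol.

    The thresholds are arbitrary illustrative choices.
    """
    if delta <= near:
        return "F"  # fast: close to the best sector time
    if delta <= far:
        return "M"  # mid: some time lost
    return "S"      # slow: a lot of time lost

def code_stage(deltas):
    """Code a stage as one symbol per sector."""
    return "".join(code_delta(d) for d in deltas)

# Hypothetical shape patterns, keyed by the phrase they might trigger.
PATTERNS = {
    "poor start, then recovered": re.compile(r"^S[FM]+$"),
    "lost time mid-stage": re.compile(r"^[FM]+S+[FM]+$"),
    "lost time at the end": re.compile(r"^[FM]+S$"),
    "quick throughout": re.compile(r"^F+$"),
}

def describe(deltas):
    """Return the labels of every pattern the coded stage matches."""
    coded = code_stage(deltas)
    return [label for label, pattern in PATTERNS.items()
            if pattern.search(coded)]

# Invented sector deltas in seconds (not real results).
examples = {
    "driver_a": [9.8, 0.0, 0.4, 0.2],
    "driver_b": [0.5, 7.2, 1.5, 0.9],
    "driver_c": [0.3, 0.8, 1.2, 12.0],
}

for driver, deltas in examples.items():
    print(driver, code_stage(deltas), describe(deltas))
```

Each matched label could then be mapped onto one of the textual tropes mentioned earlier. Coding by fixed thresholds is crude; coding relative to each driver's own average, or to the spread across the field, might be a better choice.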


So that seems to work okay… Now all I need to do is come up with some suitable symbolic encodings and pattern matching strings…

Hmmm… Vague memories… I wonder if there are any symbolic dynamics algorithms or finite state machine grammar parsers I could make use of?
