Using Regular Expressions to Filter Tweets Based on the Number of Times a Pattern Appears Within Them

Every so often, it’s good to get a little practice in using regular expressions… Via the wonderful F1 Metrics blog, I noticed that the @f1debrief Twitter account had been tweeting laptimes from the F1 testing sessions. The data wasn’t republished in the F1metrics blog (though I guess I could have tried to scrape it from the charts), but it is still viewable on the @f1debrief timeline, so I grabbed the relevant tweets using the Twitter API statuses/user_timeline call:

response1 = make_twitter_request(twitter_api.statuses.user_timeline,
                                 screen_name='f1debrief', count=200,
                                 exclude_replies='true', trim_user='true',
                                 include_rts='false', max_id='705792403572178944')
response2 = make_twitter_request(twitter_api.statuses.user_timeline,
                                 screen_name='f1debrief', count=200,
                                 exclude_replies='true', trim_user='true',
                                 include_rts='false', max_id=response1[-1]['id'])
tweets = response1 + response2

The tweets I was interested in look like this (and variants thereof):

[Screenshot of an example @f1debrief laptimes tweet]

The first thing I wanted to do was to limit the list of tweets I’d grabbed to just the ones that contained a list of laptimes. The way I opted to do this was to create a regular expression that spots patterns of the form N.NN, and then select tweets containing three or more instances of that pattern. The re module’s findall() method finds all non-overlapping instances of the specified pattern in a string and returns them as a list.

import re

regexp = re.compile(r'\d\.\d')

#reverse the order of the tweets so they are in ascending time order
for i in tweets[::-1]:
    if len(re.findall(regexp, i['text'])) >= 3:
        #...do something with the tweets containing 3 or more laptimes
        pass

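To see the filter in action on its own, here’s a self-contained sketch using a couple of made-up tweet-like dicts standing in for the API response (the Greek first line is invented for illustration):

```python
import re

# Pattern for laptime fragments of the form N.NN
regexp = re.compile(r'\d\.\d')

# Made-up examples standing in for the Twitter API response
sample_tweets = [
    {'text': 'Χρόνοι του Hamilton (soft)\n1:23.456\n1:24.012\n1:23.987'},
    {'text': 'Morning update from Barcelona testing'},
]

# Keep only tweets containing three or more laptime-like patterns
laptime_tweets = [t for t in sample_tweets
                  if len(regexp.findall(t['text'])) >= 3]

print(len(laptime_tweets))
# → 1 (only the first sample tweet contains three laptimes)
```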
Inspecting several of the timing-related tweets showed that they generally conform to a pattern:

  • first line: information about the driver and the tyres (in brackets);
  • a list of laptimes, each time on a separate line;
  • an optional final line, typically starting with a hashtag.

We can use a regular expression match to try to pull out the name of the driver and tyre compound based on a common text pattern:

#The tweets are in Greek; the driver name typically appears immediately after the word του ("of")
#The tyre compound appears in brackets
regexp3 = re.compile(r'^.* του (.*).*\s?\((.*)\).*')
#I could have tried to extract drivers more explicitly from a list of driver names I knew to be participating

#split the tweet text by end of line character
lines=i['text'].split('\n')

#Try to pull out the driver name and tyre compound from the first line
m = re.match(regexp3, lines[0])
if m:
    #There is occasionally some text between the driver name and the bracketed tyre compound
    #So split on a space and select the first item
    dr = m.group(1).split(' ')[0]
    ty = m.group(2)
    print(dr, '..', ty)
else:
    dr=lines[0]
    ty=''
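For example, running the match against a made-up first line in the same Greek format (the wording is invented; only the `του … (…)` pattern matters):

```python
import re

regexp3 = re.compile(r'^.* του (.*).*\s?\((.*)\).*')

# Invented first line following the observed pattern
line = 'Οι χρόνοι του Hamilton (soft)'

m = re.match(regexp3, line)
if m:
    # Take the first word after του as the driver name
    dr = m.group(1).split(' ')[0]
    ty = m.group(2)
else:
    dr, ty = line, ''

print(dr, ty)
# → Hamilton soft
```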

For the timings, we need to do a little bit of tidying. Generally, times were of the form N:NN.NN, but some were of the form NN.NN. In addition, there were occasional rogue spaces in the timings. In this case, we can use regular expression substitution on a particular pattern:

for j in lines[1:]:
    #Prepend the (assumed) minutes to times of the form NN.NN
    j = re.sub(r'^(\d+\.)', r'1:\1', j)
    #Close up rogue spaces around the colon
    j = re.sub(r'^(\d)\s?:\s?(\d)', r'\1:\2', j)
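Applied to a few representative (made-up) timing lines, the two substitutions behave like this:

```python
import re

# Made-up examples of the raggedness seen in the tweets
samples = ['23.456', '1 : 24.012', '1:23.987']

tidied = []
for j in samples:
    # Prepend the (assumed) minutes to times of the form NN.NN
    j = re.sub(r'^(\d+\.)', r'1:\1', j)
    # Close up rogue spaces around the colon
    j = re.sub(r'^(\d)\s?:\s?(\d)', r'\1:\2', j)
    tidied.append(j)

print(tidied)
# → ['1:23.456', '1:24.012', '1:23.987']
```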

The final code can be found in this gist and gives output of the form:

[Screenshot of the tidied laptimes output file]

There are a few messed-up lines, as the example shows, but these are easily handled by hand. (There is often a trade-off between fully automating and partially automating a scrape. Sometimes it can be quicker just to do a final bit of tidying up in a text editor.) In the example output, I also put in an easily identified marker line (starting with ==) that shows the original first line of each tweet. (It also has the benefit of making it easy to find the last line of the previous tweet, just in case that needs tidying too…) These marker lines can easily be removed from the file using a regular expression pattern as the basis of a search and replace (replacing with nothing to delete the line).
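The marker-line deletion can itself be done with a regular expression substitution. A minimal sketch (the text fragment is made up, and a text editor’s regex search-and-replace would do equally well):

```python
import re

# Made-up output fragment containing a == marker line
text = '== Οι χρόνοι του Hamilton (soft)\n1:23.456\n1:23.987\n'

# Delete any whole line starting with ==, including its trailing newline
cleaned = re.sub(r'^==.*\n?', '', text, flags=re.MULTILINE)

print(cleaned)
# → 1:23.456
#   1:23.987
```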

So that’s three ways of using regular expressions – to count the occurrences of a pattern and use that as the basis of a filter; to extract elements based on pattern matching in a string; and as the basis for a direct pattern based string replace/substitution.