Improving Autocorrelation Calculations on Google Trends Data

In Identifying Periodic Google Trends, Part 1: Autocorrelation, I described how to calculate the autocorrelation statistic for Google Trends data using matplotlib. One of the hacks I found was required in order to calculate an informative autocorrelogram was to subtract the mean signal value from the original signal before running the calculation.

A more pathological situation occurs in the following case, using the Google Trends data for “run”:

Visual inspection of the original trend data suggests there is annual periodicity (note to self: learn how to add vertical gridlines at required points using matplotlib;-):

However, the autocorrelogram does not detect the periodicity, for two reasons: firstly, as in the previous cases, the non-zero mean value of the original time series means the periodic excursions are attenuated in the autocorrelation calculation compared with excursions from a zero mean; and secondly, the increasing trend of the data adds further confusion to the year-on-year comparisons used in the autocorrelation calculation.

Googling around remove trend and matplotlib turned up a detrend function that looked like it could help clean the data used for the autocorrelation calculation. In fact, a detrend argument is mentioned in the documentation for the acorr autocorrelation function, although no details of the values it can take are provided there. However, searching the rest of that documentation page for detrend does turn up valid values for the argument: detrend=mlab.detrend_mean, mlab.detrend_linear, and mlab.detrend_none, where import matplotlib.mlab as mlab.

If we set the detrend processor to mlab.detrend_mean we get the following:

And with detrend set to mlab.detrend_linear we get:
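For reference, here's a minimal sketch of this sort of comparison, using synthetic data (a rising linear trend plus an annual sinusoid, standing in for the "run" series, which isn't reproduced here) to show how the detrend argument changes the autocorrelogram:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend, so this runs as a script
import matplotlib.pyplot as plt
import matplotlib.mlab as mlab

# Synthetic stand-in for the "run" data: four years of weekly samples,
# a rising linear trend plus a 52 week (annual) sinusoid
t = np.arange(4 * 52)
y = 0.05 * t + np.sin(2 * np.pi * t / 52)

fig, axes = plt.subplots(3, 1, figsize=(8, 8))
for ax, detrend, label in [
    (axes[0], mlab.detrend_none, "detrend_none"),
    (axes[1], mlab.detrend_mean, "detrend_mean"),
    (axes[2], mlab.detrend_linear, "detrend_linear"),
]:
    # acorr returns the lags and the normalised correlation values
    lags, c, _, _ = ax.acorr(y, detrend=detrend, maxlags=60, usevlines=True)
    ax.set_title(label)
fig.tight_layout()
fig.savefig("acorr_detrend.png")

# With detrend_linear, the annual peak at lag 52 stands out clearly
print(c[lags == 52])
```

After the loop, c and lags hold the detrend_linear result; the lag 52 value sits well above the background, whereas with detrend_none the trend keeps all the small-lag correlations pinned near 1.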

In each of these latter two cases, we see evidence of the 52 week correlation (i.e. annual periodicity).

FWIW, here’s the gist for the modified code.

Identifying Periodic Google Trends, Part 1: Autocorrelation

One of the many things we're all pretty good at, partly because of the way we're wired, is spotting visual patterns. Take the following image, for example, which is taken from Google Trends and shows relative search volume for the term "flowers" over the last few years:

The trend shows annual periodic behaviour (the same thing happens every year), with a couple of significant peaks showing heavy search volumes around the term on two separate occasions, a lesser blip between them and a small peak just before Christmas; can you guess what these occasions relate to?;-) The data itself can be downloaded in a tatty csv file from the link at the bottom left of the page (tatty because several distinct CSV data sets are contained in the CSV file, separated by blank lines.) The sampling frequency is once per week.

The flowers trace actually holds a wealth of secrets – behaviours vary across UK and the US, for example – but for now I’m going to ignore that detail (I’ll return to it in a later post). Instead, I’m just going to (start) asking a very simple question – can we automatically detect the periodicity in the trend data?

Way back when, my first degree was electronics. Many of the courses I studied related to describing in mathematical terms the structure of “systems” and the analysis of the structure of signals; ideal grounding for looking at time series data such as the Google Trends data, and web analytics data.

Though I’ve since forgotten much of what I studied then, I can remember the names of many of the techniques and methods, if not how to apply them. So one thing I intend to do over the next quarter is something of a refresher in signal processing/time series analysis (which is to say, I would appreciate comments on at least three counts: firstly, if I make a mistake, please feel obliged to point it out; secondly, if I’m missing a trick, or an alternative/better way of achieving a similar or better end, please point it out; thirdly, the approach I take will be a rediscovery of the electronics/engineering take on this sort of analysis. Time series analysis is also widely used in biology, economics and so on, though the approach or interpretation taken in different disciplines may differ* – if you can help bridge my (lack of) engineering understanding with a biological or economic perspective/interpretation, please do so;-)

(*I discovered this during my PhD, when I noticed that the equations used to describe evolution in genetic populations in discrete and continuous models were the same as the equations used to describe different sorts of low pass filter in electronics; which means that, under the electronics-inspired interpretation of the biological models, we could say by inspection that populations track low frequency components (components with a periodicity over tens of generations) and ignore high frequency components. The biologists weren’t interested…)

To start with, let’s consider the autocorrelation of the trend data. Autocorrelation measures the extent to which a signal is correlated with (i.e. similar to) itself over time. Essentially, it is calculated from the product of the signal at each sample point with a timeshifted version of itself. (Wikipedia is as good as anywhere to look up the formal definition of autocorrelation.)
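As a sketch of that definition (not the code from the gist below), the calculation can be written directly in numpy as a set of dot products between the signal and timeshifted copies of itself:

```python
import numpy as np

def autocorr(x, maxlags):
    """Normalised autocorrelation of x for lags -maxlags..maxlags,
    computed as the product of the signal with timeshifted copies of itself."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    denom = np.dot(x, x)  # normalisation, so the zero-lag value is 1
    lags = np.arange(-maxlags, maxlags + 1)
    c = np.array([np.dot(x[max(0, -k):n - max(0, k)],
                         x[max(0, k):n - max(0, -k)]) for k in lags]) / denom
    return lags, c

# A pure 52 week sinusoid correlates perfectly with itself at zero lag,
# and strongly again one full period later
t = np.arange(3 * 52)
lags, c = autocorr(np.sin(2 * np.pi * t / 52), maxlags=60)
```

Note that for non-zero lags the overlapping region between the signal and its shifted copy shrinks, so even a perfectly periodic signal gives correlation values below 1 away from the origin.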

I used Python’s matplotlib library to calculate the autocorrelation, using this gist. The numbers in the array are the search volume values exported from Google Trends.

The top trace shows the original time series data – in this case the search volume (arbitrary units) of the term “flowers” over the last few years, with a sample frequency of once per week.

The second trace is the autocorrelation, over all timeshifts. Whilst there appear to be a couple of peaks in the data, it’s quite hard to read, because the variance of the original signal is not that great. Most of the time the signal value is close to 1, with occasional excursions away from that value. However, if we subtract the average signal value from the original signal (finding g(t)=f(t)-MEAN(f)) and then run the autocorrelation function, we get a much more striking view of the autocorrelation of the data:

(if I’ve been really, really naughty doing this, please let me know; I also experimented with subtracting the minimum value to set the floor of the signal to 0;-)
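By way of illustration, here’s a sketch of the mean-subtraction trick using a made-up stand-in for the exported search volumes (mostly near 1, with annual excursions); the real values come from the Google Trends CSV:

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Stand-in for the weekly "flowers" volumes: a signal close to 1 most of
# the time, with a Gaussian bump of excursions at the same point each year
t = np.arange(4 * 52)
f = 1.0 + 0.5 * np.exp(-0.5 * ((t % 52) - 6) ** 2 / 2 ** 2)

g = f - f.mean()  # g(t) = f(t) - MEAN(f)

fig, (ax1, ax2) = plt.subplots(2, 1, figsize=(8, 6))
# raw signal: the non-zero mean keeps the correlation high at every lag
lags_f, c_f, _, _ = ax1.acorr(f, maxlags=None)
# mean-subtracted: the annual peaks now stand out against a near-zero floor
lags_g, c_g, _, _ = ax2.acorr(g, maxlags=None)
fig.tight_layout()
fig.savefig("acorr_mean.png")
```

With maxlags=None, acorr returns all 2*len(x)-1 lags; the raw trace stays high even at lags with no periodic structure (e.g. 26 weeks), while the mean-subtracted trace drops to near zero there and peaks at 52.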

A couple of things are worth noticing: firstly, the autocorrelation is symmetrical about the origin; secondly, the autocorrelation pattern repeats every 52 weeks (52 timeshifted steps)… Let’s zoom in a bit by setting the maxlags value in the script to 53, so we can focus on the autocorrelation values over a 52 week period:
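Capping maxlags in the acorr call does the zooming (again sketched with a synthetic stand-in series rather than the exported data):

```python
import numpy as np
import matplotlib
matplotlib.use("Agg")  # headless backend
import matplotlib.pyplot as plt

# Stand-in for the mean-subtracted series: a zero-mean 52 week sinusoid
t = np.arange(4 * 52)
g = np.sin(2 * np.pi * t / 52)

fig, ax = plt.subplots()
# maxlags=53 restricts the autocorrelogram to one year of timeshifts
lags, c, _, _ = ax.acorr(g, maxlags=53)
fig.savefig("acorr_zoom.png")
```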

So – what does the autocorrelogram(?) tell us? Firstly, there is a periodicity over the course of the year. Secondly, there appear to be a couple of features 12 weeks or so apart (subject to a bit of jitter…). That is, there is a correlation between f(t) and f(t+12), as well as between f(t) and f(t-40) (where 40=52-12…)

Here’s another trend – turkey:

Again, the annual periodicity is detected, as well as a couple of features that are four weeks apart…

How about a more regular trend – full moon perhaps?

This time, we see peaks 4 weeks apart across the year – the monthly periodicity has been detected.

Okay – that’s enough for now… there are three next steps I have in mind: 1) have a tinker with the Google Analytics data export API and plug samples of Googalytics time series data into an autocorrelation function to see what sorts of periodic behaviour I can detect; 2) find out how to drive some Fourier Transform code so I can do some rather more structured harmonic analysis on the time series data; 3) blog a bit about linear systems, and show how things like the “flowers” trend data are actually made up of several separate, well-defined signals.

But first… marking:-(

PS here’s a great review of looking at time series data for a search on “ebooks” using Google Insights for Search data using R: eBooks in Education – Looking at Trends