Identifying Periodic Google Trends, Part 1: Autocorrelation

One of the many things we’re all pretty good at, partly because of the way we’re wired, is spotting visual patterns. Take the following image, for example, which is taken from Google Trends and shows relative search volume for the term “flowers” over the last few years:

The trend shows annual periodic behaviour (the same thing happens every year), with a couple of significant peaks showing heavy search volumes around the term on two separate occasions, a lesser blip between them, and a small peak just before Christmas; can you guess what these occasions relate to?;-) The data itself can be downloaded as a tatty CSV file from the link at the bottom left of the page (tatty because several distinct CSV data sets are contained in the one file, separated by blank lines). The sampling frequency is once per week.
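
By way of illustration, one way of splitting the separate blocks back out of a file like that might be something along these lines (note that the report.csv filename is just a guess at what the download gets saved as):

import csv

#split a blank-line-delimited, multi-block CSV export into its separate data sets
blocks, current = [], []
with open('report.csv') as f:
    for row in csv.reader(f):
        if not row:
            #a blank line marks the end of one data set
            if current:
                blocks.append(current)
            current = []
        else:
            current.append(row)
    if current:
        blocks.append(current)

#blocks[n] now holds the rows of the n-th data set in the file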

The flowers trace actually holds a wealth of secrets – behaviours vary across UK and the US, for example – but for now I’m going to ignore that detail (I’ll return to it in a later post). Instead, I’m just going to (start) asking a very simple question – can we automatically detect the periodicity in the trend data?

Way back when, my first degree was electronics. Many of the courses I studied related to describing in mathematical terms the structure of “systems” and the analysis of the structure of signals; ideal grounding for looking at time series data such as the Google Trends data, and web analytics data.

Though I’ve since forgotten much of what I studied then, I can remember the names of many of the techniques and methods, if not how to apply them. So one thing I intend to do over the next quarter is something of a refresher in signal processing/time series analysis (which is to say, I would appreciate comments on at least three counts: firstly, if I make a mistake, please feel free – obliged, even – to point it out; secondly, if I’m missing a trick, or an alternative/better way of achieving a similar or better end, please point it out; thirdly, the approach I take will be rediscovering the electronics/engineering take on this sort of analysis. Time series analysis is also widely used in biology, economics and so on, though the approach or interpretation taken in different disciplines may be different* – if you can help bridge my (lack of) engineering understanding with a biological or economic perspective/interpretation, please do so;-)

(*I discovered this during my PhD, when I noticed that the equations used to describe evolution in genetic populations in discrete and continuous models were the same as the equations used to describe different sorts of low pass filter in electronics; which means that, under the electronics-inspired interpretation of the biological models, we could say by inspection that populations track low frequency components (components with a periodicity over 10s of generations) and ignore high frequency components. The biologists weren’t interested…)

To start with, let’s consider the autocorrelation of the trend data. Autocorrelation measures the extent to which a signal is correlated with (i.e. similar to) itself over time. Essentially, it is calculated by summing the products of the signal at each sample point with a time-shifted version of itself, for each possible time shift. (Wikipedia is as good a place as anywhere to look up the formal definition of autocorrelation.)
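
A minimal sketch of that product-and-sum calculation (not taken from any particular library, and ignoring any detrending) might look something like this:

import numpy as np

def autocorr_at_lag(x, lag):
    #sum the products of the signal with a copy of itself shifted by lag samples,
    #normalised by the zero-lag value so that autocorr_at_lag(x, 0) == 1
    x = np.asarray(x, dtype=float)
    products = x[:len(x)-lag] * x[lag:]
    return products.sum() / (x*x).sum()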

I used the Python matplotlib library to calculate the autocorrelation, using this gist. The numbers in the array are the search volume values exported from Google Trends.
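
The gist isn’t reproduced here, but a minimal sketch of the sort of thing it does might look like the following (the numbers shown are placeholders rather than the real exported values):

import matplotlib.pyplot as plt

#placeholder weekly search volume values - the real script uses the full set exported from Google Trends
volumes = [1.00, 0.98, 1.02, 1.10, 1.45, 1.20, 1.05, 0.97, 0.96, 1.01, 0.98, 0.97]

fig = plt.figure()

#top trace: the original time series
ax1 = fig.add_subplot(211)
ax1.plot(volumes)

#bottom trace: the autocorrelation over all possible timeshifts
ax2 = fig.add_subplot(212)
ax2.acorr(volumes, maxlags=None)

plt.show()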

The top trace shows the original time series data – in this case the search volume (arbitrary units) of the term “flowers” over the last few years, with a sample frequency of once per week.

The second trace is the autocorrelation, over all timeshifts. Whilst there appear to be a couple of peaks in the data, it’s quite hard to read, because the variance of the original signal is not that great: most of the time the signal value is close to 1, with occasional excursions away from that value. However, if we subtract the average signal value from the original signal (finding g(t)=f(t)-MEAN(f)) and then run the autocorrelation function, we get a much more striking view of the autocorrelation of the data:
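
In code terms, that’s just something along these lines (placeholder values again):

import numpy as np
import matplotlib.pyplot as plt

#placeholder weekly search volume values
f = np.array([1.00, 0.98, 1.02, 1.10, 1.45, 1.20, 1.05, 0.97, 0.96, 1.01, 0.98, 0.97])

#subtract the average signal value so the signal varies around zero: g(t) = f(t) - MEAN(f)
g = f - f.mean()

plt.acorr(g, maxlags=None)
plt.show()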

(if I’ve been really, really naughty doing this, please let me know; I also experimented with subtracting the minimum value to set the floor of the signal to 0;-)

A couple of things are worth noticing: firstly, the autocorrelation is symmetrical about the origin; secondly, the autocorrelation pattern repeats every 52 weeks (52 timeshifted steps)… Let’s zoom in a bit by setting the maxlags value in the script to 53, so we can focus on the autocorrelation values over a 52 week period:

So – what does the autocorrelogram(?) tell us? Firstly, there is a periodicity over the course of the year. Secondly, there appear to be a couple of features 12 weeks or so apart (subject to a bit of jitter…). That is, there is a correlation between f(t) and f(t+12), as well as between f(t) and f(t-40) (where 40=52-12…)

Here’s another trend – turkey:

Again, the annual periodicity is detected, as well as a couple of features that are four weeks apart…

How about a more regular trend – full moon perhaps?

This time, we see peaks 4 weeks apart across the year – the monthly periodicity has been detected.

Okay – that’s enough for now… there are three next steps I have in mind: 1) to have a tinker with the Google Analytics data export API and plug samples of Googalytics time series data into an autocorrelation function to see what sorts of periodic behaviour I can detect; 2) to find out how to drive some Fourier Transform code so I can do some rather more structured harmonic analysis on the time series data; 3) to blog a bit about linear systems, and show how things like the “flowers” trend data are actually made up of several separate, well-defined signals.

But first… marking:-(

PS here’s a great review of looking at time series data for a search on “ebooks” using Google Insights for Search data using R: eBooks in Education – Looking at Trends

Matplotlib: Detrending Time Series Data

Reading the rather wonderful Data Analysis with Open Source Tools (if you haven’t already got a copy, get one… NOW…), I noticed a comment that autocorrelation “is intended for time series that do not exhibit a trend and have zero mean”. Doh! Doh! And three times: doh!

I’d already come to the same conclusion, pragmatically, in Identifying Periodic Google Trends, Part 1: Autocorrelation and Improving Autocorrelation Calculations on Google Trends Data, but now I’ll be remembering this as a condition of use;-)

One issue I had come across was how to plot a copy of the zero-mean and/or detrended data, as calculated by matplotlib, directly. (I’d already worked out how to use the detrend_ filters in the autocorrelation function.)

The problem was simply that trying to plot mlab.detrend_linear(y), as applied to a list of values y, threw an error (“AttributeError: ‘list’ object has no attribute ‘mean’”). It seems that detrend expects y=[1,2,3] to have a method y.mean(), which it doesn’t, normally…

The trick appears to be that matplotlib prefers to work with a numpy array, rather than a simple list, because the array type offers these additional methods. But how should the data be structured? A simple Google search wasn’t much help, but a couple of cribs suggested that casting the list with y=np.array(y) (where import numpy as np) might be a good idea.

So let’s try it:

import matplotlib.pyplot as plt
from matplotlib import mlab
import numpy as np

label='run'
d=[0.99,0.98,0.95,0.93,0.91,0.93,0.92,0.95,0.95,0.94,0.96,0.98,0.97,1.00,1.01,1.05,1.06,1.06,0.98,0.98,0.98,0.97,0.96,0.93,0.93,0.96,0.95,1.05,0.97,0.95,1.01,1.02,0.98,1.01,0.98,1.00,1.06,1.04,1.06,1.04,0.97,0.94,0.92,0.90,0.87,0.88,0.85,0.90,0.91,0.87,0.88,0.88,0.91,0.91,0.88,0.91,0.92,0.91,0.90,0.92,0.87,0.92,0.92,0.92,0.94,0.97,0.99,1.01,1.01,1.04,0.97,0.94,0.98,0.94,0.98,0.91,0.93,0.92,0.95,1.00,0.93,0.93,0.96,0.96,0.96,0.97,0.95,0.95,1.06,1.12,1.01,1.00,0.99,0.98,0.96,0.93,0.91,0.92,0.92,0.94,0.94,0.94,0.90,0.86,0.89,0.93,0.90,0.90,0.90,0.90,0.89,0.92,0.91,0.92,0.93,0.93,0.94,0.99,0.98,0.99,1.01,1.06,1.06,0.96,0.98,0.92,0.92,0.93,0.91,0.90,0.93,1.02,0.90,0.93,0.91,0.93,0.95,0.93,0.91,0.92,0.96,0.93,1.02,1.02,0.91,0.88,0.87,0.87,0.84,0.82,0.82,0.84,0.83,0.85,0.80,0.80,0.87,0.85,0.83,0.80,0.84,0.83,0.84,0.88,0.83,0.88,0.88,0.86,0.91,0.93,0.91,0.97,0.96,1.00,1.01,0.98,0.94,0.97,0.94,0.95,0.92,0.93,0.97,1.02,0.95,0.92,0.91,0.95,0.93,0.94,0.91,0.92,0.98,0.99,0.97,0.98,0.90,0.86,0.87,0.91,0.87,0.86,0.86,0.89,0.89,0.87,0.86,0.83,0.85,0.86,0.90,0.87,0.87,0.90,0.89,0.93,0.93,0.97,0.99,0.95,1.00,1.05,1.03,1.04,1.08,1.05,1.05,1.05,1.05,1.01,1.07,1.02,1.02,1.04,1.00,1.04,1.17,1.03,1.01,1.02,1.05,1.06,1.05,0.99,1.07,1.03,1.05,1.07,1.04,0.97,0.94,0.97,0.93,0.94,0.96,0.96,1.04,1.05,1.04,0.96,1.00,1.04,1.01,1.00,0.99,0.99,0.99,1.03,1.05,1.02,1.06,1.07,1.04,1.16,1.19,1.12,1.18,1.19,1.16,1.12,1.12,1.09,1.12,1.11,1.12,1.06,1.05,1.14,1.26,1.09,1.12,1.13,1.16,1.18,1.22,1.17,1.24,1.28,1.35,1.19,1.16,1.11,1.11,1.13,1.13,1.11,1.09,1.06,1.07,1.09,1.09,1.03,1.05,1.04,1.04,1.03,1.03,1.06,1.09,1.17,1.12,1.11,1.14,1.20,1.18,1.24,1.19,1.21,1.22,1.22,1.27,1.25,1.18,1.15,1.18,1.17,1.11,1.09,1.10,1.12,1.26,1.15,1.15,1.16,1.16,1.15,1.12,1.15,1.14,1.20,1.31,1.17,1.18,1.14,1.15,1.14,1.12,1.17,1.11,1.10,1.11,1.14,1.10,1.08,1.06]

fig = plt.figure()

#cast the list of values to a numpy array so that mlab.detrend_linear can call .mean() on it
da=np.array(d)

ax = fig.add_subplot(111)

#original data
ax.plot(da)

#linearly detrended data (best fit straight line subtracted)
y = mlab.detrend_linear(da)
ax.plot(y)

#the straight line that gets subtracted from the original data (original minus detrended)
ax.plot(da-y)

plt.show()

Here’s the result:

The top, ragged trace is the original data (in the d list); the lower trace is the same data, detrended; the straight line is the line that is subtracted from the original data to produce the detrended data.

The lower trace would be the one that gets used by the autocorrelation function when the detrend_linear setting is used. (To detrend by simply setting the mean to zero, I think all we need to do is process da-da.mean()?)
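
I think mlab also provides a detrend_mean function that does the same thing; a quick illustrative check (with made-up values):

import numpy as np
from matplotlib import mlab

da = np.array([0.99, 0.98, 0.95, 0.93, 0.91, 0.93])  #illustrative values only

#removing the mean, two ways - they should agree
by_hand = da - da.mean()
by_mlab = mlab.detrend_mean(da)

print(np.allclose(by_hand, by_mlab))  #True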

UPDATE: One of the problems with detrending the time series data using the linear trend is that the increasing trend doesn’t appear to start until midway through the series. Another approach to cleaning the data is to remove the mean and trend by taking the first difference of the signal: d(x)=f(x)-f(x-1). It’s calculated as follows:

#time series data in d
#first difference
fd=np.diff(d)

Here’s the linearly detrended data (green) compared to the first difference of the data (blue):

Note that the length of the first difference signal is one sample less than the original data, and shifted to the left one step. (There’s presumably a numpy way of padding the head or tail of the series, though I’m not sure what it is yet!)
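
One candidate, I think, is np.insert (to pad the head) or np.append (to pad the tail), along these lines:

import numpy as np

d = [0.99, 0.98, 0.95, 0.93, 0.91]  #illustrative values only

fd = np.diff(d)  #first difference; one sample shorter than d

fd_padded_head = np.insert(fd, 0, 0.0)  #pad a zero at the head, restoring the original length
fd_padded_tail = np.append(fd, 0.0)     #or pad a zero at the tail instead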

Here’s the autocorrelation of the first difference signal – if you refer back to the previous post, you’ll see it’s much clearer in this case:

It is possible to pass an arbitrary detrending function into acorr, but I think it needs to return an array that is the same length as the original array?
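
For example, I believe something along these lines works, using either one of the built-in mlab detrend_ functions or a home-made one (placeholder values again):

import numpy as np
import matplotlib.pyplot as plt
from matplotlib import mlab

d = np.array([0.99, 0.98, 0.95, 0.93, 0.91, 0.93, 0.92, 0.95, 1.02, 1.05])  #illustrative values only

fig = plt.figure()

#built-in detrending: subtract the best fit straight line before correlating
ax1 = fig.add_subplot(211)
ax1.acorr(d, detrend=mlab.detrend_linear, maxlags=None)

#a home-made detrend function: it is passed the data array and should return an array of the same length
ax2 = fig.add_subplot(212)
ax2.acorr(d, detrend=lambda x: x - x.mean(), maxlags=None)

plt.show()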

So what next? Looking at the original data, it is quite noisy, with some trends that are apparently obvious to the eye. The diff calculation is quite sensitive to this noise, so it possibly makes sense to smooth the data prior to calculating the first difference and the autocorrelation. But that’s for next time…