Matplotlib: Detrending Time Series Data

Reading the rather wonderful Data Analysis with Open Source Tools (if you haven’t already got a copy, get one… NOW…), I noticed a comment that autocorrelation “is intended for time series that do not exhibit a trend and have zero mean”. Doh! Doh! And three times: doh!

I’d already come to the same conclusion, pragmatically, in Identifying Periodic Google Trends, Part 1: Autocorrelation and in Improving Autocorrelation Calculations on Google Trends Data, but now I’ll be remembering this as a condition of use ;-)

One issue I’d come across was how to plot a copy of the zero-mean and/or detrended data, as calculated by Matplotlib, directly. (I’d already worked out how to use the detrend_ filters in the autocorrelation function.)

The problem I had was that simply trying to plot mlab.detrend_linear(y), as applied to a list of values y, threw an error (“AttributeError: 'list' object has no attribute 'mean'”). It seems that detrend expects y=[1,2,3] to have a method y.mean(), which it doesn’t, normally…

The trick appears to be that matplotlib prefers to work with something like a numpy array, rather than a simple list, because the array offers these additional methods. But how is the data structured? A simple Google wasn’t much help, but a couple of cribs suggested that casting the list with y=np.array(y) (where import numpy as np) might be a good idea.
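As a quick check of the difference, here’s a minimal sketch using the toy list from above (nothing here is specific to matplotlib):

```python
import numpy as np

y = [1, 2, 3]

# A plain Python list has no mean() method...
print(hasattr(y, "mean"))   # False

# ...but the numpy array built from it does
ya = np.array(y)
print(hasattr(ya, "mean"))  # True
print(ya.mean())            # 2.0
```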

So let’s try it:

import matplotlib.pyplot as plt
import matplotlib.mlab as mlab
import numpy as np

label='run'
d=[0.99,0.98,0.95,0.93,0.91,0.93,0.92,0.95,0.95,0.94,0.96,0.98,0.97,1.00,1.01,1.05,1.06,1.06,0.98,0.98,0.98,0.97,0.96,0.93,0.93,0.96,0.95,1.05,0.97,0.95,1.01,1.02,0.98,1.01,0.98,1.00,1.06,1.04,1.06,1.04,0.97,0.94,0.92,0.90,0.87,0.88,0.85,0.90,0.91,0.87,0.88,0.88,0.91,0.91,0.88,0.91,0.92,0.91,0.90,0.92,0.87,0.92,0.92,0.92,0.94,0.97,0.99,1.01,1.01,1.04,0.97,0.94,0.98,0.94,0.98,0.91,0.93,0.92,0.95,1.00,0.93,0.93,0.96,0.96,0.96,0.97,0.95,0.95,1.06,1.12,1.01,1.00,0.99,0.98,0.96,0.93,0.91,0.92,0.92,0.94,0.94,0.94,0.90,0.86,0.89,0.93,0.90,0.90,0.90,0.90,0.89,0.92,0.91,0.92,0.93,0.93,0.94,0.99,0.98,0.99,1.01,1.06,1.06,0.96,0.98,0.92,0.92,0.93,0.91,0.90,0.93,1.02,0.90,0.93,0.91,0.93,0.95,0.93,0.91,0.92,0.96,0.93,1.02,1.02,0.91,0.88,0.87,0.87,0.84,0.82,0.82,0.84,0.83,0.85,0.80,0.80,0.87,0.85,0.83,0.80,0.84,0.83,0.84,0.88,0.83,0.88,0.88,0.86,0.91,0.93,0.91,0.97,0.96,1.00,1.01,0.98,0.94,0.97,0.94,0.95,0.92,0.93,0.97,1.02,0.95,0.92,0.91,0.95,0.93,0.94,0.91,0.92,0.98,0.99,0.97,0.98,0.90,0.86,0.87,0.91,0.87,0.86,0.86,0.89,0.89,0.87,0.86,0.83,0.85,0.86,0.90,0.87,0.87,0.90,0.89,0.93,0.93,0.97,0.99,0.95,1.00,1.05,1.03,1.04,1.08,1.05,1.05,1.05,1.05,1.01,1.07,1.02,1.02,1.04,1.00,1.04,1.17,1.03,1.01,1.02,1.05,1.06,1.05,0.99,1.07,1.03,1.05,1.07,1.04,0.97,0.94,0.97,0.93,0.94,0.96,0.96,1.04,1.05,1.04,0.96,1.00,1.04,1.01,1.00,0.99,0.99,0.99,1.03,1.05,1.02,1.06,1.07,1.04,1.16,1.19,1.12,1.18,1.19,1.16,1.12,1.12,1.09,1.12,1.11,1.12,1.06,1.05,1.14,1.26,1.09,1.12,1.13,1.16,1.18,1.22,1.17,1.24,1.28,1.35,1.19,1.16,1.11,1.11,1.13,1.13,1.11,1.09,1.06,1.07,1.09,1.09,1.03,1.05,1.04,1.04,1.03,1.03,1.06,1.09,1.17,1.12,1.11,1.14,1.20,1.18,1.24,1.19,1.21,1.22,1.22,1.27,1.25,1.18,1.15,1.18,1.17,1.11,1.09,1.10,1.12,1.26,1.15,1.15,1.16,1.16,1.15,1.12,1.15,1.14,1.20,1.31,1.17,1.18,1.14,1.15,1.14,1.12,1.17,1.11,1.10,1.11,1.14,1.10,1.08,1.06]

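fig = plt.figure()
# Cast the list to a numpy array so that detrend_linear can call da.mean()
da = np.array(d)

# Plot all three traces on a single set of axes
ax = fig.add_subplot(111)

# Original data
ax.plot(da)

# Linearly detrended data
y = mlab.detrend_linear(da)
ax.plot(y)

# The straight line that was subtracted from the original data
ax.plot(da - y)

plt.show()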

Here’s the result:

The top, ragged trace is the original data (in the d list); the lower trace is the same data, detrended; the straight line is the line that is subtracted from the original data to produce the detrended data.
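To make the relationship between the three traces a bit more concrete, here’s a numpy-only sketch of linear detrending as I understand it: fit a least-squares straight line and subtract it. This is not matplotlib’s own code, and the series used (t, da) is a made-up stand-in rather than the data in d.

```python
import numpy as np

# Hypothetical stand-in for the series: a rising line plus a wiggle
t = np.arange(50)
da = 0.9 + 0.005 * t + 0.05 * np.sin(t / 3.0)

# Least-squares straight line through the data
slope, intercept = np.polyfit(t, da, 1)
trend = slope * t + intercept

# Subtracting the line leaves the detrended series
detrended = da - trend

# The residual should have (numerically) zero mean and no linear trend left
print(abs(detrended.mean()) < 1e-9)
print(abs(np.polyfit(t, detrended, 1)[0]) < 1e-9)
```

Here trend plays the role of the straight line in the plot (da-y), and detrended the role of the lower trace.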

The lower trace would be the one that gets used by the autocorrelation function using the detrend_linear setting. (To detrend based on simply setting the mean to zero, I think all we need to do is process da-da.mean()?)
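A small sanity check of the mean-removal idea, as a sketch using just the first few values of d:

```python
import numpy as np

# First few values of the series in d
da = np.array([0.99, 0.98, 0.95, 0.93, 0.91])

# Subtract the mean from every sample
zero_mean = da - da.mean()

# The result should have (numerically) zero mean
print(abs(zero_mean.mean()) < 1e-12)
```

The shape of the trace is unchanged; it is simply shifted so that it sits around zero. (matplotlib also provides mlab.detrend_mean, which I’d expect to do the same job.)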

UPDATE: One of the problems with detrending the time series data using the linear trend is that the increasing trend doesn’t appear to start until midway through the series. Another approach to cleaning the data is to remove the mean and trend by taking the first difference of the signal: d(x)=f(x)-f(x-1). It’s calculated as follows:

#time series data in d
#first difference
fd=np.diff(d)

Here’s the linearly detrended data (green) compared to the first difference of the data (blue):

Note that the length of the first difference signal is one sample less than the original data, and shifted to the left one step. (There’s presumably a numpy way of padding the head or tail of the series, though I’m not sure what it is yet!)
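One candidate for the padding, assuming NumPy 1.16 or later, is the prepend and append arguments of np.diff itself. The sketch below uses just the first few values of d; repeating an end value is one choice of padding among several, and it forces the padded difference to zero.

```python
import numpy as np

d = [0.99, 0.98, 0.95, 0.93, 0.91]  # first few values of the series

# Plain first difference: one sample shorter than the input
fd = np.diff(d)
print(len(d), len(fd))  # 5 4

# Pad the head: repeat the first value so the first difference is 0
fd_head = np.diff(d, prepend=d[0])

# Or pad the tail: repeat the last value so the last difference is 0
fd_tail = np.diff(d, append=d[-1])

print(len(fd_head), len(fd_tail))  # 5 5
```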

Here’s the autocorrelation of the first difference signal – if you refer back to the previous post, you’ll see it’s much clearer in this case:

It is possible to pass an arbitrary detrending function into acorr, but I think it needs to return an array that is the same length as the original array?
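If the same-length requirement does hold, a detrending function based on the first difference would need to pad its output. Here’s a minimal sketch of such a callable; the name detrend_first_difference is my own invention, and padding the head with a zero difference is an assumption rather than anything acorr prescribes.

```python
import numpy as np

def detrend_first_difference(x):
    """Hypothetical detrend callable: first difference, padded to len(x)."""
    x = np.asarray(x, dtype=float)
    # Prepending x[0] gives a leading difference of 0 and keeps the length
    return np.diff(x, prepend=x[0])

d = [0.99, 0.98, 0.95, 0.93, 0.91, 0.93, 0.92, 0.95]
out = detrend_first_difference(d)
print(len(out) == len(d))  # True
```

It could then, in principle, be handed to acorr via its detrend argument, something like ax.acorr(d, detrend=detrend_first_difference), though I haven’t checked how well the padded sample behaves there.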

So what next? Looking at the original data, it is quite noisy, with some trends that are apparently obvious to the eye. The diff calculation is quite sensitive to this noise, so it possibly makes sense to smooth the data prior to calculating the first difference and the autocorrelation. But that’s for next time…
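As a taster, here’s a rough sketch of the idea on synthetic data. The boxcar moving average, the window size of 5 and the made-up noisy sine wave are all assumptions for illustration, not a considered choice of smoother.

```python
import numpy as np

def moving_average(x, window=5):
    """Simple boxcar smoother (an assumed choice of filter)."""
    kernel = np.ones(window) / window
    # mode="valid" avoids edge effects but shortens the series by window-1
    return np.convolve(x, kernel, mode="valid")

# Made-up noisy periodic signal standing in for the real data
rng = np.random.default_rng(0)
t = np.arange(200)
noisy = 1.0 + 0.1 * np.sin(2 * np.pi * t / 50) + rng.normal(0, 0.03, t.size)

# First difference of the raw and of the smoothed signal
fd_raw = np.diff(noisy)
fd_smooth = np.diff(moving_average(noisy, window=5))

print(len(fd_smooth))  # 195
# Compare the spread of the two difference signals
print(fd_raw.std(), fd_smooth.std())
```

The smoothed version should show a noticeably smaller spread in its first difference, at the cost of losing a few samples at the ends.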

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...
