One way into finding features is to look at visual representations of the data (that is, just look at charts) and see what jumps out… If anything does, then you can ponder ways of automating the detection or recognition of those visually compelling features, or of things that correspond to them, or proxy for them, in some way. I’ll give an example of that in the *next* post in this series, but for now, let’s consider the following question: *how can we group numbers that are nearly the same?* For example, if I have a set of stage split times, how can I identify groups of drivers that have recorded exactly, or even just

Via StackOverflow, I found the following handy fragment:

    def cluster(data, maxgap):
        '''Arrange data into groups where successive elements
           differ by no more than *maxgap*

            >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
            [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

            >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
            [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]
        '''
        data.sort()  # note: sorts the caller's list in place
        groups = [[data[0]]]
        for x in data[1:]:
            # Compare with the last member of the current group
            if abs(x - groups[-1][-1]) <= maxgap:
                groups[-1].append(x)
            else:
                groups.append([x])
        return groups

    print(cluster([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35))
    # [[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

It struck me that a tweak to the code could also limit the *range* of any grouping, that is, set a *maximum* distance between the first and the last number in any particular grouping. Maybe I don’t want a group to have a range of more than 0.41, for example (that is, strictly more than a dodgy floating point 0.4…):

    def cluster2(data, maxgap, maxrange=None):
        data.sort()  # note: sorts the caller's list in place
        groups = [[data[0]]]
        for x in data[1:]:
            # Is x within maxrange of the first member of the current group?
            inmaxrange = True if maxrange is None else abs(x-groups[-1][0]) <= maxrange
            if abs(x - groups[-1][-1]) <= maxgap and inmaxrange:
                groups[-1].append(x)
            else:
                groups.append([x])
        return groups

    print(cluster2([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41))
    # [[2.1, 2.4, 2.5], [2.52], [3.9], [4.6], [7.4]]
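As an aside on why I reached for 0.41 rather than 0.4 as the range limit: a difference that "should" be exactly 0.4 can come out a hair over it in floating point. The following is only a sketch of one way round that, and the `within` helper and its `abs_tol` default are my own illustrative names and values, not anything from the StackOverflow fragment:

```python
import math

def within(a, b, limit, abs_tol=1e-9):
    """True if |a - b| <= limit, forgiving tiny floating point overshoot.

    Hypothetical helper; abs_tol is an arbitrary illustrative tolerance.
    """
    d = abs(a - b)
    return d <= limit or math.isclose(d, limit, abs_tol=abs_tol)

# 4.1 cannot be stored exactly, so the computed difference is slightly over 0.4
print(4.5 - 4.1 <= 0.4)       # False
print(within(4.5, 4.1, 0.4))  # True
print(within(4.5, 4.0, 0.4))  # False: 0.5 really is out of range
```

Padding the threshold (0.41) is the quick fix; an explicit tolerance makes the intent a little clearer if the thresholds themselves matter.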

A downside of this is that we might argue we have mistakenly omitted a number that is very close to the last number in the previous group, when we should rightfully have included it, because it’s not really very far away from a number that sits close to the group range threshold value…

In which case, we might pull back into a group any numbers that are *really* close to the current last member of the group, irrespective of whether we have gone past the originally specified group range:

    def cluster3(data, maxgap, maxrange=None, maxminrange=None):
        data.sort()  # note: sorts the caller's list in place
        groups = [[data[0]]]
        for x in data[1:]:
            # Within maxrange of the first member of the current group?
            inmaxrange = True if maxrange is None else abs(x-groups[-1][0])<=maxrange
            # Really close to the last member of the current group?
            inmaxminrange = False if maxminrange is None else abs(x-groups[-1][-1])<=maxminrange
            if (abs(x - groups[-1][-1]) <= maxgap and inmaxrange) or inmaxminrange:
                groups[-1].append(x)
            else:
                groups.append([x])
        return groups

    print(cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25))
    # [[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

With these simple fragments, I can now find groups of times that are reasonably close to each other.
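To tie this back to drivers rather than bare numbers, here is a minimal sketch that carries a label along with each time. The driver codes and split times are made up for illustration, and `group_drivers` is a hypothetical helper that only applies the basic `maxgap` rule:

```python
def group_drivers(splits, maxgap):
    """Group driver codes whose sorted split times differ by no more than maxgap.

    splits: dict mapping driver code -> split time (assumes at least one entry).
    """
    ordered = sorted(splits.items(), key=lambda kv: kv[1])
    groups = [[ordered[0]]]
    for driver, t in ordered[1:]:
        # Compare with the time of the last driver in the current group
        if t - groups[-1][-1][1] <= maxgap:
            groups[-1].append((driver, t))
        else:
            groups.append([(driver, t)])
    # Return just the driver codes, group by group
    return [[driver for driver, _ in g] for g in groups]

# Hypothetical split times in seconds, keyed by made-up driver codes
splits = {'AAA': 125.3, 'BBB': 125.5, 'CCC': 127.9, 'DDD': 125.4, 'EEE': 128.0}

print(group_drivers(splits, maxgap=0.25))
# [['AAA', 'DDD', 'BBB'], ['CCC', 'EEE']]
```

Unlike the fragments above, this works on a sorted copy, so the original data is left untouched.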

I can also look for times that are close to other times:

    trythis = [x for x in cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25) if 2.4 in x]
    trythis[0] if len(trythis) else ''
    # [2.1, 2.4, 2.5, 2.52]
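One thing to watch is that `2.4 in x` relies on exact float equality, which can miss a time that has been computed rather than typed in. A hedged alternative is to match with a small tolerance; `group_containing` and its `tol` default are illustrative names and values of my own:

```python
import math

def group_containing(groups, target, tol=1e-9):
    """Return the first group with a member within tol of target, else []."""
    for group in groups:
        if any(math.isclose(v, target, abs_tol=tol) for v in group):
            return group
    return []

groups = [[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

print(group_containing(groups, 2.4))          # [2.1, 2.4, 2.5, 2.52]
print(group_containing(groups, 2.4 + 1e-12))  # still finds the same group
print(group_containing(groups, 5.0))          # []
```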

PS I think the following vectorised `pandas` fragments assign group numbers to rows based on the near matches of numerics in a specified column:

    import pandas as pd

    def numclustergroup(x,col,maxgap):
        x=x.sort_values(col)
        # Start a new cluster wherever the gap to the previous value is at least maxgap
        x['cluster'] = (x[col].diff()>=maxgap).cumsum()
        return x

    def numclustergroup2(x,col,maxgap,maxrange):
        x=x.sort_values(col)
        x['cluster'] = (x[col].diff()>=maxgap).cumsum()
        # Gap to the previous value within each provisional cluster
        x['cdiff']=x.groupby('cluster')[col].diff()
        # Also start a new cluster once the running range exceeds maxrange
        x['cluster'] = ((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap)).cumsum()
        return x.drop(columns='cdiff')

    def numclustergroup3(x,col,maxgap,maxrange,maxminrange):
        x=x.sort_values(col)
        x['cluster'] = (x[col].diff()>=maxgap).cumsum()
        x['cdiff']=x.groupby('cluster')[col].diff()
        # Only start a new cluster if the gap to the previous value also exceeds maxminrange
        x['cluster'] = (((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap))
                        & (x[col].diff()>maxminrange)).cumsum()
        return x.drop(columns='cdiff')

    #Test
    uu=pd.DataFrame({'x':list(range(0,8)),'y':[1.3,2.1,7.4,3.9,4.6,2.5,2.4,2.52]})
    numclustergroup(uu,'y',0.35)
    numclustergroup2(uu,'y',0.35,0.41)
    numclustergroup3(uu,'y',0.35,0.41,0.25)

The basic idea is to generate logical tests that evaluate as `True` whenever you want to increase the group number.
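To see why a cumulative sum over such tests works as a group number, here is a minimal sketch on a bare `Series`, using the same numbers as earlier. The names `new_group` and `labels` are just illustrative, and I have used a strict `>` so that a gap of exactly the threshold stays in the group, as in the list-based `cluster`; that choice is an assumption on my part:

```python
import pandas as pd

# Values already sorted, as the fragments above ensure with sort_values()
s = pd.Series([2.1, 2.4, 2.5, 2.52, 3.9, 4.6, 7.4])

# True wherever the gap to the previous value is too big.
# The first diff is NaN, and NaN > 0.35 evaluates as False.
new_group = s.diff() > 0.35

# Each True bumps the running total by one, giving a group number per row
labels = new_group.cumsum()

print(pd.DataFrame({'value': s, 'new_group': new_group, 'cluster': labels}))
```

The first four rows share cluster 0, and each of the last three values triggers a `True` and so gets a cluster number of its own.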