# Grouping Numbers that are Nearly the Same – Casual Clustering

I had a couple of reasons for tinkering with WRC rally data this year, over and above the obvious one of wanting to find a way to engage with motorsport at a data level. Specifically, I wanted a context for thinking a bit more about ways of generating (commentary) text from timing data, as well as a “safe” environment in which I could look for ways of identifying *features* (or *storypoints*) in the data that might provide a basis for making *interesting* text comments.

One way in to finding features is to look at a visual representation of the data (that is, just look at charts) and see what jumps out… If anything does, then you can ponder ways of automating the detection or recognition of those visually compelling features, or things that correspond to them, or proxy for them, in some way. I’ll give an example of that in the *next* post in this series, but for now, let’s consider the following question: *how can we group numbers that are nearly the same?* For example, if I have a set of stage split times, how can I identify groups of drivers that have recorded exactly, or even just *nearly*, the same time?

Via StackOverflow, I found the following handy fragment:

```python
def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
    differ by no more than *maxgap*

    >>> cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
    [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

    >>> cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
    [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]
    '''
    data.sort()  # note: sorts the caller's list in place
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35))
# [[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]
```

It struck me that a tweak to the code could limit the range of any grouping relative to a *maximum* distance between the first and the last number in any particular grouping – maybe I don’t want a group to have a range more than 0.41 for example (that is, strictly more than a dodgy floating point 0.4…):

```python
def cluster2(data, maxgap, maxrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        # Distance from the first member of the current group
        inmaxrange = True if maxrange is None else abs(x - groups[-1][0]) <= maxrange
        if abs(x - groups[-1][-1]) <= maxgap and inmaxrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster2([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41))
# [[2.1, 2.4, 2.5], [2.52], [3.9], [4.6], [7.4]]
```

A downside of this is that we might argue we have mistakenly omitted a number that is very close to the last number in the previous group, when we should rightfully have included it, because it’s not really very far away from a number that is close to the group range threshold value. In the example, 2.52 ends up in a group of its own even though it is only 0.02 away from 2.5…

In which case, we might pull back into a group any numbers that are *really* close to the current last member of the group, irrespective of whether we have passed the originally specified group range:

```python
def cluster3(data, maxgap, maxrange=None, maxminrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x - groups[-1][0]) <= maxrange
        # Close enough to the last member to be pulled in regardless of range
        inmaxminrange = False if maxminrange is None else abs(x - groups[-1][-1]) <= maxminrange
        if (abs(x - groups[-1][-1]) <= maxgap and inmaxrange) or inmaxminrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25))
# [[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]
```

With these simple fragments, I can now find groups of times that are reasonably close to each other.
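To tie this back to the original question about drivers, here is a minimal sketch of the same gap-based grouping applied to (driver, time) pairs rather than bare numbers. The `cluster_by_key` helper, the driver names and the times are all made up for illustration; they are not real WRC data.

```python
def cluster_by_key(items, key, maxgap):
    """Group items whose key differs from the previous sorted item's key by no more than maxgap."""
    ordered = sorted(items, key=key)  # sorted copy, so the caller's list is left alone
    if not ordered:
        return []
    groups = [[ordered[0]]]
    for item in ordered[1:]:
        if key(item) - key(groups[-1][-1]) <= maxgap:
            groups[-1].append(item)
        else:
            groups.append([item])
    return groups

# Hypothetical split times in seconds
split_times = [
    ("Driver A", 182.4),
    ("Driver B", 185.9),
    ("Driver C", 182.5),
    ("Driver D", 183.1),
    ("Driver E", 185.8),
]

close_groups = cluster_by_key(split_times, key=lambda row: row[1], maxgap=0.2)
driver_groups = [[driver for driver, _ in group] for group in close_groups]
print(driver_groups)
```

Groups with more than one driver in them are then candidate storypoints of the "these drivers were neck and neck at this split" kind.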

I can also look for times that are close to other times:

```python
trythis = [x for x in cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25) if 2.4 in x]
trythis[0] if len(trythis) else ''
# [2.1, 2.4, 2.5, 2.52]
```
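The lookup above relies on an exact `in` test, which can be fragile for floats that have been through some arithmetic. A possible variation, sketched here with a hypothetical `group_containing` helper and an arbitrary tolerance, is to match members with `math.isclose` instead:

```python
import math

def group_containing(groups, value, tol=1e-9):
    """Return the first group with a member within tol of value, or an empty list."""
    for group in groups:
        if any(math.isclose(member, value, abs_tol=tol) for member in group):
            return group
    return []

# Groups as produced by the grouping functions above
groups = [[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

print(group_containing(groups, 2.4))
print(group_containing(groups, 5.0))
```

Returning an empty list when nothing matches keeps the result type consistent, rather than mixing lists and empty strings.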

PS I think the following vectorised `pandas` fragments assign group numbers to rows based on the near matches of numerics in a specified column:

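```python
import pandas as pd

def numclustergroup(x, col, maxgap):
    x = x.sort_values(col)
    # Start a new cluster wherever the gap to the previous row reaches maxgap
    # (note >= here, whereas cluster() above keeps gaps equal to maxgap together)
    x['cluster'] = (x[col].diff() >= maxgap).cumsum()
    return x

def numclustergroup2(x, col, maxgap, maxrange):
    x = x.sort_values(col)
    x['cluster'] = (x[col].diff() >= maxgap).cumsum()
    # Gap to the previous row within each provisional cluster; its running
    # sum is the distance from the first member of that cluster
    x['cdiff'] = x.groupby('cluster')[col].diff()
    x['cluster'] = ((x.groupby('cluster')['cdiff'].cumsum() > maxrange) | (x[col].diff() >= maxgap)).cumsum()
    return x.drop(columns='cdiff')

def numclustergroup3(x, col, maxgap, maxrange, maxminrange):
    x = x.sort_values(col)
    x['cluster'] = (x[col].diff() >= maxgap).cumsum()
    x['cdiff'] = x.groupby('cluster')[col].diff()
    # Only start a new cluster if the row is also more than maxminrange from its predecessor
    x['cluster'] = (((x.groupby('cluster')['cdiff'].cumsum() > maxrange) | (x[col].diff() >= maxgap)) & (x[col].diff() > maxminrange)).cumsum()
    return x.drop(columns='cdiff')

#Test
uu = pd.DataFrame({'x': list(range(0, 8)), 'y': [1.3, 2.1, 7.4, 3.9, 4.6, 2.5, 2.4, 2.52]})
numclustergroup(uu, 'y', 0.35)
numclustergroup2(uu, 'y', 0.35, 0.41)
numclustergroup3(uu, 'y', 0.35, 0.41, 0.25)
```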

The basic idea is to generate logical tests that evaluate as `True` whenever you want to increase the group number.
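To make that idea concrete without needing `pandas`, here is a rough plain-Python sketch of the same trick: flag each value that should start a new group, then take a running total of the flags to get group numbers. It uses a strict `>` test on the gap, so it lines up with the `<= maxgap` test in `cluster`; the variable names are just for illustration.

```python
from itertools import accumulate

values = sorted([2.1, 7.4, 3.9, 4.6, 2.5, 2.4, 2.52])
maxgap = 0.35

# True wherever the gap to the previous value is big enough to start a new group;
# the first value has no predecessor, so it never starts a *new* group
starts_new_group = [False] + [b - a > maxgap for a, b in zip(values, values[1:])]

# A running total of the flags gives each value its group number
group_ids = list(accumulate(int(flag) for flag in starts_new_group))

print(list(zip(values, group_ids)))
```

Extra conditions, such as a maximum group range, then become extra boolean terms combined into the flag before the running total is taken.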

Hi Tony, My guess is that some of this is Jenks Natural Breaks. Best wishes, Andy

Hi Andy – Thanks for that reference… I really should be a bit more disciplined in finding proven approaches, but sometimes I just want to hack stuff out that sort of works! ;-)

Note to self: https://gist.github.com/urschrei/6662164 or https://github.com/perrygeo/jenks, which suggest you need to specify the number of groups? Which I didn’t want to do…