Grouping Numbers that are Nearly the Same – Casual Clustering

A couple of reasons for tinkering with WRC rally data this year, over and the above the obvious of wanting to find a way to engage with motorsport at a data level, specifically, I wanted a context for thinking a bit more about ways of generating (commentary) text from timing data, as well as a “safe” environment in which I could look for ways of identifying features (or storypoints) in the data that might provide a basis for making interesting text comments.

One way in to finding features is to look at a visual representations of the data (that is, just look at charts) and see what jumps out… If anything does, then you can ponder ways of automating the detection or recognition of those visually compelling features, or things that correspond to them, or proxy for them, in some way. I’ll give an example of that in the next post in this series, but for now, let’s consider the following question:how can we group numbers that are nearly the same? For example, if I have a set of stage split times, how can I identify groups of drivers that have recorded exactly, or even just nearly, the same time?

Via StackOverflow, I found the following handy fragment:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

It struck me that a tweak to the code could limit the range of any grouping relative to a maximum distance between the first and the last number in any particular grouping – maybe I don’t want a group to have a range more than 0.41 for example (that is, strictly more than a dodgy floating point 0.4…):

def cluster2(data, maxgap, maxrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x-groups[-1][0]) <=maxrange
        if abs(x - groups[-1][-1]) <= maxgap and inmaxrange:
            groups[-1].append(x)
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster2([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41))
[[2.1, 2.4, 2.5], [2.52], [3.9], [4.6], [7.4]]

A downside of this is we might argue we have mistakenly omitted a number that is very close to the last number in the previous group, when we should rightfully have included it, because it’s not really very far away from a number that is close to the group range threshold value…

In which case, we might pull back numbers into a group that are really close to the current last member in the group irrespective of whether we past the originally specified group range:

def cluster3(data, maxgap, maxrange=None, maxminrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x-groups[-1][0])<=maxrange
        inmaxminrange = False if maxminrange is None else abs(x-groups[-1][-1])<=maxminrange
        if (abs(x - groups[-1][-1]) <= maxgap and inmaxrange) or inmaxminrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

With these simple fragments, I can now find groups of times that are reasonably close to each other.

I can also look for times that are close to other times:

trythis = [x for x in cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25) if 2.4 in x]
trythis[0] if len(trythis) else ''
[2.1, 2.4, 2.5, 2.52]

PS I think the following vectorised pandas fragments assign group numbers to rows based on the near matches of numerics in a specified column:

def numclustergroup(x,col,maxgap):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    return x

def numclustergroup2(x,col,maxgap,maxrange):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    x['cdiff']=x.groupby('cluster')[col].diff()
    x['cluster'] = ((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap)).cumsum()
    return x.drop('cdiff',1)

def numclustergroup3(x,col,maxgap,maxrange,maxminrange):
    x=x.sort_values(col)
    x['cluster'] = (x[col].diff()>=maxgap).cumsum()
    x['cdiff']=x.groupby('cluster')[col].diff()
    x['cluster'] = (((x.groupby('cluster')['cdiff'].cumsum()>maxrange) | (x[col].diff()>=maxgap)) & (x[col].diff()>maxminrange) ).cumsum()
    return x.drop('cdiff',1)

#Test
uu=pd.DataFrame({'x':list(range(0,8)),'y':[1.3,2.1,7.4,3.9,4.6,2.5,2.4,2.52]})
numclustergroup(uu,'y',0.35)
numclustergroup2(uu,'y',0.35,0.41)
numclustergroup3(uu,'y',0.35,0.41,0.25)

The basic idea is to generate logical tests that evaluate as True whenever you want to increase the group number.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

3 thoughts on “Grouping Numbers that are Nearly the Same – Casual Clustering”

    1. Hi Andy – Thanks for that reference… I really should be a bit more disciplined in finding proven approaches, but sometimes I just want to hack stuff out that sort of works! ;-)

Comments are closed.

%d bloggers like this: