Grouping Numbers that are Nearly the Same – Casual Clustering

A couple of reasons for tinkering with WRC rally data this year, over and above the obvious one of wanting to find a way to engage with motorsport at a data level: specifically, I wanted a context for thinking a bit more about ways of generating (commentary) text from timing data, as well as a “safe” environment in which I could look for ways of identifying features (or storypoints) in the data that might provide a basis for making interesting text comments.

One way in to finding features is to look at visual representations of the data (that is, just look at charts) and see what jumps out… If anything does, then you can ponder ways of automating the detection or recognition of those visually compelling features, or things that correspond to them, or proxy for them, in some way. I’ll give an example of that in the next post in this series, but for now, let’s consider the following question: how can we group numbers that are nearly the same? For example, if I have a set of stage split times, how can I identify groups of drivers that have recorded exactly, or even just nearly, the same time?

Via StackOverflow, I found the following handy fragment:

def cluster(data, maxgap):
    '''Arrange data into groups where successive elements
       differ by no more than *maxgap*

        cluster([1, 6, 9, 100, 102, 105, 109, 134, 139], maxgap=10)
        [[1, 6, 9], [100, 102, 105, 109], [134, 139]]

        cluster([1, 6, 9, 99, 100, 102, 105, 134, 139, 141], maxgap=10)
        [[1, 6, 9], [99, 100, 102, 105], [134, 139, 141]]

    '''
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        if abs(x - groups[-1][-1]) <= maxgap:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

It struck me that a tweak to the code could limit the range of any grouping by setting a maximum distance between the first and the last number in any particular grouping – maybe I don’t want a group to have a range of more than 0.41, for example (that is, strictly more than a dodgy floating point 0.4…):

def cluster2(data, maxgap, maxrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        # Optionally also require x to be within maxrange of the first item in the group
        inmaxrange = True if maxrange is None else abs(x - groups[-1][0]) <= maxrange
        if abs(x - groups[-1][-1]) <= maxgap and inmaxrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster2([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41))
[[2.1, 2.4, 2.5], [2.52], [3.9], [4.6], [7.4]]

A downside of this is that we might argue we have mistakenly excluded a number when we should rightfully have included it: it may be very close to the last member of the previous group, but fall just outside that group’s range threshold…

In which case, we might pull numbers back into a group if they are really close to the group’s current last member, irrespective of whether we have passed the originally specified group range:

def cluster3(data, maxgap, maxrange=None, maxminrange=None):
    data.sort()
    groups = [[data[0]]]
    for x in data[1:]:
        inmaxrange = True if maxrange is None else abs(x - groups[-1][0]) <= maxrange
        # Pull x back into the group if it is within maxminrange of the group's last member
        inmaxminrange = False if maxminrange is None else abs(x - groups[-1][-1]) <= maxminrange
        if (abs(x - groups[-1][-1]) <= maxgap and inmaxrange) or inmaxminrange:
            groups[-1].append(x)
        else:
            groups.append([x])
    return groups

print(cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25))
[[2.1, 2.4, 2.5, 2.52], [3.9], [4.6], [7.4]]

With these simple fragments, I can now find groups of times that are reasonably close to each other.

I can also look for times that are close to other times:

trythis = [x for x in cluster3([2.1,7.4,3.9,4.6,2.5,2.4,2.52],0.35,0.41,0.25) if 2.4 in x]
trythis[0] if len(trythis) else ''
[2.1, 2.4, 2.5, 2.52]

PS I think the following vectorised pandas fragments assign group numbers to rows based on near matches of the numeric values in a specified column:

import pandas as pd

def numclustergroup(x, col, maxgap):
    x = x.sort_values(col)
    # Start a new group wherever the gap to the previous value exceeds maxgap
    x['cluster'] = (x[col].diff() > maxgap).cumsum()
    return x

def numclustergroup2(x, col, maxgap, maxrange):
    x = x.sort_values(col)
    x['cluster'] = (x[col].diff() > maxgap).cumsum()
    # Also start a new group if the cumulative spread within a group exceeds maxrange
    x['cdiff'] = x.groupby('cluster')[col].diff()
    x['cluster'] = ((x.groupby('cluster')['cdiff'].cumsum() > maxrange) | (x[col].diff() > maxgap)).cumsum()
    return x.drop('cdiff', axis=1)

def numclustergroup3(x, col, maxgap, maxrange, maxminrange):
    x = x.sort_values(col)
    x['cluster'] = (x[col].diff() > maxgap).cumsum()
    x['cdiff'] = x.groupby('cluster')[col].diff()
    # ...unless the value is within maxminrange of the previous value, in which case keep it in the group
    x['cluster'] = (((x.groupby('cluster')['cdiff'].cumsum() > maxrange) | (x[col].diff() > maxgap)) & (x[col].diff() > maxminrange)).cumsum()
    return x.drop('cdiff', axis=1)

#Test
uu=pd.DataFrame({'x':list(range(0,8)),'y':[1.3,2.1,7.4,3.9,4.6,2.5,2.4,2.52]})
numclustergroup(uu,'y',0.35)
numclustergroup2(uu,'y',0.35,0.41)
numclustergroup3(uu,'y',0.35,0.41,0.25)

The basic idea is to generate logical tests that evaluate as True whenever you want to increase the group number.
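By way of illustration, here’s the boolean flag / cumulative sum trick on its own, using the same example values as before (the column names are just for illustration):

import pandas as pd

# The same example values as above, sorted so that near neighbours sit next to each other
df = pd.DataFrame({'value': [2.1, 7.4, 3.9, 4.6, 2.5, 2.4, 2.52]}).sort_values('value')

# True wherever the gap to the previous value is big enough to start a new group
df['newgroup'] = df['value'].diff() > 0.35

# Each True increments the running total, so every row picks up a group number
df['cluster'] = df['newgroup'].cumsum()

print(df)

Here the four values between 2.1 and 2.52 all pick up cluster number 0, matching the output of the original cluster() function.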

Data Journalists Engaging in Co-Innovation…

You may or may not have noticed that the Boundary Commission released their take on proposed parliamentary constituency boundaries today.

They could have released the data – as data – in the form of shape files that can be rendered at the click of a button in things like Google Maps… but they didn’t… [The one thing the Boundary Commission quango forgot to produce: a map] (There are issues with publishing the actual shapefiles, of course. For one thing, the boundaries may yet change – and if the original shapefiles are left hanging around, people may start to draw on these now incorrect sources of data once the boundaries are fixed. But that’s a minor issue…)

Instead, you have to download a series of hefty PDFs, one per region, to get a flavour of the boundary changes. Drawing a direct comparison with the current boundaries is not possible.

The make-up of the actual constituencies appears to be based on their member wards, data which is provided in a series of spreadsheets, one per region, each containing several sheets describing the ward makeup of each new constituency for the counties in the corresponding region.

It didn’t take long for the data junkies to get on the case though. From my perspective, the first map I saw was on the Guardian Datastore, reusing work by University of Sheffield academic Alasdair Rae, apparently created using Google Fusion Tables (though I haven’t seen a recipe published anywhere? Or a link to the KML file that I saw Guardian Datablog editor Simon Rogers/@smfrogers tweet about?)

[I knew I should have grabbed a screen shot of the original map…:-(]

It appears that Conrad Quilty-Harper (@coneee) over at the Telegraph then got on the case, drawing on Rae’s work as published on the Datablog to come up with a comparative map that shows the current boundaries alongside the proposed changes, and that ties the two maps together so the zoom level and focus are matched across them (MPs’ constituencies: boundary changes mapped):

Telegraph side by side map comparison

Interestingly, I was alerted to this map by Simon tweeting that he liked the Telegraph map so much, they’d reused the idea (and maybe even the code?) on the Guardian site. Here’s a snapshot of the conversation between these two data journalists over the course of the day (reverse chronological order):

Datajournalists in co-operative bootstrapping mode

Here’s the handshake…

Collaborative co-evolution

I absolutely love this… and what’s more, it happened over the course of four or five hours, with a couple of technology/knowledge transfers along the way, as well as evolution in the way both news agencies communicated the information compared to the way the Boundary Commission released it. (If I was evil, I’d try to FOI the Boundary Commission to see how much time, effort and expense went into their communication effort around the proposed changes, and would then try to guesstimate how much the Guardian and Telegraph teams put into it as a comparison…)

At the time of writing (15.30), the BBC have no data-driven take on this story…

And out of interest, I also wondered whether Sheffield U had a take…

Sheffield University media site

Maybe not…

PS By the by, the DataDrivenJournalism.net website relaunched today. I’m honoured to be on the editorial board, along with @paulbradshaw @nicolaskb @mirkolorenz @smfrogers and @stiles, and looking forward to seeing how we can start to drive interest, engagement and skills development in, as well as analysis and (re)use of, and commentary on, public open data through the data journalism route…

PPS if you’re into data journalism, you may also be interested in GetTheData.org, a question and answer site in the model of Stack Overflow, with an emphasis on Q&A around how to find, access, and make use of open and public datasets.

Data DOIs

Okay, here’s another Friday twitter brainstorm capture post, this time arising from my responses to @jimdowning, who made a query in response to a tweet I made about an interesting looking “DOIs for data” proposal…

Here’s what I pondered:

– why might it be useful? Err, “eg allows to resolve either to Guardian data blog data on google docs or national stats copy of a data set?” That is, several of the data sets that have been republished by the Guardian on google docs duplicate (I think) data from National Statistics. A data DOI service could resolve to either of these data sets depending on a user’s preferences…

Hmmm… ;-)
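As a purely hypothetical sketch of that resolution idea – the registry contents, source labels and URLs below are all invented placeholders rather than real services – it might look something like this:

# A toy data DOI registry: one identifier, several copies of the dataset,
# with a user preference deciding which copy to resolve to. Everything here
# is a made-up placeholder for illustration only.
DATA_DOI_REGISTRY = {
    'data-doi:nnn-nnn.n': {
        'guardian-datastore': 'https://example.org/guardian-copy',
        'national-statistics': 'https://example.org/ons-copy',
    }
}

def resolve_data_doi(doi, preference='national-statistics'):
    copies = DATA_DOI_REGISTRY.get(doi, {})
    # Fall back to any available copy if the preferred source isn't registered
    return copies.get(preference) or next(iter(copies.values()), None)

print(resolve_data_doi('data-doi:nnn-nnn.n', preference='guardian-datastore'))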

But I can also imagine derived data DOIs that extend eg journal paper DOIs in a standardised way, and then point to data that relates to the original journal article. So for example, an article DOI such as doi:nnn-nnn.n might be used to generate a data DOI that extends the original DOI, such as doi:nnn-nnn.n-data; or we might imagine a parallel data DOI resolution service that reuses the same DOI: data-doi:nnn-nnn.n.

Where multiple data sets are associated with an article, it might be pragmatic to add a suffix to the doi, such as data-doi:nnn-nnn.n-M to represent the M’th dataset associated with the article? For only one dataset, it could be identified as data-doi:nnn-nnn.n-0 (always count from 0, right?;-), with data-doi:nnn-nnn.n (i.e. no suffix) returning the number of data sets associated with the article?
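Just to make the shape of those identifiers concrete, here’s a minimal sketch of the suffixing convention – the helper function and the placeholder DOIs are purely illustrative:

# Derive a data DOI from an article DOI by switching the scheme and, optionally,
# appending an index for the M'th dataset (counting from 0). Placeholder DOIs only.
def data_doi(article_doi, dataset_index=None):
    base = article_doi.replace('doi:', 'data-doi:', 1)
    if dataset_index is None:
        # No suffix: as suggested above, this form might return the number of datasets
        return base
    return '{0}-{1}'.format(base, dataset_index)

print(data_doi('doi:nnn-nnn.n'))     # data-doi:nnn-nnn.n
print(data_doi('doi:nnn-nnn.n', 0))  # data-doi:nnn-nnn.n-0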

PS hmmm this reminds me of something the name of which I forget (cf. image extraction from PDFs), where assets associated with an article are unbundled from the article (images, tabulated data and so on); how are these things referenced? Are references derivable from the parent URI?

PPS Maybe related? (I really need to get round to reading this…) How Data RSS Might Work.