Finding (Nearly) Duplicate Items in a Data Column

[WARNING – THIS IS A *BAD ADVICE* POST – it describes a trick that sort of works, but the example is contrived and has a better solution – text facet and then cluster on facet (h/t to @mhawksey’s Mining and OpenRefine(ing) JISCMail: A look at OER-DISCUSS [Listserv] for making me squirm so much through that oversight I felt the need to post this warning…]

Suppose you have a dataset containing a list of Twitter updates, and you are looking for tweets that are retweets or modified retweets of the same original tweet. The OpenRefine Duplicate custom facet will identify different row items in that column that are exact duplicates of each other, but what about when they just don’t quite match: for example, an original tweet and it’s appearance in an RT (where the retweet string contains RT and the name of the original sender), or an MT, where the tweet may have been shortened, or an RT of an RT, or an RT with an additional hashtag. Here’s one strategy for finding similar-ish tweets, such as popular retweets, in the data set using custom text facets in OpenRefine.

The Ngram GREL function generates a list of word ngrams of a specified length from a string. If you imagine a sliding window N words long, the first ngram will be the first N words in the string, the second ngram the second word to the second+N’th word, and so on:

If we can reasonably expect word sequences of length N to appear in out “duplicate-ish” strings, we can generate a facet on ngrams of that length.

It may also be worth experimenting with combining the ngram function with the fingerprint GREL function. The fingerprint function identifies unique words in a string, reduces them to lower case, and then sorts them in alphabetical order:

If we generate the fingerprint of a string, and then run the ngram function, we generate ngrams around the alphabetically ordered fingerprint terms:

For a sizeable dataset, and/or long strings, it’s likely that we’ll get a lot of ngram facet terms:

We could list them all by setting an appropriate choice count limit, or we can limit the facet items to be displayed by displaying the choice counts, setting the range slider to show those facet values that appear in a large number of columns for example, and then ordering the facet items by count:

Even if tweets aren’t identical, if they contain common ngrams we can pull them out.

Note that we might then order the tweets as displayed in the table using time/date order (a note on string to time format conversions can be found in this postFacets in OpenRefine):

Or alternatively, we might choose to view them using a time facet:

Note that when you set a range in the time facet, you can then click on it and slide it as a range, essentially providing a sliding time window control for viewing records that appear over a given time range/duration.

Author: Tony Hirst

I'm a Senior Lecturer at The Open University, with an interest in #opendata policy and practice, as well as general web tinkering...

One thought on “Finding (Nearly) Duplicate Items in a Data Column”

Comments are closed.