Data Cleaning – Finding Near Matches in Names

In the post What Nationality Did You Say You Were, Again? I showed how we could use the fuzzyset python library to try to reconcile user supplied nationalities entered via a free text entry form to “preferred” nationalities listed in the FCO Register of Country Names.

Here’s another quick example of how to use fuzzyset to help clean a list of names, possibly repeated, that may include near misses or partial matches.

import pandas as pd
names=['S. Smith', 'John Smith','James Brown','John Brown','T. Smith','John Brown']
df=pd.DataFrame({'name':names})

# Set the thresh value (0..1) to tweak match strength
thresh=0.8

import fuzzyset
names=df['name'].tolist()

cleaner = fuzzyset.FuzzySet()
collisions=[]
for name in names:
    maybe=cleaner.get(name)
    # If there is a possible match, get a list of tuples back: (score, matchstring)
    # The following line finds the highest match score
    m=max(maybe,key=lambda item:item[0]) if maybe is not None else (-1,'')
    # If there is no match, or the max match score is below the threshold value,
    if not maybe or m[0] < thresh:         # assume that it's not a match and add name to list of "clean" names…         cleaner.add(name)     elif m[0] >= thresh:
        # But if there is a possible match, print a warning
        txt='assuming {} is a match with {} ({}) so not adding'.format(name,m[1],m[0])
        print(txt)
        # and add the name to a collisions list
        collisions.append((name,m))

#Now generate a simple report
print('------\n\n# Cleaning Report:\n\n## Match Set:\n{}\n\n------\n\n## Collisions:\n{}'.format(cleaner.exact_set, collisions))

The report looks something like this:

Sometimes, you may want to be alerted to exact matches; for example, if you are expecting the values to all be unique.

However, at other times, you may be happy to ignore duplicates, in which case you might consider dropping them from the names list. One way to do this is to convert the names list to a set, and back again, names=list(set(names)), although this changes the list order.

Alternatively, from the pandas dataframe column, just take unique values: names=df['name'].unique().tolist().

You may also want to know how many times duplicate (exact matches) occur. In such a case, we can list items that appear at list twice in the names list using the pandas dataframe value_counts() method:

#Get a count of the number of times each value occurs in a column, along with the value
vals=df['name'].value_counts()
#Select items where the value count is greater than one
vals[vals > 1]
#John Brown    2

PS Another way of detecting, and correcting, near matches is to use an application such as OpenRefine, in particular its clustering tool:

2 comments