In the post What Nationality Did You Say You Were, Again? I showed how we could use the fuzzyset
python library to try to reconcile user supplied nationalities entered via a free text entry form to “preferred” nationalities listed in the FCO Register of Country Names.
Here’s another quick example of how to use fuzzyset
to help clean a list of names, possibly repeated, that may include near misses or partial matches.
import pandas as pd names=['S. Smith', 'John Smith','James Brown','John Brown','T. Smith','John Brown'] df=pd.DataFrame({'name':names}) # Set the thresh value (0..1) to tweak match strength thresh=0.8 import fuzzyset names=df['name'].tolist() cleaner = fuzzyset.FuzzySet() collisions=[] for name in names: maybe=cleaner.get(name) # If there is a possible match, get a list of tuples back: (score, matchstring) # The following line finds the highest match score m=max(maybe,key=lambda item:item[0]) if maybe is not None else (-1,'') # If there is no match, or the max match score is below the threshold value, if not maybe or m[0] < thresh: # assume that it's not a match and add name to list of "clean" names… cleaner.add(name) elif m[0] >= thresh: # But if there is a possible match, print a warning txt='assuming {} is a match with {} ({}) so not adding'.format(name,m[1],m[0]) print(txt) # and add the name to a collisions list collisions.append((name,m)) #Now generate a simple report print('------\n\n# Cleaning Report:\n\n## Match Set:\n{}\n\n------\n\n## Collisions:\n{}'.format(cleaner.exact_set, collisions))
The report looks something like this:
Sometimes, you may want to be alerted to exact matches; for example, if you are expecting the values to all be unique.
However, at other times, you may be happy to ignore duplicates, in which case you might consider dropping them from the names
list. One way to do this is to convert the names
list to a set, and back again, names=list(set(names))
, although this changes the list order.
Alternatively, from the pandas dataframe column, just take unique values: names=df['name'].unique().tolist()
.
You may also want to know how many times duplicate (exact matches) occur. In such a case, we can list items that appear at list twice in the names list using the pandas dataframe value_counts()
method:
#Get a count of the number of times each value occurs in a column, along with the value vals=df['name'].value_counts() #Select items where the value count is greater than one vals[vals > 1] #John Brown 2
PS Another way of detecting, and correcting, near matches is to use an application such as OpenRefine, in particular its clustering tool:
I’ve had good success reconciling large national datasets of place names in OpenRefine using the Reconcile-csv server:
http://okfnlabs.org/reconcile-csv/
@sorbus I started looking at reconciler containers some time ago; IIRC it worked but linking containers was a bit clunky