Evaluating an Excel Fuzzy Matching Macro Using Logistic Regression in R

For work today I needed to compare a list of names (and spousal names, when available) against an in-house database dump of names.  In an ideal world, the first and last names of the two lists would all match and that would be the end of the story.  Obviously it’s not an ideal world, and names don’t always match.

That being said, the task was this: after matching up last names exactly between both lists, try to find matches on first name that may or may not be exact.  The first name could be the primary first name, or it could be the first name of the spouse.  

The first thing I looked for was a fuzzy matching macro that I could easily use in excel that would give me a few scores: (1) First name match score, (2) Spouse’s First name match score, (3) Spouse’s last name match score; to be an exact match rather than a fuzzy one.

I found a fuzzy matching script that allowed me to make these comparisons across columns in the same row here, inserted it as a VBA module function, and compared the names across lists (within the same row, remember) for all records.

Once I started sorting through the potential matches based on how close to 1 the scores were, I started noticing some extra variables that seemed to predict a decent match.  For one, when there was a parenthetical insertion in the first name in the in-house dump of names, that row seemed to be more likely to be a match (e.g. first name in the in-house list was “William (Bill)”).  Don’t ask me why that happened!  Also, another thing I noticed was that if there was an ampersand in the first name in the in-house list, that row seemed to be more likely to be a match (e.g. first name in the in-house list was “Richard & Donna”).  Again, don’t ask me why that happened.  I haven’t a clue!

I coded the file for what seemed to be a match and what did not, and then was curious to enter in the Scores and matching column into R to do a Logistic Regression and find out how well I could model my own choices of matches between lists based on the fuzzy matching scores.  

 

Again, the First.Name.Match variable was a continuous fuzzy matching score, Parenthetical and Ampersand were simple binary variables describing whether or not a parenthetical insertion or ampersand were within the first name of the in-house list, Spouse.First.Name.Match:Spouse.Last.Name.Match is an interaction term stating that so long as the Spouse’s Last Name is an exact match, run the Spouse.First.Name.Match scores against event probability.

As you can see, all variables significantly predicted the probability of me judging that row to be a probable match.  In fact, when I selected those rows with a predicted event probability of .5 or higher, about 78% of my actual matches showed up.

In conclusion, if strapped for time, this fuzzy matching script is actually quite nice.

Advertisements

Leave a Reply

Fill in your details below or click an icon to log in:

WordPress.com Logo

You are commenting using your WordPress.com account. Log Out / Change )

Twitter picture

You are commenting using your Twitter account. Log Out / Change )

Facebook photo

You are commenting using your Facebook account. Log Out / Change )

Google+ photo

You are commenting using your Google+ account. Log Out / Change )

Connecting to %s