lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Using Lucene to find duplicate/similar names
Date Wed, 16 Apr 2008 16:11:08 GMT
I believe there were some posts on this about a year ago.  Try  
searching in the archives for duplicate names, as well as "record  
linkage" or any other various synonyms that you can think of.  The  
short answer is Lucene is reasonable to attempt this with, but you may  
need some help.  The long answer is to dig into those archives and see  
the other recommendations.


On Apr 16, 2008, at 12:37 PM, Andy DePue wrote:

> I'm new to Lucene, and would like to use it to find duplicate (or  
> similar) names in a contact list.  Is Lucene a good fit?
> We have a form where a user enters a company or person's name, and  
> we want the system to warn them if there is already a company or  
> person entered with the same or similar name.
> Based on the little I know of Lucene, I'm thinking an NGram  
> algorithm (based on characters, not words) would work best... but,  
> I'm not sure if Lucene takes proximity or edit distances into  
> account?  For example, say you have these two names:
> Andrew John
> John Andrew
> If a user enters Andy John, without proximity or edit distance,  
> these two names will match about the same, while, obviously, the  
> first name should be ranked higher.
> Thanks in advance for any help or advice.
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll

Lucene Helpful Hints:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message