lucene-java-user mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Using Lucene to find duplicate/similar names
Date Wed, 16 Apr 2008 21:56:04 GMT
NGrams will do OK, but it depends a lot on what you are up to. If a person is
looking at the result lists and making the decision, it will work fine, as the
default TF/IDF similarity will give you a reasonable ordering of hits. But if you
need to set some cutoff value to decide automatically whether this is a match or
not, then you should start digging deeper into record linkage theory.
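
To make the cutoff problem concrete, here is a minimal sketch in plain Java. It is not Lucene's TF/IDF scoring; it is a simple Dice coefficient over character trigram sets, and the class name NGramOverlap, the "_" padding and the choice of n = 3 are all assumptions made for illustration. It uses the names from the original question.

```java
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;

// Illustrative only: a set-based character n-gram overlap, NOT Lucene's scoring.
public class NGramOverlap {

    // Build the set of character n-grams; whitespace runs become "_" and the
    // string is padded with "_" so word boundaries produce their own grams.
    static Set<String> ngrams(String s, int n) {
        String padded = "_" + s.trim().toLowerCase(Locale.ROOT).replaceAll("\\s+", "_") + "_";
        Set<String> grams = new HashSet<>();
        for (int i = 0; i + n <= padded.length(); i++) {
            grams.add(padded.substring(i, i + n));
        }
        return grams;
    }

    // Dice coefficient: 2 * |A and B| / (|A| + |B|), in the range [0, 1].
    static double dice(String a, String b, int n) {
        Set<String> ga = ngrams(a, n);
        Set<String> gb = ngrams(b, n);
        if (ga.isEmpty() || gb.isEmpty()) {
            return 0.0; // too short to produce any n-gram
        }
        Set<String> common = new HashSet<>(ga);
        common.retainAll(gb);
        return 2.0 * common.size() / (ga.size() + gb.size());
    }

    public static void main(String[] args) {
        String query = "Andy John";
        for (String candidate : new String[] {"Andrew John", "John Andrew"}) {
            System.out.println(candidate + " -> " + dice(query, candidate, 3));
        }
    }
}
```

With this measure "Andrew John" and "John Andrew" get exactly the same score against "Andy John", because a set of n-grams ignores order. A person scanning the hit list can still pick the right one, but no single cutoff on this score separates the two candidates.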

Also, if you need to do deduplication and you have a lot of records, then this is
going to be slow: e.g. 20 million documents means you need to make 20 million
searches. There you need to pull some tricks.

And finally, make your analyzers smart: normalize the data, remove noise like
"Mr.", "Dr.", "senior"..., lowercase, strip punctuation and special characters...
the standard stuff.
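
As a rough sketch of that kind of normalization, here it is in plain Java, outside of any Lucene analyzer. The class name NameNormalizer and the NOISE word list are hypothetical and would need tuning for real data; in Lucene the same steps would normally live in the analysis chain applied at both index and query time.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Locale;
import java.util.Set;
import java.util.stream.Collectors;

// Illustrative only: a standalone name cleaner, not a Lucene Analyzer.
public class NameNormalizer {

    // Hypothetical noise list; extend it for your own data.
    private static final Set<String> NOISE = new HashSet<>(
            Arrays.asList("mr", "mrs", "ms", "dr", "prof", "senior", "sr", "junior", "jr"));

    static String normalize(String raw) {
        // Lowercase, then turn every run of non-letter, non-digit characters
        // (punctuation, special characters, whitespace) into a single space.
        String cleaned = raw.toLowerCase(Locale.ROOT)
                .replaceAll("[^\\p{L}\\p{Nd}]+", " ")
                .trim();
        if (cleaned.isEmpty()) {
            return "";
        }
        // Drop noise tokens and rejoin what is left.
        return Arrays.stream(cleaned.split(" "))
                .filter(token -> !NOISE.contains(token))
                .collect(Collectors.joining(" "));
    }

    public static void main(String[] args) {
        System.out.println(normalize("Dr. Andy  John, Senior")); // prints "andy john"
    }
}
```

Note that stripping punctuation this way splits names such as "O'Brien" into two tokens; whether that is desirable depends on the data.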

Have fun, this is a nice problem: it looks simple at first sight, but it is not!
I've spent the last 6 years doing things like this at 500 million+ document scale,
and it is fun. Lucene helps a lot there, it is a nice inverted index lib :)





----- Original Message ----
> From: Andy DePue <andy@marathon-man.com>
> To: java-user@lucene.apache.org
> Sent: Wednesday, 16 April, 2008 7:10:42 PM
> Subject: Re: Using Lucene to find duplicate/similar names
> 
> Thanks for the pointer.  I found the thread, and there is certainly some 
> interesting information there.  I'd like to stick to what Lucene has 
> available today, mainly because I lack the time to implement anything 
> more than that.  I originally thought Levenshtein, but then realized 
> that Lucene would probably have to do a whole index scan for that?  I 
> don't need anything too fancy, so I'm still wondering if NGram with some 
> sort of proximity ranking would do the trick.  By proximity, I mean, how 
> closely the NGrams in the document field match in proximity and order to 
> each other as the same NGrams in the search string.  I'm hoping NGrams 
> would avoid the need for a whole index scan.  Does Lucene already factor 
> this into its hit score, or would I need to do some custom work?
> 
>   - Andy
> 
> Grant Ingersoll wrote:
> > I believe there were some posts on this about a year ago.  Try 
> > searching in the archives for duplicate names, as well as "record 
> > linkage" or any other various synonyms that you can think of.  The 
> > short answer is Lucene is reasonable to attempt this with, but you may 
> > need some help.  The long answer is to dig into those archives and see 
> > the other recommendations.
> >
> > -Grant
> >
> > On Apr 16, 2008, at 12:37 PM, Andy DePue wrote:
> >
> >> I'm new to Lucene, and would like to use it to find duplicate (or 
> >> similar) names in a contact list.  Is Lucene a good fit?
> >> We have a form where a user enters a company or person's name, and we 
> >> want the system to warn them if there is already a company or person 
> >> entered with the same or similar name.
> >> Based on the little I know of Lucene, I'm thinking an NGram algorithm 
> >> (based on characters, not words) would work best... but, I'm not sure 
> >> if Lucene takes proximity or edit distances into account?  For 
> >> example, say you have these two names:
> >> Andrew John
> >> John Andrew
> >>
> >> If a user enters Andy John, without proximity or edit distance, these 
> >> two names will match about the same, while, obviously, the first name 
> >> should be ranked higher.
> >> Thanks in advance for any help or advice.
> >>
> >> ---------------------------------------------------------------------
> >> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> >> For additional commands, e-mail: java-user-help@lucene.apache.org
> >>
> >
> > --------------------------
> > Grant Ingersoll
> >
> > Lucene Helpful Hints:
> > http://wiki.apache.org/lucene-java/BasicsOfPerformance
> > http://wiki.apache.org/lucene-java/LuceneFAQ


