lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From moraleslos <morales...@hotmail.com>
Subject Re: Lucene for name matching
Date Fri, 06 Apr 2007 14:11:01 GMT

Thanks guys!

I really really appreciate your feedback.  I didn't know a "simple" problem
like People name matching would be this complicated.  I knew there will be
some unusual circumstances or rules, but I did not realize how much work has
been done to solve parts of the problem (string matching in general).  I'll
take a good look at what you've suggested and I'll let you know what
approach I took if I get this thing completed.

-los



Grant Ingersoll-5 wrote:
> 
> I agree, SecondString was helpful to me.  Also have a look at William  
> Winkler's work at the US Census.  We did similar things to come up  
> with blocking criteria to get an initial division into duplicates,  
> unique and undecided.  Then we refined on the undecided set.  No  
> approach is going to be perfect and you have to make decisions about  
> time spent versus quality.
> 
> 
> On Apr 6, 2007, at 4:26 AM, eks dev wrote:
> 
>> I've been doing this in past couple of years, and yes we use Lucene  
>> for some key parts of the problem.
>> Basically, the problem you face is on how to run extremely high  
>> recall without compromising precision, hard!
>>
>> the key problem is performance, imagine you have DB with 10Mio  
>> persons you need to match against 10Mio from another list. Where  
>> you start is 10E6 * 10E6 comparisons, e.g with pure Edit Distance,  
>> it would need a couple of centuries to finish. What you need to do   
>> is to  define clever  "blocking criteria"  in order to reduce this O 
>> (n^2) complexity curse. Lucene comes in handy for this.
>>
>> Another problem is fuzzy similarity in this game, you need somehow  
>> to create kind of "index" for Edit distance, have a look at  
>> Lingpipe spell checker. Also, I guess you need to support   
>> synonyms  like  William/Bill (no fuzzy) and other semantics  
>> constraints not modelled by Edit Distance likes.
>>
>> web:
>> - google for "Record Linkage"
>> - look at Cohen's Secondstring project
>> - http://datamining.anu.edu.au/projects/linkage.html - they have  
>> very nice Python prototype
>>
>> search for "Fellegi- Sunter" articles as these are classics....
>>
>> it is only hard to do it, but doable, we are doing it on c.a 200Mio  
>> lists.
>>
>> Unfortunately, my  company does not give back  to the community as  
>> I would like...
>> anyhow, I hope this can help you
>>
>>>>
>>>> I was wondering if anyone has done people name matching using
>>>> Lucene.  For
>>>> example, I have a name coming from some external source that I
>>>> would like to
>>>> match with the one I have in my DB.  Lets say my DB contains the
>>>> name "John
>>>> Smith".  If the external source has something like "Smith John",
>>>> "Smith,
>>>> John", "J. Smith", etc., I would like to rate this matching based
>>>> on some %
>>>> of closeness for review later.  I've searched around a bit for
>>>> algorithms
>>>> and I kept seeing the Levenshtein distance algorithm which I'm sure
>>>> Lucene
>>>> uses under the hood.  So I trying to guage if Lucene is useful for
>>>> doing
>>>> something specific as this, or are there better algorithms and/or
>>>> software
>>>> out there that does name matching.  Thanks in advance!
>>>>
>>>> -los
>>>> -- 
>>>> View this message in context: http://www.nabble.com/Lucene-for-name-
>>>> matching-tf3533454.html#a9862342
>>>> Sent from the Lucene - Java Users mailing list archive at  
>>>> Nabble.com.
>>>>
>>>>
>>>> -------------------------------------------------------------------- 
>>>> -
>>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>>
>>>
>>> --------------------------
>>> Grant Ingersoll
>>> Center for Natural Language Processing
>>> http://www.cnlp.org
>>>
>>> Read the Lucene Java FAQ at http://wiki.apache.org/jakarta-lucene/
>>> LuceneFAQ
>>>
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>>
>>
>> -- 
>> View this message in context: http://www.nabble.com/Lucene-for-name- 
>> matching-tf3533454.html#a9863587
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>>
>>
>>
>> 		
>> ___________________________________________________________
>> Now you can scan emails quickly with a reading pane. Get the new  
>> Yahoo! Mail. http://uk.docs.yahoo.com/nowyoucan.html
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
> 
> ------------------------------------------------------
> Grant Ingersoll
> http://www.grantingersoll.com/
> http://lucene.grantingersoll.com
> http://www.paperoftheweek.com/
> 
> 
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://www.nabble.com/Lucene-for-name-matching-tf3533454.html#a9872800
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Mime
View raw message