lucene-java-user mailing list archives

From eks dev <eks...@yahoo.co.uk>
Subject Re: Lucene for name matching
Date Fri, 06 Apr 2007 08:26:02 GMT
I've been doing this for the past couple of years, and yes, we use Lucene for some key parts
of the problem.
Basically, the challenge you face is how to achieve extremely high recall without compromising
precision, which is hard!

The key problem is performance. Imagine you have a DB with 10 million persons that you need to
match against 10 million from another list. You start with 10^7 * 10^7 = 10^14 comparisons; with
pure Edit Distance, for example, that would need a couple of centuries to finish. What you need
to do is define clever "blocking criteria" in order to reduce this O(n^2) complexity curse.
Lucene comes in handy for this.
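To make the blocking idea concrete, here is a minimal Python sketch. It is not our production
code: the blocking key (first letter of the last name token) and the sample names are assumptions
chosen only for illustration. Only records that share a key are ever compared.

```python
from collections import defaultdict


def blocking_key(name):
    """Hypothetical blocking key: first letter of the last token, lowercased."""
    tokens = name.lower().replace(",", " ").split()
    return tokens[-1][0] if tokens else ""


def candidate_pairs(list_a, list_b, key=blocking_key):
    """Yield only the pairs whose records fall into the same block."""
    blocks = defaultdict(list)
    for rec in list_b:
        blocks[key(rec)].append(rec)
    for rec in list_a:
        for other in blocks.get(key(rec), []):
            yield rec, other


list_a = ["John Smith", "Mary Jones", "William Brown"]
list_b = ["Jon Smith", "Maria Jonas", "Bill Browne", "Anna Schmidt"]

pairs = list(candidate_pairs(list_a, list_b))
print(len(pairs), "candidate pairs instead of", len(list_a) * len(list_b))
```

A single key like this one misses "Smith, John" versus "John Smith", so in practice you would
probably run several passes with different keys. An index lookup (for example a Lucene query)
can take the place of the `blocks` dictionary when the lists do not fit in memory.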

Another problem is fuzzy similarity: you need to create some kind of "index" for Edit Distance;
have a look at the LingPipe spell checker. Also, I guess you need to support synonyms like
William/Bill (exact, not fuzzy) and other semantic constraints that Edit Distance and similar
measures do not model.
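One way to picture such an "index" is a character bigram index that shortlists candidates, which
are then verified with a real edit distance. The Python sketch below is only an illustration
under my own assumptions (the nickname table, the `$` padding, the `min_shared` threshold and
the class name are all made up here); it is not how LingPipe does it, and the bigram filter is a
heuristic that can miss matches on very short tokens.

```python
from collections import defaultdict

# Hypothetical nickname table; a real one would be much larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}


def canonical(token):
    """Map a nickname to its canonical form (exact lookup, no fuzziness)."""
    token = token.lower()
    return NICKNAMES.get(token, token)


def bigrams(token):
    """Character bigrams of the token padded with '$' on both sides."""
    padded = "$" + token + "$"
    return {padded[i:i + 2] for i in range(len(padded) - 1)}


def levenshtein(a, b):
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute or keep
        prev = curr
    return prev[-1]


class BigramIndex:
    """Inverted index from bigram to tokens, used to shortlist candidates."""

    def __init__(self, tokens):
        self.postings = defaultdict(set)
        for tok in tokens:
            tok = canonical(tok)
            for gram in bigrams(tok):
                self.postings[gram].add(tok)

    def search(self, query, max_distance=1, min_shared=2):
        query = canonical(query)
        counts = defaultdict(int)
        for gram in bigrams(query):
            for tok in self.postings.get(gram, ()):
                counts[tok] += 1
        # Only tokens sharing enough bigrams get the expensive comparison.
        shortlist = [t for t, c in counts.items() if c >= min_shared]
        return sorted(t for t in shortlist
                      if levenshtein(query, t) <= max_distance)


index = BigramIndex(["William", "Willem", "John", "Jon", "Robert"])
print(index.search("Bill"))                   # ['william']
print(index.search("Jhon", max_distance=2))   # ['john', 'jon']
```

Note that "Bill" finds "william" through the exact nickname lookup, not through fuzziness, while
"Jhon" is resolved by edit distance alone.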

On the web:
- google for "Record Linkage"
- look at Cohen's SecondString project
- http://datamining.anu.edu.au/projects/linkage.html - they have a very nice Python prototype

Also search for the "Fellegi-Sunter" articles, as these are classics.
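For orientation, the Fellegi-Sunter model scores a candidate pair by summing, over the compared
fields, a log-likelihood ratio built from m (probability the field agrees for true matches) and
u (probability it agrees for non-matches). A rough Python sketch follows; the field names and
the m/u values are invented for illustration and would normally be estimated from your data.

```python
import math

# Hypothetical (m, u) probabilities per field:
#   m = P(field agrees | records are a true match)
#   u = P(field agrees | records are not a match)
FIELDS = {
    "surname":    (0.95, 0.01),
    "first_name": (0.90, 0.05),
    "birth_year": (0.98, 0.02),
}


def match_weight(agreements):
    """Sum of log2 likelihood ratios over all fields."""
    total = 0.0
    for field, (m, u) in FIELDS.items():
        if agreements[field]:
            total += math.log2(m / u)              # agreement weight
        else:
            total += math.log2((1 - m) / (1 - u))  # disagreement weight
    return total


all_agree = match_weight(
    {"surname": True, "first_name": True, "birth_year": True})
name_differs = match_weight(
    {"surname": True, "first_name": False, "birth_year": True})
print(round(all_agree, 2), round(name_differs, 2))
```

Pairs scoring above an upper threshold are treated as matches, pairs below a lower threshold as
non-matches, and the ones in between go to manual review.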

It is hard to do, but doable; we are doing it on lists of ca. 200 million entries.

Unfortunately, my company does not give back to the community as much as I would like.
Anyhow, I hope this helps.
 
>>
>> I was wondering if anyone has done people name matching using Lucene. For
>> example, I have a name coming from some external source that I would like to
>> match with the one I have in my DB. Let's say my DB contains the name "John
>> Smith". If the external source has something like "Smith John", "Smith,
>> John", "J. Smith", etc., I would like to rate this matching based on some %
>> of closeness for review later. I've searched around a bit for algorithms
>> and I kept seeing the Levenshtein distance algorithm, which I'm sure Lucene
>> uses under the hood. So I'm trying to gauge if Lucene is useful for doing
>> something as specific as this, or whether there are better algorithms and/or
>> software out there that do name matching. Thanks in advance!
>>
>> -los
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org