lucene-java-user mailing list archives

From Grant Ingersoll <grant.ingers...@gmail.com>
Subject Re: Lucene for name matching
Date Fri, 06 Apr 2007 12:14:41 GMT
I agree, SecondString was helpful to me. Also have a look at William
Winkler's work at the US Census Bureau. We did similar things to come up
with blocking criteria to get an initial division into duplicates,
unique and undecided. Then we refined the undecided set. No
approach is going to be perfect, and you have to make decisions about
time spent versus quality.
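
The three-way split described above can be sketched with two thresholds on a similarity score. This is only a minimal illustration, not code from any project mentioned in this thread: the scorer (Python's difflib SequenceMatcher), the threshold values 0.9 and 0.6, and the names `similarity` and `classify` are all assumptions made for the example.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Return a similarity in [0, 1]; SequenceMatcher is only a stand-in scorer."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def classify(a: str, b: str, upper: float = 0.9, lower: float = 0.6) -> str:
    """Assign a candidate pair to one of three sets using two assumed thresholds."""
    score = similarity(a, b)
    if score >= upper:
        return "duplicate"
    if score < lower:
        return "unique"
    return "undecided"  # left for a slower, more careful second pass
```

With these illustrative thresholds, an exact match lands in the duplicate set, while a near miss such as "John Smith" vs "Jon Smyth" lands in the undecided set and would be refined later. Where to put the thresholds is exactly the time-versus-quality decision mentioned above.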


On Apr 6, 2007, at 4:26 AM, eks dev wrote:

> I've been doing this for the past couple of years, and yes, we use
> Lucene for some key parts of the problem.
> Basically, the problem you face is how to get extremely high
> recall without compromising precision. Hard!
>
> The key problem is performance. Imagine you have a DB with 10 million
> persons that you need to match against 10 million from another list.
> You start with 10^7 * 10^7 = 10^14 comparisons; with pure edit
> distance, for example, it would need a couple of centuries to finish.
> What you need to do is define clever "blocking criteria" in order to
> reduce this O(n^2) complexity curse. Lucene comes in handy for this.
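
To make the blocking idea concrete, here is a small Python sketch. It is not the approach used by anyone in this thread: the blocking key (first letter of the last name token) and the names `blocking_key` and `candidate_pairs` are hypothetical. The in-memory dictionary stands in for the role an index such as Lucene could play, namely returning only the records that share a key with the query record.

```python
from collections import defaultdict


def blocking_key(name: str) -> str:
    """Hypothetical blocking key: first letter of the last whitespace-separated token."""
    tokens = name.lower().split()
    return tokens[-1][0] if tokens else ""


def candidate_pairs(list_a, list_b):
    """Yield only the pairs whose records share a blocking key."""
    blocks = defaultdict(list)
    for name in list_b:
        blocks[blocking_key(name)].append(name)
    for name in list_a:
        # Records in other blocks are never compared, which is where the saving comes from.
        for other in blocks.get(blocking_key(name), []):
            yield name, other


list_a = ["John Smith", "Mary Jones", "Ann Lee"]
list_b = ["Jon Smyth", "M. Jones", "Bob Stone", "Al Young"]
pairs = list(candidate_pairs(list_a, list_b))
print(len(pairs), "candidate pairs instead of", len(list_a) * len(list_b))
```

The expensive comparison then runs only on the candidate pairs. The price is recall: with this particular key, "Smith John" would fall into a different block than "John Smith", so a real system would need more forgiving keys, or several of them.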
>
> Another problem in this game is fuzzy similarity: you somehow need
> to create a kind of "index" for edit distance; have a look at the
> LingPipe spell checker. Also, I guess you need to support
> synonyms like William/Bill (not fuzzy) and other semantic
> constraints not modelled by edit-distance-like measures.
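
A short Python sketch of why synonyms need separate handling. The Levenshtein function is the textbook dynamic-programming version; the nickname table and the names `NICKNAMES`, `canonical` and `first_names_match` are hypothetical and only illustrate the idea of mapping known variants to one canonical form before any fuzzy comparison.

```python
def levenshtein(a: str, b: str) -> int:
    """Textbook dynamic-programming edit distance (insert, delete, substitute)."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # delete ca
                               current[j - 1] + 1,       # insert cb
                               previous[j - 1] + cost))  # substitute or keep
        previous = current
    return previous[-1]


# Hypothetical nickname table; a real one would be far larger.
NICKNAMES = {"bill": "william", "will": "william", "bob": "robert"}


def canonical(first_name: str) -> str:
    """Map a known nickname to its canonical form, otherwise just lowercase."""
    key = first_name.lower()
    return NICKNAMES.get(key, key)


def first_names_match(a: str, b: str, max_edits: int = 1) -> bool:
    """Compare canonical forms, allowing a small (assumed) number of edits."""
    return levenshtein(canonical(a), canonical(b)) <= max_edits
```

The edit distance between "william" and "bill" is 4, so no sensible fuzzy threshold would pair them; the lookup table does. "Jon" and "John" are one edit apart and match through the fuzzy path instead.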
>
> web:
> - google for "Record Linkage"
> - look at Cohen's SecondString project
> - http://datamining.anu.edu.au/projects/linkage.html - they have a
>   very nice Python prototype
>
> Search for "Fellegi-Sunter" articles, as these are classics.
>
> It is hard to do, but doable; we are doing it on lists of ca. 200
> million.
>
> Unfortunately, my company does not give back to the community as
> much as I would like.
> Anyhow, I hope this helps.
>
>>>
>>> I was wondering if anyone has done people name matching using Lucene.
>>> For example, I have a name coming from some external source that I
>>> would like to match with the one I have in my DB. Let's say my DB
>>> contains the name "John Smith". If the external source has something
>>> like "Smith John", "Smith, John", "J. Smith", etc., I would like to
>>> rate this matching based on some % of closeness for review later.
>>> I've searched around a bit for algorithms and I kept seeing the
>>> Levenshtein distance algorithm, which I'm sure Lucene uses under the
>>> hood. So I'm trying to gauge whether Lucene is useful for doing
>>> something as specific as this, or whether there are better algorithms
>>> and/or software out there that do name matching. Thanks in advance!
>>>
>>> -los
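
For the variants listed in the question, a token-based score is one possible starting point. The Python sketch below is an assumption-laden illustration rather than anything Lucene does internally: the function names, the greedy matching, and the 0.5 weight given to an initial are all made up for the example.

```python
import re


def name_tokens(name: str) -> list:
    """Lowercase and keep only runs of letters, so "Smith, John" gives ["smith", "john"]."""
    return re.findall(r"[^\W\d_]+", name.lower())


def token_score(a: str, b: str) -> float:
    """1.0 for identical tokens, 0.5 (assumed weight) when one is an initial of the other."""
    if a == b:
        return 1.0
    short, long_ = sorted((a, b), key=len)
    if len(short) == 1 and long_.startswith(short):
        return 0.5
    return 0.0


def closeness(name_a: str, name_b: str) -> float:
    """Greedy order-independent token matching, normalised by the longer name."""
    tokens_a, remaining = name_tokens(name_a), name_tokens(name_b)
    if not tokens_a or not remaining:
        return 0.0
    longest = max(len(tokens_a), len(remaining))
    total = 0.0
    for token in tokens_a:
        best = max(remaining, key=lambda other: token_score(token, other), default=None)
        if best is not None and token_score(token, best) > 0:
            total += token_score(token, best)
            remaining.remove(best)  # each token may be used only once
    return total / longest
```

Against "John Smith", both "Smith John" and "Smith, John" score 1.0 because token order and punctuation are ignored, while "J. Smith" scores 0.75 because the initial counts for half. Misspellings are not covered here; an edit-distance measure would have to replace the exact token comparison for that.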
>>> -- 
>>> View this message in context:
>>> http://www.nabble.com/Lucene-for-name-matching-tf3533454.html#a9862342
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> Center for Natural Language Processing
>> http://www.cnlp.org
>>
>> Read the Lucene Java FAQ at
>> http://wiki.apache.org/jakarta-lucene/LuceneFAQ

------------------------------------------------------
Grant Ingersoll
http://www.grantingersoll.com/
http://lucene.grantingersoll.com
http://www.paperoftheweek.com/


