lucene-java-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Grant Ingersoll <>
Subject Re: Using Lucene for Moderate Similarity Check..
Date Tue, 09 Jun 2009 14:34:01 GMT
Hi Ravi,

Lucene can enable this, but you will have some work to do on top of  
it.  If you search the archives for record linkage (

) you will find a fair amount of discussion on this.  Also, in  
somewhat shameless marketing mode, my co-author, Tom Morton, is just  
putting the finishing touches on a chapter in our book called Taming  
Text ( which discusses some of the  
techniques involved in making this stuff happen.  That chapter should  
be released in the next few weeks.

You likely can get a basic system working pretty quickly with what is  
in Lucene, but then the next level is often more difficult.  You often  
end up with a rules system that can become brittle with this  
approach.  An alternative is to apply some type of machine learning  
approach.  You could also look at this as a clustering problem, which  
Mahout (or other clustering tools) could be helpful in solving.

Finally, just know there will be a human in the loop with any  
approach.  The goal is to minimize the number of matches that a person  
has to check.

Hope this helps,

On Jun 9, 2009, at 4:16 AM, RaviK Thakur wrote:

> Hello All,
>      I want to check the feasibility of using Lucene for similarity  
> check
> between the two flat csv files. The actual requirement is like this:  
> We
> have two files each containing the information of customers like their
> name, address, pin code etc. Some customers may be in common in both  
> the
> files. We want to find the customer that are common in these files.  
> But the
> match should be on attribute basis. If the name of the customer  
> matches in
> one file to the name of the customer in another file, then match the
> address, if it matches then match pin code and so on. But the main
> consideration is that this matching is not exact. If the name of  
> customer
> matches say 80% then it may be termed as match. For example, if  
> ABDUL is
> matched with ABDULLAH, it should be termed as a match. In this  
> fashion each
> record of one file will be matched with each record of another file.  
> The
> output of this procedure will be another file containing the matched
> record.
> Can anyone please suggest the applicability of lucene for this  
> requirement.
> May in the form of Pros n Cons.
> Thanks in advance:-)
> Ravi
> ______________________________________________________________________
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

Grant Ingersoll

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message