From: Grant Ingersoll
To: java-user@lucene.apache.org
Subject: Re: Using Lucene for Moderate Similarity Check..
Date: Tue, 9 Jun 2009 10:34:01 -0400

Hi Ravi,

Lucene can enable this, but you will have some work to do on top of it.
If you search the archives for record linkage
(http://www.lucidimagination.com/search/?q=record+linkage) you will find a
fair amount of discussion on this. Also, in somewhat shameless marketing
mode, my co-author, Tom Morton, is just putting the finishing touches on a
chapter in our book called Taming Text (http://www.manning.com/ingersoll)
which discusses some of the techniques involved in making this stuff happen.
That chapter should be released in the next few weeks.

You likely can get a basic system working pretty quickly with what is in
Lucene, but then the next level is often more difficult. You often end up
with a rules system that can become brittle with this approach. An
alternative is to apply some type of machine learning approach. You could
also look at this as a clustering problem, which Mahout (or other clustering
tools) could be helpful in solving.

Finally, just know there will be a human in the loop with any approach. The
goal is to minimize the number of matches that a person has to check.

Hope this helps,
Grant

On Jun 9, 2009, at 4:16 AM, RaviK Thakur wrote:

> Hello All,
> I want to check the feasibility of using Lucene for similarity check
> between the two flat csv files. The actual requirement is like this: We
> have two files each containing the information of customers like their
> name, address, pin code etc. Some customers may be in common in both the
> files. We want to find the customer that are common in these files. But the
> match should be on attribute basis. If the name of the customer matches in
> one file to the name of the customer in another file, then match the
> address, if it matches then match pin code and so on. But the main
> consideration is that this matching is not exact. If the name of customer
> matches say 80% then it may be termed as match. For example, if ABDUL is
> matched with ABDULLAH, it should be termed as a match.
> In this fashion each record of one file will be matched with each record
> of another file. The output of this procedure will be another file
> containing the matched record.
>
> Can anyone please suggest the applicability of lucene for this requirement.
> May in the form of Pros n Cons.
>
> Thanks in advance:-)
> Ravi
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
Solr/Lucene: http://www.lucidimagination.com/search
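
For illustration only, here is a rough sketch of the attribute-by-attribute
matching described in the quoted question. It is not Lucene code and not a
method from the thread: the field names, the thresholds, and the choice of
Python's difflib ratio as the similarity measure are all assumptions. Note
that with this particular measure ABDUL vs. ABDULLAH scores about 0.77, so a
strict 80% cutoff would reject it; the threshold has to be tuned to whatever
measure is actually used.

```python
from difflib import SequenceMatcher

# Hypothetical field order and per-field thresholds (assumptions, not from
# the thread). Fields are checked in this order: name, then address, then pin.
FIELDS = ("name", "address", "pin")
THRESHOLDS = {"name": 0.75, "address": 0.75, "pin": 1.0}


def similarity(a, b):
    """Return a 0..1 similarity score, ignoring case and outer whitespace."""
    return SequenceMatcher(None, a.strip().upper(), b.strip().upper()).ratio()


def is_match(rec_a, rec_b):
    """Compare field by field; all() stops at the first field that fails."""
    return all(
        similarity(rec_a[f], rec_b[f]) >= THRESHOLDS[f] for f in FIELDS
    )


def candidate_matches(file_a, file_b):
    """Compare every record of one file with every record of the other."""
    return [(a, b) for a in file_a for b in file_b if is_match(a, b)]


if __name__ == "__main__":
    file_a = [{"name": "ABDUL", "address": "12 Main Street", "pin": "110001"}]
    file_b = [
        {"name": "ABDULLAH", "address": "12 Main St", "pin": "110001"},
        {"name": "RAVI", "address": "12 Main St", "pin": "110001"},
    ]
    for a, b in candidate_matches(file_a, file_b):
        print(a["name"], "<->", b["name"])
```

The all-pairs loop is only workable for small files. For larger ones, one
file would probably be indexed first so that each record is compared against
a short list of retrieved candidates, and the resulting pairs would still go
to a person for review.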