From: Karl Wettin
To: java-user@lucene.apache.org
Subject: Re: Similarity percentage between two Strings
Date: Thu, 4 Sep 2008 15:18:32 +0200

I would create 1-5 ngram sized shingles and measure the distance using
the Tanimoto coefficient. That would probably work out just fine. You
might want to add more weight the greater the size of the shingle.

There are shingle filters in lucene/java/contrib/analyzers and there is
a Tanimoto distance in lucene/mahout/.

Feel free to report back on how well it works.


     karl

On 4 Sep 2008, at 00:58, Thiago Moreira wrote:

>
>   Well, the similar definition that I'm looking for is the number
> 2, maybe the number 3, but to start the number 2 is enough. If you
> guys think that is not a Lucene problem what else tool can I use to
> implement this requirement??
>
> Thanks
> Thiago Moreira
> Software Engineer
> tmoreira@liferay.com
> Liferay, Inc.
> Enterprise. Open Source. For Life.
>
>
> N. Hira wrote:
>>
>> I don't know how much of this is a Lucene problem, but -- as I'm
>> sure you will inevitably hear from others on the list -- it depends
>> on what your definition of "similar" is.
>>
>> By similar, do you mean:
>> 1. Identical, except for variations in case (upper/lower)
>> 2. Allow 1., but also allow prefixes/suffixes (e.g., "FW: " or
>> "... (summary")
>> 3. Allow 1., 2. and permit some new terms ... how many?
>> 4. Allow all of the above and allow some changes to terms using
>> stemming (E.g., "Google releases Chrome" is similar to "Google
>> announces the release of its new Chrome web browser")
>> ....
>>
>> I'm sure you see where this is going. So ... how do you define
>> similar?
>>
>> Good luck!
>>
>> -h
>> ----------------------------------------------------------------------
>> Hira, N.R.
>> Cognocys, Inc.
>>
>> On 03-Sep-2008, at 2:52 PM, Thiago Moreira wrote:
>>
>>>
>>> Hey all,
>>>
>>> I want to know how much two Strings are similar! The thing is:
>>> I'm processing an email box and I want to group all messages that
>>> have the subject similar, makes sense?? I looked on the
>>> documentation but I didn't find how to accomplish this. It's not
>>> necessary add the messages or the subjects on some kind of index.
>>> I'm using 2.3.2 version of Lucene.
>>>
>>> Anyone has some idea?
>>>
>>> Thanks in advance.
>>> --
>>> Thiago Moreira
>>> Software Engineer
>>> tmoreira@liferay.com
>>> Liferay, Inc.
>>> Enterprise. Open Source. For Life.
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
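
For anyone reading this thread in the archive, here is a rough sketch of the shingle-plus-Tanimoto idea from Karl's reply, in plain Java with no Lucene or Mahout dependency. Several things in it are assumptions rather than anything stated in the thread: the class and method names are made up for illustration, shingles are taken over whitespace-separated words (they could just as well be character n-grams), case is folded first, and "more weight the greater the size of the shingle" is read as weight = number of words in the shingle. Two empty inputs are scored as 0.0, which is also an arbitrary choice.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;

// Hypothetical helper, not part of Lucene or Mahout.
final class SubjectSimilarity {

    // Assumed upper bound on shingle size, following the "1-5" in the reply.
    static final int MAX_SHINGLE_SIZE = 5;

    // Maps every word shingle of 1..maxSize consecutive tokens to its weight.
    // Assumption: the weight is simply the number of words in the shingle.
    static Map<String, Integer> shingles(String text, int maxSize) {
        Map<String, Integer> result = new HashMap<>();
        String normalized = text.trim().toLowerCase(Locale.ROOT);
        if (normalized.isEmpty()) {
            return result;
        }
        String[] tokens = normalized.split("\\s+");
        for (int size = 1; size <= maxSize; size++) {
            for (int start = 0; start + size <= tokens.length; start++) {
                String shingle = String.join(" ",
                        Arrays.copyOfRange(tokens, start, start + size));
                result.put(shingle, size);
            }
        }
        return result;
    }

    // Weighted Tanimoto (Jaccard) coefficient over the two shingle sets:
    // weight of shared shingles divided by weight of all distinct shingles.
    static double similarity(String a, String b) {
        Map<String, Integer> left = shingles(a, MAX_SHINGLE_SIZE);
        Map<String, Integer> right = shingles(b, MAX_SHINGLE_SIZE);
        long shared = 0;
        long total = 0;
        for (Map.Entry<String, Integer> e : left.entrySet()) {
            total += e.getValue();
            if (right.containsKey(e.getKey())) {
                shared += e.getValue();
            }
        }
        for (Map.Entry<String, Integer> e : right.entrySet()) {
            if (!left.containsKey(e.getKey())) {
                total += e.getValue();
            }
        }
        // Nothing to compare (both inputs empty): report 0.0 by convention.
        return total == 0 ? 0.0 : (double) shared / total;
    }

    public static void main(String[] args) {
        System.out.println(similarity("FW: Quarterly report", "Quarterly report"));
        System.out.println(similarity("Hello World", "hello world"));
    }
}
```

Under this weighting, "FW: Quarterly report" against "Quarterly report" scores 0.4, since the shared shingles weigh 4 out of a combined 10. A grouping threshold for the prefix/suffix case (definition 2 in the quoted message) would have to be tuned on real subjects, and stripping known prefixes such as "FW:" before shingling would likely raise the scores.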