Return-Path: Delivered-To: apmail-lucene-java-user-archive@www.apache.org Received: (qmail 14075 invoked from network); 15 Nov 2008 00:55:14 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.2) by minotaur.apache.org with SMTP; 15 Nov 2008 00:55:14 -0000 Received: (qmail 54094 invoked by uid 500); 15 Nov 2008 00:55:15 -0000 Delivered-To: apmail-lucene-java-user-archive@lucene.apache.org Received: (qmail 54070 invoked by uid 500); 15 Nov 2008 00:55:14 -0000 Mailing-List: contact java-user-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: java-user@lucene.apache.org Delivered-To: mailing list java-user@lucene.apache.org Received: (qmail 54059 invoked by uid 99); 15 Nov 2008 00:55:14 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Nov 2008 16:55:14 -0800 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: local policy) Received: from [206.190.39.227] (HELO web50712.mail.re2.yahoo.com) (206.190.39.227) by apache.org (qpsmtpd/0.29) with SMTP; Sat, 15 Nov 2008 00:53:53 +0000 Received: (qmail 21908 invoked by uid 60001); 15 Nov 2008 00:53:36 -0000 DomainKey-Signature: a=rsa-sha1; q=dns; c=nofws; s=s1024; d=yahoo.com; h=X-YMail-OSG:Received:X-Mailer:References:Date:From:Subject:To:MIME-Version:Content-Type:Content-Transfer-Encoding:Message-ID; b=I/olRp7yxag0NBFvHrJZeGn7a+xpeoP8eDWmrDwQoFvKPUXVhPm9GsD1LvIF/OL8U/c8f1x20urQLY1DMAHWY2Wq+qlGc1/h5HokwJcjGaglLIVMwLZmOkYMOogYgd/Pu9p7f/i0tIz33YXXl286aZttuWnkqLqJtgOBZdQrbL4=; X-YMail-OSG: opBiptUVM1mX1BMDJ_Uq72zynJTBWMcTOdk972L4Z45obCpEDaCTGrbojITX2g_EbBRshkrG9K1cQR8wosO_64piEMvEFmTwtsNRaqsu2Kw2afCJUZgp5R3EZWiGr4KAJeNxVoCCTV2hI.YaSw15z1O8fatTPHl8C2LAMWJCZfmnan1NRjmx7CA1ogKF Received: from [98.216.104.214] by web50712.mail.re2.yahoo.com via HTTP; Fri, 14 Nov 2008 16:53:35 PST X-Mailer: YahooMailRC/1155.32 YahooMailWebService/0.7.260.1 References: <20505283.post@talk.nabble.com> <491DDF01.9080407@gmail.com> <535223.79139.qm@web50711.mail.re2.yahoo.com> Date: Fri, 14 Nov 2008 16:53:35 -0800 (PST) From: Aaron Schon Subject: Re: Extremely Large Strings Comparison (slightly off-topic) To: java-user@lucene.apache.org MIME-Version: 1.0 Content-Type: text/plain; charset=iso-8859-1 Content-Transfer-Encoding: quoted-printable Message-ID: <943297.20398.qm@web50712.mail.re2.yahoo.com> X-Virus-Checked: Checked by ClamAV on apache.org Thanks for responding Jonathan. I will look into k-grams approach. =0A=0ATh= e objects could differ by small local changes. To provide some business con= text, the application requires indexing email messages and attachments. If = the attachments differ by some threshold (users making edits/reviews), the = attachment needs to be flagged and major/minor versioned.=0A=0AThanks=0AAS= =0A=0A=0A----- Original Message ----=0AFrom: Jonathan Young =0ATo: Aaron Schon =0ASent: Friday, November 14= , 2008 5:52:17 PM=0ASubject: RE: Extremely Large Strings Comparison (slight= ly off-topic)=0A=0AAaron - Although a na=EFve implementation of a Levenshte= in distance metric takes O(n*m) time, if you are willing to bound the maxim= um distance by k << n,m (e.g. if you aren't interested in distances greater= than k) then the distance calculation can take O(k*min(n,m)).=0A=0AAnother= approach would be to use shingles or k-grams from the original document. = I believe Lucene has some support for that.=0A=0AIt depends a lot on the wa= ys in which you expect the two strings to differ. The fact that it is Base= 64-encoded MIME content doesn't matter - are they actually objects which ar= e going to have small local changes, or are there likely to be changes whic= h cascade (e.g. if the data is compressed).=0A=0AI hope this helps,=0A=0A--= - Jonathan=0A=0A-----Original Message-----=0AFrom: Aaron Schon [mailto:aaro= n_schon@yahoo.com]=0ASent: Friday, November 14, 2008 5:00 PM=0ATo: java-use= r@lucene.apache.org=0ASubject: Extremely Large Strings Comparison (slightly= off-topic)=0A=0Ahi I need to compare two Base64 representation strings of = some MIME content that I am storing within a Lucene index. I need to effici= ently compare them to find the closest match to a query Base64 string , pos= t Lucene query.=0A=0AI am not sure of the best way to approach this, could = I compare the hashes and compute their similarity? Levenshtein distance see= ms hard because of the size of ths strings and seems inefficient? Is there = any other method you could suggest?=0A=0An.b: The idea is to not to determi= ne exact match or not, it is to compute a similarity metric. for example=0A= =0AJohn & Johnson (closer)=0Avs,=0AJohn & Jimmy (farther)=0A=0Atia,=0AAaron= =0A=0A=0A=0A=0A------------------------------------------------------------= ---------=0ATo unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org= =0AFor additional commands, e-mail: java-user-help@lucene.apache.org=0A=0A= =0A --------------------------------------------------------------------- To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org For additional commands, e-mail: java-user-help@lucene.apache.org