Return-Path: Delivered-To: apmail-lucene-dev-archive@www.apache.org Received: (qmail 52734 invoked from network); 27 Dec 2010 11:27:15 -0000 Received: from unknown (HELO mail.apache.org) (140.211.11.3) by 140.211.11.9 with SMTP; 27 Dec 2010 11:27:15 -0000 Received: (qmail 94033 invoked by uid 500); 27 Dec 2010 11:27:14 -0000 Delivered-To: apmail-lucene-dev-archive@lucene.apache.org Received: (qmail 93923 invoked by uid 500); 27 Dec 2010 11:27:14 -0000 Mailing-List: contact dev-help@lucene.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@lucene.apache.org Delivered-To: mailing list dev@lucene.apache.org Received: (qmail 93916 invoked by uid 99); 27 Dec 2010 11:27:13 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 27 Dec 2010 11:27:13 +0000 X-ASF-Spam-Status: No, hits=1.0 required=10.0 tests=RCVD_DOUBLE_IP_LOOSE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: local policy) Received: from [87.193.196.100] (HELO mail1.postdirekt.de) (87.193.196.100) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 27 Dec 2010 11:27:05 +0000 Received: from S03BO006.postdirekt.loc (unknown [172.22.233.4]) by mail1.postdirekt.de (Postfix) with ESMTP id A24F516FB92 for ; Mon, 27 Dec 2010 12:26:45 +0100 (CET) Received: from [172.22.233.4] by [87.193.196.98] with ESMTP; Mon, 27 Dec 2010 12:40:50 +0100 X-MimeOLE: Produced By Microsoft Exchange V6.5 Content-class: urn:content-classes:message MIME-Version: 1.0 Content-Type: text/plain; charset="iso-8859-1" Content-Transfer-Encoding: quoted-printable Subject: PS: Improving String Distance calculation performance Date: Mon, 27 Dec 2010 12:26:43 +0100 Message-ID: X-MS-Has-Attach: X-MS-TNEF-Correlator: Thread-Topic: PS: Improving String Distance calculation performance Thread-Index: AculsK3zDtLlY2AYSXu28SueQWf8EwAASVdAAAG5r0A= References: <4D1868B9.2070005@vtc.com> From: "Biedermann,S.,Fa. Post Direkt" To: X-Virus-Checked: Checked by ClamAV on apache.org Ups... I forgot to say, that the candiate only works if left.length() = <=3D right.length() ! -----Urspr=FCngliche Nachricht----- Von: Biedermann,S.,Fa. Post Direkt=20 Gesendet: Montag, 27. Dezember 2010 12:23 An: 'dev@lucene.apache.org' Betreff: Improving String Distance calculation performance Hi, this is my first post to this mailing list, so I first want to say hello = to all of you! You are doing a great job=20 In org.apache.lucene.search.FuzzyTermEnum I found an optimised = implementation of the Levenstein-Algorithms which makes use of the fact = that the algorithm can be aborted if a given minimum similarity cannot = be reached anymore. I isolated that algorithm into a subclass of = org.apache.lucene.spell.StringDistance, since we usually can make use of = this optimisation. With our current miminum similarity setting of 0.75 this algorithm needs = against our test data only about 22% of run time compared to the = original algorithm from the spell package. With a further optimisation candidate (see below) the runtime can be = further reduced by another third to only 14% of original Levenstein. So, my first question is: is it worth adding a further method to the = StringDistance-Interface: float getDistance(String left, String right, float minimumSimilarity) so that applications can make use of possible optimisations = (StringDistance-Implementations without optimisations would just skip = the minimSimilarity parameter)? The idea of the optimsation candidate is about calculating only those = fields in the "virtual" matrix that are near its diagonal. It is only a candidants since we have not prooven it to work. But with = all our test data (0.5 billion comparisons) there is no difference to = the original algorithm. Do you have any counter examples? Since this is my first post, is this the right mailing list? Best Regards, Sven Here is the code taken from FuzzyTermEnum with some modfications (p and = d are initialised somewhere else): public float getDistance(final String left, final String right, = float minimumSimilarity) { final int m =3D right.length(); final int n =3D left.length(); final int maxLength =3D Math.max(m, n); if (n =3D=3D 0) { //we don't have anything to compare. That means if we just = add //the letters for m we get the new word return (m =3D=3D 0) ? 1f : 0f; } if (m =3D=3D 0) { return 0f; } // be patient with rounding errors (1.0000001f instead of 1f). final int maxDistance =3D (int) ((1.0000001f-minimumSimilarity) = * maxLength); if (maxDistance < Math.abs(m-n)) { //just adding the characters of m to n or vice-versa results = in //too many edits //for example "pre" length is 3 and "prefixes" length is 8. = We can see that //given this optimal circumstance, the edit distance cannot be = less than 5. //which is 8-3 or more precisely Math.abs(3-8). //if our maximum edit distance is 4, then we can discard this = word //without looking at it. return 0.0f; } =20 // if no edits are allowed, strings must be equal=20 if (maxDistance =3D=3D 0) return left.equals(right) ? 1f : 0f; // init matrix d for (int i =3D 0; i<=3Dn; i++) { p[i] =3D i; } // start computing edit distance for (int j =3D 1; j<=3Dm; j++) { // iterates through target int bestPossibleEditDistance =3D m; final char t_j =3D right.charAt(j-1); // jth character of t d[0] =3D j; //-------> here is the optimisation candiates //only iterate through a maxDistance corridor final int startAt =3D Math.max(1, j - maxDistance ); final int finishAt =3D Math.min(n, maxDistance - 1 + j); =20 for (int i=3DstartAt; i<=3DfinishAt; ++i) { // iterates = through text //-------- // minimum of cell to the left+1, to the top+1, diagonally = left and up +(0|1) final char t_i =3D left.charAt(i-1); =20 if (t_j !=3D t_i) { d[i] =3D Math.min(Math.min(d[i-1], p[i]), p[i-1]) + 1; } else { d[i] =3D Math.min(Math.min(d[i-1], p[i]) + 1, p[i-1]); } bestPossibleEditDistance =3D = Math.min(bestPossibleEditDistance, d[i]); } //After calculating row i, the best possible edit distance //can be found by found by finding the smallest value in a = given column. //If the bestPossibleEditDistance is greater than the max = distance, abort. if (j > maxDistance && bestPossibleEditDistance > maxDistance) = { //equal is okay, but not greater //the closest the target can be to the text is just too far = away. //this target is leaving the party early. return 0.0f; } // copy current distance counts to 'previous row' distance = counts: swap p and d int _d[] =3D p; p =3D d; d =3D _d; } // our last action in the above loop was to switch d and p, so p = now // actually has the most recent cost counts // this will return less than 0.0 when the edit distance is // greater than the number of characters in the shorter word. // but this was the formula that was previously used in = FuzzyTermEnum, // so it has not been changed (even though minimumSimilarity = must be // greater than 0.0) return 1.0f - ((float)p[n] / (float) (maxLength)); } --------------------------------------------------------------------- To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org For additional commands, e-mail: dev-help@lucene.apache.org