Return-Path: Delivered-To: apmail-jakarta-commons-dev-archive@www.apache.org Received: (qmail 22514 invoked from network); 22 Oct 2003 22:52:21 -0000 Received: from daedalus.apache.org (HELO mail.apache.org) (208.185.179.12) by minotaur-2.apache.org with SMTP; 22 Oct 2003 22:52:21 -0000 Received: (qmail 2982 invoked by uid 500); 22 Oct 2003 22:52:03 -0000 Delivered-To: apmail-jakarta-commons-dev-archive@jakarta.apache.org Received: (qmail 2802 invoked by uid 500); 22 Oct 2003 22:52:02 -0000 Mailing-List: contact commons-dev-help@jakarta.apache.org; run by ezmlm Precedence: bulk List-Unsubscribe: List-Subscribe: List-Help: List-Post: List-Id: "Jakarta Commons Developers List" Reply-To: "Jakarta Commons Developers List" Delivered-To: mailing list commons-dev@jakarta.apache.org Received: (qmail 2789 invoked from network); 22 Oct 2003 22:52:02 -0000 Received: from unknown (HELO snowtide.com) (209.61.175.250) by daedalus.apache.org with SMTP; 22 Oct 2003 22:52:02 -0000 Received: from snowtide.com (snowtide.com [66.216.95.10]) (authenticated) by snowtide.com (8.11.6/8.11.6) with ESMTP id h9N0JYH09174 for ; Wed, 22 Oct 2003 19:19:34 -0500 Date: Wed, 22 Oct 2003 18:52:06 -0400 Subject: Re: [lang] [PATCH] StringUtils.getLevenshteinDistance() fix for OutOfMemoryError when used with LONG strings Content-Type: text/plain; delsp=yes; charset=US-ASCII; format=flowed Mime-Version: 1.0 (Apple Message framework v552) From: Chas Emerick To: "Jakarta Commons Developers List" Content-Transfer-Encoding: 7bit In-Reply-To: Message-Id: <5C6FDDC0-04E2-11D8-AF96-0003931046C4@snowtide.com> X-Mailer: Apple Mail (2.552) X-Spam-Rating: daedalus.apache.org 1.6.2 0/1000/N X-Spam-Rating: minotaur-2.apache.org 1.6.2 0/1000/N Sure...although there's not much I can add with regard to the simpler test cases, I could use RandomStringUtils to generate some long strings to test behaviour that the patch addresses. The downside is that the algorithm is fundamentally quadratic, so doing such a test would significantly extend the time required to complete the java-lang tests (i.e. ~7 minutes to calculate the character-level edit distance of two strings of 50K each). If you think that's OK, I'll submit a patch to the test class. Chas Emerick | cemerick@snowtide.com http://www.snowtide.com Snowtide Informatics Systems, Inc. On Wednesday, October 22, 2003, at 04:53 PM, Henri Yandell wrote: > > Cool. Sounds good and I'll have a go at applying your patch. > > Any chance of you submitting your other unit tests for the class? > > Hen > > On Wed, 22 Oct 2003, Chas Emerick wrote: > >> I've never submitted a patch for an open-source project before (never >> got around to it lo these many years I guess), so I apologize for any >> errors in form or convention that I commit. >> >> I was planning on using the >> StringUtils.getLevenshteinDistance(String,String) method in a >> particular circumstance where I needed to get the edit distance >> between >> two large strings (anywhere between 10K - 500K characters each). >> However, I found that calling that method (as of v2.0 of commons-lang) >> would result in an OutOfMemoryError when strings of such lengths were >> provided. >> >> A quick look at the source revealed that the current implementation >> (which uses the sample impl. at http://www.merriampark.com/ld.htm) >> creates a matrix with dimensions corresponding to the lengths of the >> two strings provided. Clearly, a 100K x 100K int[] is problematic. >> >> Therefore, I've (mostly) rewritten the method using a two int arrays >> of >> the size of the first string's length, and the method now works >> properly with larger strings (beastly slow, but that's quadratic >> algorithms for you) . I've tested the new implementation against the >> ten or so testcases mentioned in the javadocs, as well as another >> half-dozen of my own, and everything looks good. >> >> If my mail client botches the patch diff, you can get it at >> http://www.snowtide.com/commons-lang-LDpatch.txt >> >> Chas Emerick | cemerick@snowtide.com >> >> http://www.snowtide.com >> Snowtide Informatics Systems, Inc. >> >> ====================================================================== >> == >> == >> --- StringUtils.java.old Wed Oct 22 12:58:04 2003 >> +++ StringUtils.java Wed Oct 22 13:51:36 2003 >> @@ -4255,8 +4255,8 @@ >> * another, where each change is a single character modification >> (deletion, >> * insertion or substitution).

>> * >> - *

This implementation of the Levenshtein distance algorithm >> - * is from > href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ >> ld. >> htm

>> + *

This implementation of the Levenshtein distance algorithm >> was originally based >> upon the one >> + * presented at > href="http://www.merriampark.com/ld.htm">http://www.merriampark.co >> m/ld.htm

>> * >> *
>>        * StringUtils.getLevenshteinDistance(null, *)             =
>> IllegalArgumentException
>> @@ -4281,76 +4281,64 @@
>>           if (s == null || t == null) {
>>               throw new IllegalArgumentException("Strings must not be
>> null");
>>           }
>> -        int d[][]; // matrix
>> -        int n; // length of s
>> -        int m; // length of t
>> -        int i; // iterates through s
>> -        int j; // iterates through t
>> -        char s_i; // ith character of s
>> -        char t_j; // jth character of t
>> -        int cost; // cost
>> -
>> -        // Step 1
>> -        n = s.length();
>> -        m = t.length();
>> +
>> +               /*
>> +               The difference between this impl. and the previous is
>> that, rather than cr
>> eating and retaining a matrix of size
>> +               s.length()+1 by t.length()+1, we maintain two
>> single-dimensional arrays of
>>   length s.length()+1.  The first, d,
>> +               is the 'current working' distance array that maintains
>> the newest distance
>>   cost counts as we iterate through
>> +               the characters of String s.  Each time we increment  
>> the
>> index of String t
>> we are comparing, d is copied to p,
>> +               the second int[].  Doing so allows us to retain the
>> previous cost counts a
>> s required by the algorithm
>> +               (taking the minimum of the cost count to the left, up
>> one, and diagonally
>> up and to the left of the current
>> +               cost count being calculated).  (Note that the arrays
>> aren't really copied
>> anymore, just switched...this is clearly much
>> +               better than cloning an array or doing a
>> System.arraycopy() each time throu
>> gh the outer loop.)
>> +
>> +               Effectively, the difference between the two
>> implementations is this one do
>> es not cause an out of memory condition
>> +               when calculating the LD over two very large strings.
>> +               */
>> +
>> +        int n = s.length(); // length of s
>> +        int m = t.length(); // length of t
>> +
>>           if (n == 0) {
>>               return m;
>> -        }
>> -        if (m == 0) {
>> +        } else if (m == 0) {
>>               return n;
>>           }
>> -        d = new int[n + 1][m + 1];
>>
>> -        // Step 2
>> -        for (i = 0; i <= n; i++) {
>> -            d[i][0] = i;
>> -        }
>> +        int p[] = new int[n+1]; //'previous' cost array, horizontally
>> +        int d[] = new int[n+1]; // cost array, horizontally
>> +               int _d[]; //placeholder to assist in swapping p and d
>>
>> -        for (j = 0; j <= m; j++) {
>> -            d[0][j] = j;
>> -        }
>> -
>> -        // Step 3
>> -        for (i = 1; i <= n; i++) {
>> -            s_i = s.charAt(i - 1);
>> -
>> -            // Step 4
>> -            for (j = 1; j <= m; j++) {
>> -                t_j = t.charAt(j - 1);
>> -
>> -                // Step 5
>> -                if (s_i == t_j) {
>> -                    cost = 0;
>> -                } else {
>> -                    cost = 1;
>> -                }
>> +        //indexes into strings s and t
>> +        int i; // iterates through s
>> +        int j; // iterates through t
>>
>> -                // Step 6
>> -                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i -
>> 1][j - 1] + cost);
>> -            }
>> -        }
>> +        char t_j; // jth character of t
>>
>> -        // Step 7
>> -        return d[n][m];
>> -    }
>> +        int cost; // cost
>>
>> -    /**
>> -     * 

Gets the minimum of three int values.

>> - * >> - * @param a value 1 >> - * @param b value 2 >> - * @param c value 3 >> - * @return the smallest of the values >> - */ >> - private static int min(int a, int b, int c) { >> - // Method copied from NumberUtils to avoid dependency on >> subpackage >> - if (b < a) { >> - a = b; >> - } >> - if (c < a) { >> - a = c; >> + for (i = 0; i<=n; i++) { >> + p[i] = i; >> } >> - return a; >> + >> + for (j = 1; j<=m; j++) { >> + t_j = t.charAt(j-1); >> + d[0] = j; >> + >> + for (i=1; i<=n; i++) { >> + cost = s.charAt(i-1)==t_j ? 0 : 1; >> + >> + d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1), >> p[i-1]+cost); //minimum of c >> ell to the left+1, to the top+1, diagonally left and up +cost >> + } >> + >> + //copy current distance counts to 'previous row' distance >> counts >> + _d = p; >> + p = d; >> + d = _d; >> + } >> + >> + //our last action in the above loop was to switch d and p, so >> p now actual >> ly has the most recent cost counts >> + return p[n]; >> } >> >> } >> >> >> --------------------------------------------------------------------- >> To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org >> For additional commands, e-mail: commons-dev-help@jakarta.apache.org >> > > > --------------------------------------------------------------------- > To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org > For additional commands, e-mail: commons-dev-help@jakarta.apache.org > --------------------------------------------------------------------- To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org For additional commands, e-mail: commons-dev-help@jakarta.apache.org