commons-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chas Emerick <cemer...@snowtide.com>
Subject [PATCH] StringUtils.getLevenshteinDistance() fix for OutOfMemoryError when used with LONG strings
Date Wed, 22 Oct 2003 18:03:25 GMT
I've never submitted a patch for an open-source project before (never  
got around to it lo these many years I guess), so I apologize for any  
errors in form or convention that I commit.

I was planning on using the  
StringUtils.getLevenshteinDistance(String,String) method in a  
particular circumstance where I needed to get the edit distance between  
two large strings (anywhere between 10K - 500K characters each).   
However, I found that calling that method (as of v2.0 of commons-lang)  
would result in an OutOfMemoryError when strings of such lengths were  
provided.

A quick look at the source revealed that the current implementation  
(which uses the sample impl. at http://www.merriampark.com/ld.htm)  
creates a matrix with dimensions corresponding to the lengths of the  
two strings provided.  Clearly, a 100K x 100K int[] is problematic.

Therefore, I've (mostly) rewritten the method using a two int arrays of  
the size of the first string's length, and the method now works  
properly with larger strings (beastly slow, but that's quadratic  
algorithms for you) .  I've tested the new implementation against the  
ten or so testcases mentioned in the javadocs, as well as another  
half-dozen of my own, and everything looks good.

If my mail client botches the patch diff, you can get it at  
http://www.snowtide.com/commons-lang-LDpatch.txt

Chas Emerick   |   cemerick@snowtide.com

http://www.snowtide.com
Snowtide Informatics Systems, Inc.

======================================================================== 
==
--- StringUtils.java.old        Wed Oct 22 12:58:04 2003
+++ StringUtils.java    Wed Oct 22 13:51:36 2003
@@ -4255,8 +4255,8 @@
       * another, where each change is a single character modification  
(deletion,
       * insertion or substitution).</p>
       *
-     * <p>This implementation of the Levenshtein distance algorithm
-     * is from <a  
href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.
htm</a></p>
+     * <p>This implementation of the Levenshtein distance algorithm  
was originally based
upon the one
+     * presented at <a  
href="http://www.merriampark.com/ld.htm">http://www.merriampark.co
m/ld.htm</a></p>
       *
       * <pre>
       * StringUtils.getLevenshteinDistance(null, *)             =  
IllegalArgumentException
@@ -4281,76 +4281,64 @@
          if (s == null || t == null) {
              throw new IllegalArgumentException("Strings must not be  
null");
          }
-        int d[][]; // matrix
-        int n; // length of s
-        int m; // length of t
-        int i; // iterates through s
-        int j; // iterates through t
-        char s_i; // ith character of s
-        char t_j; // jth character of t
-        int cost; // cost
-
-        // Step 1
-        n = s.length();
-        m = t.length();
+
+               /*
+               The difference between this impl. and the previous is  
that, rather than cr
eating and retaining a matrix of size
+               s.length()+1 by t.length()+1, we maintain two  
single-dimensional arrays of
  length s.length()+1.  The first, d,
+               is the 'current working' distance array that maintains  
the newest distance
  cost counts as we iterate through
+               the characters of String s.  Each time we increment the  
index of String t
we are comparing, d is copied to p,
+               the second int[].  Doing so allows us to retain the  
previous cost counts a
s required by the algorithm
+               (taking the minimum of the cost count to the left, up  
one, and diagonally
up and to the left of the current
+               cost count being calculated).  (Note that the arrays  
aren't really copied
anymore, just switched...this is clearly much
+               better than cloning an array or doing a  
System.arraycopy() each time throu
gh the outer loop.)
+
+               Effectively, the difference between the two  
implementations is this one do
es not cause an out of memory condition
+               when calculating the LD over two very large strings.
+               */
+
+        int n = s.length(); // length of s
+        int m = t.length(); // length of t
+
          if (n == 0) {
              return m;
-        }
-        if (m == 0) {
+        } else if (m == 0) {
              return n;
          }
-        d = new int[n + 1][m + 1];

-        // Step 2
-        for (i = 0; i <= n; i++) {
-            d[i][0] = i;
-        }
+        int p[] = new int[n+1]; //'previous' cost array, horizontally
+        int d[] = new int[n+1]; // cost array, horizontally
+               int _d[]; //placeholder to assist in swapping p and d

-        for (j = 0; j <= m; j++) {
-            d[0][j] = j;
-        }
-
-        // Step 3
-        for (i = 1; i <= n; i++) {
-            s_i = s.charAt(i - 1);
-
-            // Step 4
-            for (j = 1; j <= m; j++) {
-                t_j = t.charAt(j - 1);
-
-                // Step 5
-                if (s_i == t_j) {
-                    cost = 0;
-                } else {
-                    cost = 1;
-                }
+        //indexes into strings s and t
+        int i; // iterates through s
+        int j; // iterates through t

-                // Step 6
-                d[i][j] = min(d[i - 1][j] + 1, d[i][j - 1] + 1, d[i -  
1][j - 1] + cost);
-            }
-        }
+        char t_j; // jth character of t

-        // Step 7
-        return d[n][m];
-    }
+        int cost; // cost

-    /**
-     * <p>Gets the minimum of three <code>int</code> values.</p>
-     *
-     * @param a  value 1
-     * @param b  value 2
-     * @param c  value 3
-     * @return  the smallest of the values
-     */
-    private static int min(int a, int b, int c) {
-        // Method copied from NumberUtils to avoid dependency on  
subpackage
-        if (b < a) {
-            a = b;
-        }
-        if (c < a) {
-            a = c;
+        for (i = 0; i<=n; i++) {
+            p[i] = i;
          }
-        return a;
+
+        for (j = 1; j<=m; j++) {
+            t_j = t.charAt(j-1);
+            d[0] = j;
+
+            for (i=1; i<=n; i++) {
+                cost = s.charAt(i-1)==t_j ? 0 : 1;
+
+                d[i] = Math.min(Math.min(d[i-1]+1, p[i]+1),  
p[i-1]+cost);  //minimum of c
ell to the left+1, to the top+1, diagonally left and up +cost
+            }
+
+            //copy current distance counts to 'previous row' distance  
counts
+            _d = p;
+            p = d;
+            d = _d;
+        }
+
+        //our last action in the above loop was to switch d and p, so  
p now actual
ly has the most recent cost counts
+        return p[n];
      }

  }


---------------------------------------------------------------------
To unsubscribe, e-mail: commons-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: commons-dev-help@jakarta.apache.org


Mime
View raw message