I've never submitted a patch for an opensource project before (never
got around to it lo these many years I guess), so I apologize for any
errors in form or convention that I commit.
I was planning on using the
StringUtils.getLevenshteinDistance(String,String) method in a
particular circumstance where I needed to get the edit distance between
two large strings (anywhere between 10K  500K characters each).
However, I found that calling that method (as of v2.0 of commonslang)
would result in an OutOfMemoryError when strings of such lengths were
provided.
A quick look at the source revealed that the current implementation
(which uses the sample impl. at http://www.merriampark.com/ld.htm)
creates a matrix with dimensions corresponding to the lengths of the
two strings provided. Clearly, a 100K x 100K int[] is problematic.
Therefore, I've (mostly) rewritten the method using a two int arrays of
the size of the first string's length, and the method now works
properly with larger strings (beastly slow, but that's quadratic
algorithms for you) . I've tested the new implementation against the
ten or so testcases mentioned in the javadocs, as well as another
halfdozen of my own, and everything looks good.
If my mail client botches the patch diff, you can get it at
http://www.snowtide.com/commonslangLDpatch.txt
Chas Emerick  cemerick@snowtide.com
http://www.snowtide.com
Snowtide Informatics Systems, Inc.
========================================================================
==
 StringUtils.java.old Wed Oct 22 12:58:04 2003
+++ StringUtils.java Wed Oct 22 13:51:36 2003
@@ 4255,8 +4255,8 @@
* another, where each change is a single character modification
(deletion,
* insertion or substitution).</p>
*
 * <p>This implementation of the Levenshtein distance algorithm
 * is from <a
href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.
htm</a></p>
+ * <p>This implementation of the Levenshtein distance algorithm
was originally based
upon the one
+ * presented at <a
href="http://www.merriampark.com/ld.htm">http://www.merriampark.co
m/ld.htm</a></p>
*
* <pre>
* StringUtils.getLevenshteinDistance(null, *) =
IllegalArgumentException
@@ 4281,76 +4281,64 @@
if (s == null  t == null) {
throw new IllegalArgumentException("Strings must not be
null");
}
 int d[][]; // matrix
 int n; // length of s
 int m; // length of t
 int i; // iterates through s
 int j; // iterates through t
 char s_i; // ith character of s
 char t_j; // jth character of t
 int cost; // cost

 // Step 1
 n = s.length();
 m = t.length();
+
+ /*
+ The difference between this impl. and the previous is
that, rather than cr
eating and retaining a matrix of size
+ s.length()+1 by t.length()+1, we maintain two
singledimensional arrays of
length s.length()+1. The first, d,
+ is the 'current working' distance array that maintains
the newest distance
cost counts as we iterate through
+ the characters of String s. Each time we increment the
index of String t
we are comparing, d is copied to p,
+ the second int[]. Doing so allows us to retain the
previous cost counts a
s required by the algorithm
+ (taking the minimum of the cost count to the left, up
one, and diagonally
up and to the left of the current
+ cost count being calculated). (Note that the arrays
aren't really copied
anymore, just switched...this is clearly much
+ better than cloning an array or doing a
System.arraycopy() each time throu
gh the outer loop.)
+
+ Effectively, the difference between the two
implementations is this one do
es not cause an out of memory condition
+ when calculating the LD over two very large strings.
+ */
+
+ int n = s.length(); // length of s
+ int m = t.length(); // length of t
+
if (n == 0) {
return m;
 }
 if (m == 0) {
+ } else if (m == 0) {
return n;
}
 d = new int[n + 1][m + 1];
 // Step 2
 for (i = 0; i <= n; i++) {
 d[i][0] = i;
 }
+ int p[] = new int[n+1]; //'previous' cost array, horizontally
+ int d[] = new int[n+1]; // cost array, horizontally
+ int _d[]; //placeholder to assist in swapping p and d
 for (j = 0; j <= m; j++) {
 d[0][j] = j;
 }

 // Step 3
 for (i = 1; i <= n; i++) {
 s_i = s.charAt(i  1);

 // Step 4
 for (j = 1; j <= m; j++) {
 t_j = t.charAt(j  1);

 // Step 5
 if (s_i == t_j) {
 cost = 0;
 } else {
 cost = 1;
 }
+ //indexes into strings s and t
+ int i; // iterates through s
+ int j; // iterates through t
 // Step 6
 d[i][j] = min(d[i  1][j] + 1, d[i][j  1] + 1, d[i 
1][j  1] + cost);
 }
 }
+ char t_j; // jth character of t
 // Step 7
 return d[n][m];
 }
+ int cost; // cost
 /**
 * <p>Gets the minimum of three <code>int</code> values.</p>
 *
 * @param a value 1
 * @param b value 2
 * @param c value 3
 * @return the smallest of the values
 */
 private static int min(int a, int b, int c) {
 // Method copied from NumberUtils to avoid dependency on
subpackage
 if (b < a) {
 a = b;
 }
 if (c < a) {
 a = c;
+ for (i = 0; i<=n; i++) {
+ p[i] = i;
}
 return a;
+
+ for (j = 1; j<=m; j++) {
+ t_j = t.charAt(j1);
+ d[0] = j;
+
+ for (i=1; i<=n; i++) {
+ cost = s.charAt(i1)==t_j ? 0 : 1;
+
+ d[i] = Math.min(Math.min(d[i1]+1, p[i]+1),
p[i1]+cost); //minimum of c
ell to the left+1, to the top+1, diagonally left and up +cost
+ }
+
+ //copy current distance counts to 'previous row' distance
counts
+ _d = p;
+ p = d;
+ d = _d;
+ }
+
+ //our last action in the above loop was to switch d and p, so
p now actual
ly has the most recent cost counts
+ return p[n];
}
}

To unsubscribe, email: commonsdevunsubscribe@jakarta.apache.org
For additional commands, email: commonsdevhelp@jakarta.apache.org
