Cool. Sounds good and I'll have a go at applying your patch.
Any chance of you submitting your other unit tests for the class?
Hen
On Wed, 22 Oct 2003, Chas Emerick wrote:
> I've never submitted a patch for an opensource project before (never
> got around to it lo these many years I guess), so I apologize for any
> errors in form or convention that I commit.
>
> I was planning on using the
> StringUtils.getLevenshteinDistance(String,String) method in a
> particular circumstance where I needed to get the edit distance between
> two large strings (anywhere between 10K  500K characters each).
> However, I found that calling that method (as of v2.0 of commonslang)
> would result in an OutOfMemoryError when strings of such lengths were
> provided.
>
> A quick look at the source revealed that the current implementation
> (which uses the sample impl. at http://www.merriampark.com/ld.htm)
> creates a matrix with dimensions corresponding to the lengths of the
> two strings provided. Clearly, a 100K x 100K int[] is problematic.
>
> Therefore, I've (mostly) rewritten the method using a two int arrays of
> the size of the first string's length, and the method now works
> properly with larger strings (beastly slow, but that's quadratic
> algorithms for you) . I've tested the new implementation against the
> ten or so testcases mentioned in the javadocs, as well as another
> halfdozen of my own, and everything looks good.
>
> If my mail client botches the patch diff, you can get it at
> http://www.snowtide.com/commonslangLDpatch.txt
>
> Chas Emerick  cemerick@snowtide.com
>
> http://www.snowtide.com
> Snowtide Informatics Systems, Inc.
>
> ========================================================================
> ==
>  StringUtils.java.old Wed Oct 22 12:58:04 2003
> +++ StringUtils.java Wed Oct 22 13:51:36 2003
> @@ 4255,8 +4255,8 @@
> * another, where each change is a single character modification
> (deletion,
> * insertion or substitution).</p>
> *
>  * <p>This implementation of the Levenshtein distance algorithm
>  * is from <a
> href="http://www.merriampark.com/ld.htm">http://www.merriampark.com/ld.
> htm</a></p>
> + * <p>This implementation of the Levenshtein distance algorithm
> was originally based
> upon the one
> + * presented at <a
> href="http://www.merriampark.com/ld.htm">http://www.merriampark.co
> m/ld.htm</a></p>
> *
> * <pre>
> * StringUtils.getLevenshteinDistance(null, *) =
> IllegalArgumentException
> @@ 4281,76 +4281,64 @@
> if (s == null  t == null) {
> throw new IllegalArgumentException("Strings must not be
> null");
> }
>  int d[][]; // matrix
>  int n; // length of s
>  int m; // length of t
>  int i; // iterates through s
>  int j; // iterates through t
>  char s_i; // ith character of s
>  char t_j; // jth character of t
>  int cost; // cost
> 
>  // Step 1
>  n = s.length();
>  m = t.length();
> +
> + /*
> + The difference between this impl. and the previous is
> that, rather than cr
> eating and retaining a matrix of size
> + s.length()+1 by t.length()+1, we maintain two
> singledimensional arrays of
> length s.length()+1. The first, d,
> + is the 'current working' distance array that maintains
> the newest distance
> cost counts as we iterate through
> + the characters of String s. Each time we increment the
> index of String t
> we are comparing, d is copied to p,
> + the second int[]. Doing so allows us to retain the
> previous cost counts a
> s required by the algorithm
> + (taking the minimum of the cost count to the left, up
> one, and diagonally
> up and to the left of the current
> + cost count being calculated). (Note that the arrays
> aren't really copied
> anymore, just switched...this is clearly much
> + better than cloning an array or doing a
> System.arraycopy() each time throu
> gh the outer loop.)
> +
> + Effectively, the difference between the two
> implementations is this one do
> es not cause an out of memory condition
> + when calculating the LD over two very large strings.
> + */
> +
> + int n = s.length(); // length of s
> + int m = t.length(); // length of t
> +
> if (n == 0) {
> return m;
>  }
>  if (m == 0) {
> + } else if (m == 0) {
> return n;
> }
>  d = new int[n + 1][m + 1];
>
>  // Step 2
>  for (i = 0; i <= n; i++) {
>  d[i][0] = i;
>  }
> + int p[] = new int[n+1]; //'previous' cost array, horizontally
> + int d[] = new int[n+1]; // cost array, horizontally
> + int _d[]; //placeholder to assist in swapping p and d
>
>  for (j = 0; j <= m; j++) {
>  d[0][j] = j;
>  }
> 
>  // Step 3
>  for (i = 1; i <= n; i++) {
>  s_i = s.charAt(i  1);
> 
>  // Step 4
>  for (j = 1; j <= m; j++) {
>  t_j = t.charAt(j  1);
> 
>  // Step 5
>  if (s_i == t_j) {
>  cost = 0;
>  } else {
>  cost = 1;
>  }
> + //indexes into strings s and t
> + int i; // iterates through s
> + int j; // iterates through t
>
>  // Step 6
>  d[i][j] = min(d[i  1][j] + 1, d[i][j  1] + 1, d[i 
> 1][j  1] + cost);
>  }
>  }
> + char t_j; // jth character of t
>
>  // Step 7
>  return d[n][m];
>  }
> + int cost; // cost
>
>  /**
>  * <p>Gets the minimum of three <code>int</code> values.</p>
>  *
>  * @param a value 1
>  * @param b value 2
>  * @param c value 3
>  * @return the smallest of the values
>  */
>  private static int min(int a, int b, int c) {
>  // Method copied from NumberUtils to avoid dependency on
> subpackage
>  if (b < a) {
>  a = b;
>  }
>  if (c < a) {
>  a = c;
> + for (i = 0; i<=n; i++) {
> + p[i] = i;
> }
>  return a;
> +
> + for (j = 1; j<=m; j++) {
> + t_j = t.charAt(j1);
> + d[0] = j;
> +
> + for (i=1; i<=n; i++) {
> + cost = s.charAt(i1)==t_j ? 0 : 1;
> +
> + d[i] = Math.min(Math.min(d[i1]+1, p[i]+1),
> p[i1]+cost); //minimum of c
> ell to the left+1, to the top+1, diagonally left and up +cost
> + }
> +
> + //copy current distance counts to 'previous row' distance
> counts
> + _d = p;
> + p = d;
> + d = _d;
> + }
> +
> + //our last action in the above loop was to switch d and p, so
> p now actual
> ly has the most recent cost counts
> + return p[n];
> }
>
> }
>
>
> 
> To unsubscribe, email: commonsdevunsubscribe@jakarta.apache.org
> For additional commands, email: commonsdevhelp@jakarta.apache.org
>

To unsubscribe, email: commonsdevunsubscribe@jakarta.apache.org
For additional commands, email: commonsdevhelp@jakarta.apache.org
