Bob Carpenter's FuzzyQuery refactoring

Key: LUCENE691
URL: http://issues.apache.org/jira/browse/LUCENE691
Project: Lucene  Java
Issue Type: Improvement
Components: Search
Reporter: Otis Gospodnetic
Priority: Minor
I'll just paste Bob's complete email here.
I refactored the org.apache.lucene.search.FuzzyTermEnum
edit distance implementation. It now only uses a single
pair of arrays, and those never get resized. That required
turning the order of text/target around in the loops. You'll
see that with the pair of arrays method, they get reused
handoverhand, and are assigned to local variables in the
tight loops.
I removed the calculation of min distance and replaced
it with a boolean  the min wasn't needed, only the test vs.
the max. I also flipped some variables around so there's
one less addition in the very inner loop and the arrays are
now looping in the ordinary way (starting at 0 with a < comparison).
I also commented out the redundant definition of the public close()
[which just called super.close() and had none of its own doc.]
I also just compute the max distance each time rather than
fiddling with an array  it's just a little arithmetic done once
per term, but that could be put back.
I also rewrote min(int,int,int) to get rid of intermediate
assignments. Is there a lib for this kind of thing?
An intermediate refactoring that does the handoverhand
with the existing array and resizing strategy is in
FuzzyTermEnum.intermed.java.
I ran the unit tests as follows on both versions (my hat's off to the
build.xml author(s)  this all just worked out of the box and was
really easy to follow the first through):
C:\java\lucene2.0.0>ant Djunit.includes="" Dtestcase=TestFuzzyQuery test
Buildfile: build.xml
javaccuptodatecheck:
javaccnotice:
init:
common.compilecore:
[javac] Compiling 1 source file to
C:\java\lucene2.0.0\build\classes\java
compilecore:
compiledemo:
common.compiletest:
compiletest:
test:
[junit] Testsuite: org.apache.lucene.search.TestFuzzyQuery
[junit] Tests run: 2, Failures: 0, Errors: 0, Time elapsed: 0.453 sec
BUILD SUCCESSFUL
Total time: 2 seconds
Does anyone have regression/performance test harnesses?
The unit tests were pretty minimal (which is a good thing!).
It'd be nice to test the min impl (ternary vs. if/then)
and the assumption about not allocating an
array of max distances. All told, the refactored version
should be a modest speed improvement, mainly from
unfolding the arrays so they're local onedimensional refs.
I don't know what the protocol is for oneoff contributions.
I'm happy with the Apache license, so that shouldn't
be a problem. I also don't know whether you use tabs
or spaces  I untabified the final version and used your
twospace format in emacs.
 Bob Carpenter
package org.apache.lucene.search;
/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import java.io.IOException;
/** Subclass of FilteredTermEnum for enumerating all terms that are similiar
* to the specified filter term.
*
* <p>Term enumerations are always ordered by Term.compareTo(). Each term in
* the enumeration is greater than all that precede it.
*/
public final class FuzzyTermEnum extends FilteredTermEnum {
/* This should be somewhere around the average long word.
* If it is longer, we waste time and space. If it is shorter, we waste a
* little bit of time growing the array as we encounter longer words.
*/
private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;
/* Allows us save time required to create a new array
* everytime similarity is called. These are slices that
* will be reused during dynamic programming handoverhand
* style.
*/
private final int[] d0;
private final int[] d1;
private float similarity;
private boolean endEnum = false;
private Term searchTerm = null;
private final String field;
private final String text;
private final String prefix;
private final float minimumSimilarity;
private final float scale_factor;
/**
* Creates a FuzzyTermEnum with an empty prefix and a minSimilarity of 0.5f.
* <p>
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader
* @param term
* @throws IOException
* @see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term) throws IOException {
this(reader, term, FuzzyQuery.defaultMinSimilarity, FuzzyQuery.defaultPrefixLength);
}
/**
* Creates a FuzzyTermEnum with an empty prefix.
* <p>
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader
* @param term
* @param minSimilarity
* @throws IOException
* @see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term, float minSimilarity) throws IOException
{
this(reader, term, minSimilarity, FuzzyQuery.defaultPrefixLength);
}
/**
* Constructor for enumeration of all terms from specified <code>reader</code>
which share a prefix of
* length <code>prefixLength</code> with <code>term</code> and which
have a fuzzy similarity >
* <code>minSimilarity</code>.
* <p>
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader Delivers terms.
* @param term Pattern term.
* @param minSimilarity Minimum required similarity for terms from the reader. Default value
is 0.5f.
* @param prefixLength Length of required common prefix. Default value is 0.
* @throws IOException
*/
public FuzzyTermEnum(IndexReader reader, Term term, final float minSimilarity, final int
prefixLength) throws IOException {
super();
if (minSimilarity >= 1.0f)
throw new IllegalArgumentException("minimumSimilarity cannot be greater than or equal
to 1");
else if (minSimilarity < 0.0f)
throw new IllegalArgumentException("minimumSimilarity cannot be less than 0");
if(prefixLength < 0)
throw new IllegalArgumentException("prefixLength cannot be less than 0");
this.minimumSimilarity = minSimilarity;
this.scale_factor = 1.0f / (1.0f  minimumSimilarity);
this.searchTerm = term;
this.field = searchTerm.field();
//The prefix could be longer than the word.
//It's kind of silly though. It means we must match the entire word.
final int fullSearchTermLength = searchTerm.text().length();
final int realPrefixLength = prefixLength > fullSearchTermLength ? fullSearchTermLength
: prefixLength;
this.text = searchTerm.text().substring(realPrefixLength);
this.prefix = searchTerm.text().substring(0, realPrefixLength);
this.d0 = new int[this.text.length()+1];
this.d1 = new int[this.d0.length];
setEnum(reader.terms(new Term(searchTerm.field(), prefix)));
}
/**
* The termCompare method in FuzzyTermEnum uses Levenshtein distance to
* calculate the distance between the given term and the comparing term.
*/
protected final boolean termCompare(Term term) {
if (field == term.field() && term.text().startsWith(prefix)) {
final String target = term.text().substring(prefix.length());
this.similarity = similarity(target);
return (similarity > minimumSimilarity);
}
endEnum = true;
return false;
}
public final float difference() {
return (float)((similarity  minimumSimilarity) * scale_factor);
}
public final boolean endEnum() {
return endEnum;
}
/******************************
* Compute Levenshtein distance
******************************/
/**
* Finds and returns the smallest of three integers
*/
private static final int min(int a, int b, int c) {
// removed assignments to use double ternary
return (a < b)
? ((a < c) ? a : c)
: ((b < c) ? b: c);
// alt form is:
// if (a < b) { if (a < c) return a; else return c; }
// if (b < c) return b; else return c;
}
/**
* <p>Similarity returns a number that is 1.0f or less (including negative numbers)
* based on how similar the Term is compared to a target term. It returns
* exactly 0.0f when
* <pre>
* editDistance < maximumEditDistance</pre>
* Otherwise it returns:
* <pre>
* 1  (editDistance / length)</pre>
* where length is the length of the shortest term (text or target) including a
* prefix that are identical and editDistance is the Levenshtein distance for
* the two words.</p>
*
* <p>Embedded within this algorithm is a failfast Levenshtein distance
* algorithm. The failfast algorithm differs from the standard Levenshtein
* distance algorithm in that it is aborted if it is discovered that the
* mimimum distance between the words is greater than some threshold.
*
* <p>To calculate the maximum distance threshold we use the following formula:
* <pre>
* (1  minimumSimilarity) * length</pre>
* where length is the shortest term including any prefix that is not part of the
* similarity comparision. This formula was derived by solving for what maximum value
* of distance returns false for the following statements:
* <pre>
* similarity = 1  ((float)distance / (float) (prefixLength + Math.min(textlen, targetlen)));
* return (similarity > minimumSimilarity);</pre>
* where distance is the Levenshtein distance for the two words.
* </p>
* <p>Levenshtein distance (also known as edit distance) is a measure of similiarity
* between two strings where the distance is measured as the number of character
* deletions, insertions or substitutions required to transform one string to
* the other string.
* @param target the target word or phrase
* @return the similarity, 0.0 or less indicates that it matches less than the required
* threshold and 1.0 indicates that the text and target are identical
*/
private synchronized final float similarity(final String target) {
final int m = target.length();
final int n = text.length();
if (n == 0) {
//we don't have anything to compare. That means if we just add
//the letters for m we get the new word
return prefix.length() == 0 ? 0.0f : 1.0f  ((float) m / prefix.length());
}
if (m == 0) {
return prefix.length() == 0 ? 0.0f : 1.0f  ((float) n / prefix.length());
}
final int maxDistance = calculateMaxDistance(m);
if (maxDistance < Math.abs(mn)) {
//just adding the characters of m to n or viceversa results in
//too many edits
//for example "pre" length is 3 and "prefixes" length is 8. We can see that
//given this optimal circumstance, the edit distance cannot be less than 5.
//which is 83 or more precisesly Math.abs(38).
//if our maximum edit distance is 4, then we can discard this word
//without looking at it.
return 0.0f;
}
int[] dLast = d0; // set locals for efficiency
int[] dCurrent = d1;
for (int j = 0; j <= n; j++) dCurrent[j] = j;
for (int i = 0; i < m; ) {
final char s_i = target.charAt(i);
int[] dTemp = dLast;
dLast = dCurrent; // previously: d[ii]
dCurrent = dTemp; // previously: d[i]
boolean prune = (dCurrent[0] = ++i) > maxDistance; // true if d[i][0] is too large
for (int j = 0; j < n; j++) {
dCurrent[j+1] = (s_i == text.charAt(j))
? min(dLast[j+1]+1, dCurrent[j]+1, dLast[j])
: min(dLast[j+1], dCurrent[j], dLast[j])+1;
if (prune && dCurrent[j+1] <= maxDistance)
prune = false;
}
// (prune==false) iff (dCurrent[j] < maxDistance) for some j
if (prune) {
return 0.0f;
}
}
// this will return less than 0.0 when the edit distance is
// greater than the number of characters in the shorter word.
// but this was the formula that was previously used in FuzzyTermEnum,
// so it has not been changed (even though minimumSimilarity must be
// greater than 0.0)
return 1.0F  dCurrent[n]/(float)(prefix.length() + Math.min(n,m));
}
private int calculateMaxDistance(int m) {
return (int) ((1minimumSimilarity) * (Math.min(text.length(), m) + prefix.length()));
}
/* This is redundant
public void close() throws IOException {
super.close(); //call super.close() and let the garbage collector do its work.
}
*/
}
package org.apache.lucene.search;
/**
* Copyright 2004 The Apache Software Foundation
*
* Licensed under the Apache License, Version 2.0 (the "License");
* you may not use this file except in compliance with the License.
* You may obtain a copy of the License at
*
* http://www.apache.org/licenses/LICENSE2.0
*
* Unless required by applicable law or agreed to in writing, software
* distributed under the License is distributed on an "AS IS" BASIS,
* WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
* See the License for the specific language governing permissions and
* limitations under the License.
*/
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import java.io.IOException;
/** Subclass of FilteredTermEnum for enumerating all terms that are similiar
* to the specified filter term.
*
* <p>Term enumerations are always ordered by Term.compareTo(). Each term in
* the enumeration is greater than all that precede it.
*/
public final class FuzzyTermEnum extends FilteredTermEnum {
/* This should be somewhere around the average long word.
* If it is longer, we waste time and space. If it is shorter, we waste a
* little bit of time growing the array as we encounter longer words.
*/
private static final int TYPICAL_LONGEST_WORD_IN_INDEX = 19;
/* Allows us save time required to create a new array
* everytime similarity is called. These are slices that
* will be reused during dynamic programming handoverhand
* style. They get resized, if necessary, by growDistanceArrays(int).
*/
private int[] d0;
private int[] d1;
private float similarity;
private boolean endEnum = false;
private Term searchTerm = null;
private final String field;
private final String text;
private final String prefix;
private final float minimumSimilarity;
private final float scale_factor;
/**
* Creates a FuzzyTermEnum with an empty prefix and a minSimilarity of 0.5f.
* <p>
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader
* @param term
* @throws IOException
* @see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term) throws IOException {
this(reader, term, FuzzyQuery.defaultMinSimilarity, FuzzyQuery.defaultPrefixLength);
}
/**
* Creates a FuzzyTermEnum with an empty prefix.
* <p>
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader
* @param term
* @param minSimilarity
* @throws IOException
* @see #FuzzyTermEnum(IndexReader, Term, float, int)
*/
public FuzzyTermEnum(IndexReader reader, Term term, float minSimilarity) throws IOException
{
this(reader, term, minSimilarity, FuzzyQuery.defaultPrefixLength);
}
/**
* Constructor for enumeration of all terms from specified <code>reader</code>
which share a prefix of
* length <code>prefixLength</code> with <code>term</code> and which
have a fuzzy similarity >
* <code>minSimilarity</code>.
* <p>
* After calling the constructor the enumeration is already pointing to the first
* valid term if such a term exists.
*
* @param reader Delivers terms.
* @param term Pattern term.
* @param minSimilarity Minimum required similarity for terms from the reader. Default value
is 0.5f.
* @param prefixLength Length of required common prefix. Default value is 0.
* @throws IOException
*/
public FuzzyTermEnum(IndexReader reader, Term term, final float minSimilarity, final int
prefixLength) throws IOException {
super();
if (minSimilarity >= 1.0f)
throw new IllegalArgumentException("minimumSimilarity cannot be greater than or equal
to 1");
else if (minSimilarity < 0.0f)
throw new IllegalArgumentException("minimumSimilarity cannot be less than 0");
if(prefixLength < 0)
throw new IllegalArgumentException("prefixLength cannot be less than 0");
this.minimumSimilarity = minSimilarity;
this.scale_factor = 1.0f / (1.0f  minimumSimilarity);
this.searchTerm = term;
this.field = searchTerm.field();
//The prefix could be longer than the word.
//It's kind of silly though. It means we must match the entire word.
final int fullSearchTermLength = searchTerm.text().length();
final int realPrefixLength = prefixLength > fullSearchTermLength ? fullSearchTermLength
: prefixLength;
this.text = searchTerm.text().substring(realPrefixLength);
this.prefix = searchTerm.text().substring(0, realPrefixLength);
growDistanceArrays(TYPICAL_LONGEST_WORD_IN_INDEX);
setEnum(reader.terms(new Term(searchTerm.field(), prefix)));
}
/**
* The termCompare method in FuzzyTermEnum uses Levenshtein distance to
* calculate the distance between the given term and the comparing term.
*/
protected final boolean termCompare(Term term) {
if (field == term.field() && term.text().startsWith(prefix)) {
final String target = term.text().substring(prefix.length());
this.similarity = similarity(target);
return (similarity > minimumSimilarity);
}
endEnum = true;
return false;
}
public final float difference() {
return (float)((similarity  minimumSimilarity) * scale_factor);
}
public final boolean endEnum() {
return endEnum;
}
/******************************
* Compute Levenshtein distance
******************************/
/**
* Finds and returns the smallest of three integers
*/
private static final int min(int a, int b, int c) {
// removed assignments to use double ternary
return (a < b)
? ((a < c) ? a : c)
: ((b < c) ? b: c);
// alt form is:
// if (a < b) { if (a < c) return a; else return c; }
// if (b < c) return b; else return c;
}
/**
* <p>Similarity returns a number that is 1.0f or less (including negative numbers)
* based on how similar the Term is compared to a target term. It returns
* exactly 0.0f when
* <pre>
* editDistance < maximumEditDistance</pre>
* Otherwise it returns:
* <pre>
* 1  (editDistance / length)</pre>
* where length is the length of the shortest term (text or target) including a
* prefix that are identical and editDistance is the Levenshtein distance for
* the two words.</p>
*
* <p>Embedded within this algorithm is a failfast Levenshtein distance
* algorithm. The failfast algorithm differs from the standard Levenshtein
* distance algorithm in that it is aborted if it is discovered that the
* mimimum distance between the words is greater than some threshold.
*
* <p>To calculate the maximum distance threshold we use the following formula:
* <pre>
* (1  minimumSimilarity) * length</pre>
* where length is the shortest term including any prefix that is not part of the
* similarity comparision. This formula was derived by solving for what maximum value
* of distance returns false for the following statements:
* <pre>
* similarity = 1  ((float)distance / (float) (prefixLength + Math.min(textlen, targetlen)));
* return (similarity > minimumSimilarity);</pre>
* where distance is the Levenshtein distance for the two words.
* </p>
* <p>Levenshtein distance (also known as edit distance) is a measure of similiarity
* between two strings where the distance is measured as the number of character
* deletions, insertions or substitutions required to transform one string to
* the other string.
* @param target the target word or phrase
* @return the similarity, 0.0 or less indicates that it matches less than the required
* threshold and 1.0 indicates that the text and target are identical
*/
private synchronized final float similarity(final String target) {
final int m = target.length();
final int n = text.length();
if (n == 0) {
//we don't have anything to compare. That means if we just add
//the letters for m we get the new word
return prefix.length() == 0 ? 0.0f : 1.0f  ((float) m / prefix.length());
}
if (m == 0) {
return prefix.length() == 0 ? 0.0f : 1.0f  ((float) n / prefix.length());
}
final int maxDistance = calculateMaxDistance(m);
if (maxDistance < Math.abs(mn)) {
//just adding the characters of m to n or viceversa results in
//too many edits
//for example "pre" length is 3 and "prefixes" length is 8. We can see that
//given this optimal circumstance, the edit distance cannot be less than 5.
//which is 83 or more precisesly Math.abs(38).
//if our maximum edit distance is 4, then we can discard this word
//without looking at it.
return 0.0f;
}
//let's make sure we have enough room in our array to do the distance calculations.
if (d0.length <= m) {
growDistanceArrays(m);
}
int[] dLast = d0; // set local vars for efficiency ~ the old d[i1]
int[] dCurrent = d1; // ~ the old d[i]
for (int j = 0; j <= m; j++) dCurrent[j] = j;
for (int i = 0; i < n; ) {
final char s_i = text.charAt(i);
int[] dTemp = dLast;
dLast = dCurrent; // previously: d[ii]
dCurrent = dTemp; // previously: d[i]
boolean prune = (dCurrent[0] = ++i) > maxDistance; // true if d[i][0] is too large
for (int j = 0; j < m; j++) {
dCurrent[j+1] = (s_i == target.charAt(j))
? min(dLast[j+1]+1, dCurrent[j]+1, dLast[j])
: min(dLast[j+1], dCurrent[j], dLast[j])+1;
if (prune && dCurrent[j+1] <= maxDistance)
prune = false;
}
// (prune==false) iff (dCurrent[j] < maxDistance) for some j
if (prune) {
return 0.0f;
}
}
// this will return less than 0.0 when the edit distance is
// greater than the number of characters in the shorter word.
// but this was the formula that was previously used in FuzzyTermEnum,
// so it has not been changed (even though minimumSimilarity must be
// greater than 0.0)
return 1.0F  dCurrent[m]/(float)(prefix.length() + Math.min(n,m));
}
/**
* Grow the second dimension of the array slices, so that we can
* calculate the Levenshtein difference.
*/
private void growDistanceArrays(int m) {
d0 = new int[m+1];
d1 = new int[m+1];
}
private int calculateMaxDistance(int m) {
return (int) ((1minimumSimilarity) * (Math.min(text.length(), m) + prefix.length()));
}
/* This is redundant
public void close() throws IOException {
super.close(); //call super.close() and let the garbage collector do its work.
}
*/
}

This message is automatically generated by JIRA.

If you think it was sent incorrectly contact one of the administrators: http://issues.apache.org/jira/secure/Administrators.jspa

For more information on JIRA, see: http://www.atlassian.com/software/jira

To unsubscribe, email: javadevunsubscribe@lucene.apache.org
For additional commands, email: javadevhelp@lucene.apache.org
