lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Christoph Goller <gol...@detego-software.de>
Subject Re: [PATCH] FuzzyTermEnum optimization and refactor
Date Sat, 23 Oct 2004 13:57:46 GMT
Jonathan Hager schrieb:
> Since fuzzy searching is kind of slow, I took a look at it to see if
> it could be improved.  I saw speed improvements of 10% - 60% by making
> a couple changes.  Along the way I fixed a potential bug or two that I
> saw.
> 
> The patch is here:
> http://www.hagerfamily.com/patches/FuzzyTermEnumOptimizePatch.txt
> 
> I've never submitted a patch before, so don't flame me if I do or say
> anything wrong.

Thank you for the patch. Everything you say sounds very reasonable.
A committer will certainly look into your patch. Please put it into
bugzilla so that it does not get lost. Also include the explanation
from your mail.

> 
> What Changed?
> 
> Since the word was discarded if the edit distance for the word was
> above a certain threshold, I updated the distance algorithm to abort
> if at any time during the calculation it is determined that the best
> possible outcome of the edit distance algorithm is above this
> threshold.  The source code has a great explanation.
> 
> I also reduced the amount of floating point math, reduced the amount
> of potential space the array takes in its first dimension, removed the
> potential divide by 0 error when one term is an empty string, and
> fixed a bug where an IllegalArgumentException was thrown if the class
> was somehow initialized wrong, instead of looking at the arguments.
> 
> The behavior is almost identical.  The exception is that similarity is
> set to 0.0 when it is guaranteed to be below the minimum similarity.
> 
> Results
> 
> I saw the biggest improvement from longer words, which makes a sense. 
> My long word was "bridgetown" and I saw a 60% improvement on this. 
> The biggest improvement are for words that are farthest away from the
> median length of the words in the index.  Short words (1-3 characters)
> saw a 30% improvement.  Medium words saw a 10% improvement (5-7
> characters).  These improvements are with the prefix set to 0.
> 
> Would someone be willing to validate that they see similar results? 
> Especially on large indexes.
> 
> Other Questions
> 
> I am still wondering about two things.  1) Why can't minimumSimilarity
> be less than 0?  Similarity may be negative, especially for small
> words.  2) Why does the formula for similarity not return a number
> between 0.0 (no common characters) and 1.0 (identical) inclusive? 
> This would be easy to do, just use Math.max() instead of Math.min(). 
> Of course this would change the order of the results.
> 
> As an example the similarity for
>   "to" and "todor" =  1 - 3/2 = -0.5
>    "tod" and "for"  = 1 - 2/3 = +0.33

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message