lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <cutt...@apache.org>
Subject Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search FuzzyQuery.java FuzzyTermEnum.java
Date Fri, 17 Sep 2004 18:12:01 GMT
Christoph Goller wrote:
> Doug Cutting wrote:
>> It might also be good if one could set the non-fuzzy prefix length 
>> used by the QueryParser.  As it stands, fuzzy queries with large 
>> indexes that use QueryParser are so slow they're unusable.  But a 
>> default prefix of just a couple of characters would make a huge 
>> performance improvement.
> 
> I will think about extending QueryParser as you proposed (should not be 
> too difficult, we only have to find a reasonable syntax)

I don't think we need to extend the query syntax, just add a method:

    QueryParser.setFuzzyPrefixLength(int)

and have the default value be 2.  This is not back-compatible, but I 
think it makes fuzzy queries *much* more usable, and is hence worth the 
incompatibility.

>> Another idea might be to, rather than (or in addition to) limiting the 
>> number of expanded terms by similarity, to limit them by number.  So 
>> one could keep, e.g., just the top-scoring 100 terms whose score is 
>> greater than 0.5, or somesuch.  This way FuzzyQuery would never 
>> trigger BooleanQuery.TooManyClauses.  What do you think?
> 
> Also sounds reasonable. Of course it does not solve the efficiency problem
> of rewriting a FuzzyQuery. Do you think the expensive part is going through
> all terms of a field or is it the Levenstein-computation, or both?

Both.  Simply enumerating all of the terms is pretty fast for small 
collections (<1M documents), but adding a Levenstein computation for all 
words makes fuzzy search prohibitively slow.  But the Levenstein 
computation will not be too expensive if it doesn't need to be computed 
for so many terms.  On average, a two-character prefix should reduce the 
number of Levenstein computations by a factor of several hundred, no?

> I hope you like my extensions to PraseQuery and PhrasePrefixQuery too :-)

Yes, very much!  I like all of your changes!  Thanks!

The QueryParser feature that would leverage this best might be to make 
asterisk (*) in a phrase increment the position.  Thus a search for 
""John * Kennedy" would match "John Fitzgerald Kennedy", in case you'd 
forgotten his middle name.

Doug

---------------------------------------------------------------------
To unsubscribe, e-mail: lucene-dev-unsubscribe@jakarta.apache.org
For additional commands, e-mail: lucene-dev-help@jakarta.apache.org


Mime
View raw message