lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From David Spencer <>
Subject Re: cvs commit: jakarta-lucene/src/java/org/apache/lucene/search
Date Fri, 17 Sep 2004 20:39:43 GMT
Doug Cutting wrote:

> Christoph Goller wrote:
>> Doug Cutting wrote:
>>> It might also be good if one could set the non-fuzzy prefix length 
>>> used by the QueryParser.  As it stands, fuzzy queries with large 
>>> indexes that use QueryParser are so slow they're unusable.  But a 
>>> default prefix of just a couple of characters would make a huge 
>>> performance improvement.
>> I will think about extending QueryParser as you proposed (should not 
>> be too difficult, we only have to find a reasonable syntax)
> I don't think we need to extend the query syntax, just add a method:
>    QueryParser.setFuzzyPrefixLength(int)
> and have the default value be 2.  This is not back-compatible, but I 
> think it makes fuzzy queries *much* more usable, and is hence worth the 
> incompatibility.
>>> Another idea might be to, rather than (or in addition to) limiting 
>>> the number of expanded terms by similarity, to limit them by number.  
>>> So one could keep, e.g., just the top-scoring 100 terms whose score 
>>> is greater than 0.5, or somesuch.  This way FuzzyQuery would never 
>>> trigger BooleanQuery.TooManyClauses.  What do you think?

Replying to the above (TooManyClauses) - if not being done, perhaps it 
could not throw this and instead just catch it and happily return the 
current, incomplete, query as a best effort kind of thing.

>> Also sounds reasonable. Of course it does not solve the efficiency 
>> problem
>> of rewriting a FuzzyQuery. Do you think the expensive part is going 
>> through
>> all terms of a field or is it the Levenstein-computation, or both?
> Both.  Simply enumerating all of the terms is pretty fast for small 
> collections (<1M documents), but adding a Levenstein computation for all 

I guess if you Lev distance is a parameter then you could check the diff 
of the string lengths first, and if they differ by more than that param 
you skip the Lev calculation.

> words makes fuzzy search prohibitively slow.  But the Levenstein 
> computation will not be too expensive if it doesn't need to be computed 
> for so many terms.  On average, a two-character prefix should reduce the 
> number of Levenstein computations by a factor of several hundred, no?
>> I hope you like my extensions to PraseQuery and PhrasePrefixQuery too :-)
> Yes, very much!  I like all of your changes!  Thanks!
> The QueryParser feature that would leverage this best might be to make 
> asterisk (*) in a phrase increment the position.  Thus a search for 
> ""John * Kennedy" would match "John Fitzgerald Kennedy", in case you'd 
> forgotten his middle name.
> Doug
> ---------------------------------------------------------------------
> To unsubscribe, e-mail:
> For additional commands, e-mail:

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message