lucene-dev mailing list archives

From Doug Cutting
Subject Re: FuzzyQuery prefix length
Date Wed, 20 Oct 2004 16:14:35 GMT
Daniel Naber wrote:
> On Tuesday 12 October 2004 17:22, Doug Cutting wrote:
>>Which is worse: a person who searches for Photokopie~ in a 1000 document
>>collection does not find documents containing Fotokopie; or a person who
>>searches for Photokopie~ in a 1M document collection doesn't find
>>anything because it takes too long.  I think some relevant results are
>>better than none.
> I disagree, as the user who doesn't get the "Fotokopie" matches will not 
> understand what's going on. He will assume that there are no such 
> documents, which is wrong. If there's a timeout the user will at least 
> notice something is wrong. Besides that, it's the developer's 
> responsibility to get things fast enough. If he decides to do so with a 
> prefix that might be okay for his use case. 

This is clearly not a black-and-white issue.  Can other Lucene 
developers please offer their opinions?

The question is whether the QueryParser should, by default, require a 
one-or-two character prefix match for fuzzy terms, or a zero-character 
prefix, as it does today.

The advantages of a zero-character prefix default are that it's 
backward-compatible and that it will find more matches, namely those 
where the spelling differences are in the first characters.

The disadvantage of a zero-character prefix default is that it performs 
poorly for large collections, requiring perhaps 10 seconds for 
multi-million document collections, considerably slower than any other 
type of query supported by the QueryParser.

Similarly, the advantage of a one-or-two-character prefix default is 
that it will perform much better with larger collections.  And the 
disadvantage is that it is an incompatible change, and it will miss some 
matches, those where the spelling differences are in the first characters.
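
To make the trade-off concrete, here is a rough, self-contained sketch. 
It is not Lucene code: the class FuzzyPrefixSketch, its methods, and the 
term list are all hypothetical. It assumes that fuzzy matching compares 
the query against every candidate term, and it uses a fixed maximum edit 
distance as a stand-in for Lucene's own similarity measure. A required 
prefix shrinks the set of candidates that must be compared, but drops 
terms that differ in the first characters.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical illustration only; none of these names come from Lucene.
public class FuzzyPrefixSketch {

    /** Levenshtein edit distance, computed with two rolling rows. */
    public static int editDistance(String a, String b) {
        int[] prev = new int[b.length() + 1];
        int[] curr = new int[b.length() + 1];
        for (int j = 0; j <= b.length(); j++) {
            prev[j] = j;
        }
        for (int i = 1; i <= a.length(); i++) {
            curr[0] = i;
            for (int j = 1; j <= b.length(); j++) {
                int cost = a.charAt(i - 1) == b.charAt(j - 1) ? 0 : 1;
                curr[j] = Math.min(Math.min(curr[j - 1] + 1, prev[j] + 1),
                                   prev[j - 1] + cost);
            }
            int[] tmp = prev;
            prev = curr;
            curr = tmp;
        }
        return prev[b.length()];
    }

    /** The first prefixLength characters of the query (or all of it, if shorter). */
    static String requiredPrefix(String query, int prefixLength) {
        return query.substring(0, Math.min(prefixLength, query.length()));
    }

    /** Number of terms that share the required prefix and so must be compared. */
    public static int countCandidates(List<String> terms, String query, int prefixLength) {
        String prefix = requiredPrefix(query, prefixLength);
        int count = 0;
        for (String term : terms) {
            if (term.startsWith(prefix)) {
                count++;
            }
        }
        return count;
    }

    /** Terms that share the required prefix and lie within maxEdits of the query. */
    public static List<String> fuzzyMatch(List<String> terms, String query,
                                          int prefixLength, int maxEdits) {
        String prefix = requiredPrefix(query, prefixLength);
        List<String> matches = new ArrayList<>();
        for (String term : terms) {
            if (term.startsWith(prefix) && editDistance(term, query) <= maxEdits) {
                matches.add(term);
            }
        }
        return matches;
    }

    public static void main(String[] args) {
        // Made-up index terms, lowercased as an analyzer typically would.
        List<String> terms = Arrays.asList(
                "fotokopie", "photokopie", "photokopien", "photograph", "kopie");
        for (int prefixLength = 0; prefixLength <= 2; prefixLength++) {
            System.out.println("prefix length " + prefixLength
                    + ": compared " + countCandidates(terms, "photokopie", prefixLength)
                    + " terms, matched " + fuzzyMatch(terms, "photokopie", prefixLength, 2));
        }
    }
}
```

With a zero-character prefix every term in this toy list is compared and 
"fotokopie" is found; with a one-character prefix only the terms starting 
with "p" are compared and "fotokopie" is missed. In a real index the 
candidate set is the whole term dictionary, which is where the cost comes 
from.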

Developers may always change this by calling 
QueryParser.setFuzzyPrefixLength().  So at issue is which behaviour is 
better for developers who do not know of this parameter.  Is it more 
important that their applications perform well or that they find all 
matches to fuzzy queries?

Please offer your opinion and thoughts on this.


