lucene-dev mailing list archives

From Earwin Burrfoot <ear...@gmail.com>
Subject Re: inconsistency/performance trap of empty terms
Date Sat, 30 Oct 2010 11:19:01 GMT
I'd say support them everywhere, and slip LengthFilter into all the
standard Analyzers, so people won't hit empty terms unless they opt in
to them.
This is the most consistent approach.
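
A minimal sketch of that idea in plain Java, assuming tokens are simple strings in a list. It does not use Lucene's classes: `MinLengthFilterSketch` and `dropShortTokens` are hypothetical names, and the real LengthFilter wraps a TokenStream rather than a list.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class MinLengthFilterSketch {
    // Keeps only tokens whose length is at least minLength.
    // With minLength = 1 this drops exactly the empty tokens.
    static List<String> dropShortTokens(List<String> tokens, int minLength) {
        List<String> kept = new ArrayList<>();
        for (String token : tokens) {
            if (token.length() >= minLength) {
                kept.add(token);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("empty", "", "terms", "");
        // Default behaviour: empty tokens never reach the index.
        System.out.println(dropShortTokens(tokens, 1));
        // Opt-in behaviour: a minimum length of 0 lets them through.
        System.out.println(dropShortTokens(tokens, 0));
    }
}
```

In this sketch the opt-in is just a different minimum length; the index itself would still have to handle the empty term wherever it is allowed through.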

On Sat, Oct 30, 2010 at 15:06, Robert Muir <rcmuir@gmail.com> wrote:
> On Sat, Oct 30, 2010 at 7:01 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
>> Mathematically an inverted index is keyed by strings. Any strings.
>> Empty term is just a case of a string of length 0.
>> So, for consistency, Lucene should support them. TermsEnum.seek("")
>> should position you at the very beginning of the terms list, etc.
>> If you drop the support, you have to check for zero length damn
>> eeeeverywhere in the API where you accept terms. Or, thoroughly
>> document unpredictable erratic behaviour :)
>
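
The seek argument in the quoted paragraph can be illustrated with a sorted set standing in for the terms dictionary. This is only a sketch under that assumption: `TermSeekSketch` and its `seek` method are hypothetical and are not Lucene's TermsEnum; they just return the first term that sorts at or after the target.

```java
import java.util.Arrays;
import java.util.TreeSet;

class TermSeekSketch {
    // Returns the smallest term >= target, or null if there is none.
    static String seek(TreeSet<String> terms, String target) {
        return terms.ceiling(target);
    }

    public static void main(String[] args) {
        TreeSet<String> terms = new TreeSet<>(Arrays.asList("apple", "banana"));
        // The empty string sorts before every other string, so seeking to it
        // lands on the first term in the dictionary.
        System.out.println(seek(terms, ""));

        // If the empty term itself is indexed, the same seek finds it exactly.
        terms.add("");
        System.out.println(seek(terms, "").isEmpty());
    }
}
```

No special case is needed here because the empty string is simply the smallest key in the ordering.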
> well, we are checking this already, in a lot of the analyzers.
>
> as i said originally, the biggest problems that we *must* solve are:
> 1. try to prevent the performance trap i mentioned, where people
> create the empty term as a mega-stopword without realizing it.
> 2. fix the analyzers to be consistent with regard to the empty
> term... for example, if we decide the empty term is supported, then
> they shouldn't be arbitrarily removing empty-term tokens.
>
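
A toy illustration of the performance trap in point 1, assuming an analysis chain that leaks empty tokens into every document. The class and method names (`EmptyTermPostingsSketch`, `buildIndex`) are hypothetical, and the map of sorted sets is only a stand-in for a real postings structure.

```java
import java.util.Arrays;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;
import java.util.TreeSet;

class EmptyTermPostingsSketch {
    // Builds term -> sorted doc ids; a doc id is the document's position in the list.
    static Map<String, TreeSet<Integer>> buildIndex(List<List<String>> docs) {
        Map<String, TreeSet<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String token : docs.get(docId)) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        // Each document contains a stray empty token, as a sloppy tokenizer might emit.
        List<List<String>> docs = Arrays.asList(
                Arrays.asList("inverted", "", "index"),
                Arrays.asList("", "terms"),
                Arrays.asList("empty", "", "string"));
        Map<String, TreeSet<Integer>> postings = buildIndex(docs);
        // The empty term matches every document, like an accidental stopword.
        System.out.println("\"\" -> " + postings.get(""));
        System.out.println("index -> " + postings.get("index"));
    }
}
```

The empty term ends up with the longest postings list in the index, which is what makes it behave like a stopword nobody asked for.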
> as far as TermsEnum, i myself have already had to special-case the
> empty term in TermsEnum implementations before... and I'm pretty
> fucking sure that we have long-standing bugs if you have an empty-term
> anywhere in your index (e.g. FuzzyQuery will divide by 0 to scale the
> boost, and you will get a strange exception from your collector
> because it will then have NaN/Inf/some sentinel value).
>
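
The kind of failure described in the quoted paragraph can be reproduced with a simplified, length-normalized similarity. This is an assumed formula for illustration only, not Lucene's actual FuzzyQuery code; `FuzzyScoreSketch` and `similarity` are hypothetical names.

```java
class FuzzyScoreSketch {
    // Simplified similarity: 1 - editDistance / termLength.
    // Floating-point division by zero does not throw in Java; it yields
    // Infinity or NaN, which then flows silently into later scoring code.
    static float similarity(int editDistance, int termLength) {
        return 1.0f - (float) editDistance / termLength;
    }

    public static void main(String[] args) {
        // Ordinary term: one edit over four characters.
        System.out.println(similarity(1, 4));
        // Empty indexed term against a three-character query: 3 / 0.
        System.out.println(similarity(3, 0));
        // Empty indexed term against an empty query: 0 / 0.
        System.out.println(similarity(0, 0));
    }
}
```

Because nothing fails at the point of division, the bad value only surfaces later, for example when a collector tries to compare or sort scores.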
> just saying, it's problematic today; doing nothing and leaving it in the
> messy, ambiguous situation it is in now is not an option.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
> For additional commands, e-mail: dev-help@lucene.apache.org
>
>



-- 
Kirill Zakharenko/Кирилл Захаренко (earwin@gmail.com)
Phone: +7 (495) 683-567-4
ICQ: 104465785
