lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject Re: inconsistency/performance trap of empty terms
Date Sat, 30 Oct 2010 11:06:54 GMT
On Sat, Oct 30, 2010 at 7:01 AM, Earwin Burrfoot <earwin@gmail.com> wrote:
> Mathematically an inverted index is keyed by strings. Any strings.
> Empty term is just a case of a string of length 0.
> So, for consistency, Lucene should support them. TermsEnum.seek("")
> should position you into very beginning of terms list, etc.
> If you drop the support, you have to check zero length damn
> eeeeverywhere in the API where you accept terms. Or, thoroughly
> document unpredictable erratic behaviour :)

well, we are checking this already, in a lot of the analyzers.

as i said originally, the biggest problems that we *must* solve are:
1. try to prevent the performance trap i mentioned, where people
create the empty term as a mega-stopword without realizing it.
2. fix the analyzers to be consistent with regard to the empty
term... for example, if we decide the empty term is supported, then
they shouldn't be arbitrarily removing empty-term tokens.
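
To make the trap in point 1 concrete, here is a minimal sketch using a toy in-memory inverted index, not Lucene's API. The class EmptyTermTrap, the stripNonLetters filter and the sample documents are all hypothetical; the assumption is an analysis chain that strips characters from tokens but never drops the tokens that end up empty.

```java
import java.util.List;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

public class EmptyTermTrap {
    // Hypothetical token filter: removes every non-letter character.
    // Tokens made only of digits or punctuation become the empty string.
    static String stripNonLetters(String token) {
        return token.replaceAll("[^\\p{L}]", "");
    }

    // Toy inverted index: term -> sorted doc IDs. Empty terms are NOT dropped.
    static Map<String, SortedSet<Integer>> index(List<String> docs) {
        Map<String, SortedSet<Integer>> postings = new TreeMap<>();
        for (int docId = 0; docId < docs.size(); docId++) {
            for (String raw : docs.get(docId).split("\\s+")) {
                String term = stripNonLetters(raw);
                postings.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
            }
        }
        return postings;
    }

    public static void main(String[] args) {
        List<String> docs = List.of(
                "order 1234 shipped",
                "call 555-0100 now",
                "total: 42 !!");
        Map<String, SortedSet<Integer>> postings = index(docs);
        // The empty term shows up in every document, like a stopword would.
        System.out.println("docs containing \"\": " + postings.get(""));
        System.out.println("docs containing \"order\": " + postings.get("order"));
    }
}
```

In this toy data every document contains at least one token that the filter reduces to "", so the empty term gets a posting in every document while ordinary terms stay sparse. That is the mega-stopword shape described in point 1.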

as far as TermsEnum, i myself have already had to special-case the
empty term in TermsEnum implementations before... and I'm pretty
fucking sure that we have long-standing bugs if you have an empty-term
anywhere in your index (e.g. FuzzyQuery will divide by 0 to scale the
boost, and you will get a strange exception from your collector
because it will then have NaN/Inf/some sentinel value).
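
A minimal sketch of how that kind of bug can stay hidden, assuming a length-normalized similarity of the form 1 - distance/length. This is a hypothetical formula and class, not FuzzyQuery's actual code. The point is only that Java float division by zero does not throw; it quietly yields NaN or an infinity that surfaces later.

```java
public class ZeroLengthBoost {
    // Hypothetical similarity, normalized by term length (an assumption,
    // not Lucene's formula). A zero-length term makes the divisor 0.
    static float similarity(int editDistance, int termLength) {
        return 1.0f - ((float) editDistance / termLength);
    }

    public static void main(String[] args) {
        float normal = similarity(1, 5);        // ordinary term, finite score
        float emptyExact = similarity(0, 0);    // 0f / 0 is NaN
        float emptyFuzzy = similarity(2, 0);    // 2f / 0 is +Infinity, so 1 - that is -Infinity

        System.out.println("normal term:      " + normal);
        System.out.println("empty, dist 0:    " + emptyExact);
        System.out.println("empty, dist 2:    " + emptyFuzzy);

        // No ArithmeticException is thrown for float division by zero.
        // Every ordered comparison against NaN is false, so score-based
        // logic downstream silently misbehaves.
        System.out.println("NaN > 0.5f ?      " + (emptyExact > 0.5f));
        System.out.println("NaN <= 0.5f ?     " + (emptyExact <= 0.5f));
    }
}
```

Because both comparisons against NaN are false, anything that ranks or filters by score sees a value that is neither above nor below any threshold, which fits the strange downstream failure described above.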

just saying, it's problematic today, and doing nothing and leaving it
in the messy, ambiguous situation it is in now is not an option.


