lucene-dev mailing list archives

From Robert Muir <rcm...@gmail.com>
Subject inconsistency/performance trap of empty terms
Date Wed, 27 Oct 2010 10:29:11 GMT
Hello,

Mike and I were discussing some very unrelated stuff and the question
of how to handle the empty term came up...

I started thinking about this email:
http://www.lucidimagination.com/search/document/a8d3a8647e581a5b/patternreplacefilterfactory_creating_empty_string_as_a_term#f43d167b91c2ba07

So, looking through the analyzers, I think we should make a decision
about what to do with empty terms.
In my opinion there is a performance trap here, that might work like this:

1. A user, particularly a Solr user, is using a combination of
tokenizers/filters and ends up with the "empty term" as basically a
mega-stopword, which is what happened to that user.
2. Because of this, their queries have terrible performance, especially
if they are 'auto-generating phrase queries' (the Solr default).
3. But it's not possible for anyone to really rely on the
analyzers handling empty terms correctly, because we are so
inconsistent about it.
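
To make the trap concrete, here is a minimal Python sketch (not Lucene
code). The helper names, the regex, and the sample documents are all
hypothetical; the point is only that a replace-style filter can turn
punctuation-only tokens into the empty term, which then shows up in most
documents like a stopword would.

```python
import re
from collections import Counter


def whitespace_tokenize(text):
    # Toy stand-in for a whitespace tokenizer.
    return text.split()


def pattern_replace(tokens, pattern, replacement=""):
    # Toy stand-in for a pattern-replace filter: it rewrites each token
    # but never drops one, so a token can shrink to the empty string.
    compiled = re.compile(pattern)
    return [compiled.sub(replacement, token) for token in tokens]


# Hypothetical documents; three of the four contain a punctuation-only token.
docs = ["foo - bar", "baz -- qux", "a / b", "plain text"]

# Document frequency per term: in how many documents each term occurs.
doc_freq = Counter()
for doc in docs:
    terms = set(pattern_replace(whitespace_tokenize(doc), r"[^A-Za-z0-9]+"))
    doc_freq.update(terms)

most_common_term, most_common_df = doc_freq.most_common(1)[0]
```

In this toy corpus the empty string ends up with the highest document
frequency of any term, which is the "mega-stopword" behaviour described
above.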

Just taking a quick glance through the analyzers, I noticed each one
seems to have willy-nilly code/TODOs regarding this empty term.
For example, the n-gram-ish tokenizers and filters such as CJKTokenizer,
CommonGramsFilter, NGramTokenizer, etc. explicitly avoid creating
these.

But there are inconsistencies:
TrimFilter explicitly creates/maintains empty terms.
NGramFilter doesn't seem to have this check, but the NGramTokenizer does.
PatternFilter documents that it might create empty terms, but the
PatternTokenizer avoids them.
I am sure some of the stemmers create empty terms in some
situations (e.g. one removes an -alization suffix but has no length
check, so if the term is "alization" it produces an empty term).
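
The stemmer case can be sketched in a few lines of Python. This is a
hypothetical suffix stripper written for illustration, not the code of
any actual Lucene stemmer; the function names and the min_stem_length
parameter are assumptions.

```python
def strip_suffix_unchecked(term, suffix="alization"):
    # No length check: a term equal to the suffix collapses to "".
    if term.endswith(suffix):
        return term[: len(term) - len(suffix)]
    return term


def strip_suffix_checked(term, suffix="alization", min_stem_length=1):
    # Only strip when at least min_stem_length characters would remain.
    if term.endswith(suffix) and len(term) - len(suffix) >= min_stem_length:
        return term[: len(term) - len(suffix)]
    return term


unchecked = strip_suffix_unchecked("alization")
checked = strip_suffix_checked("alization")
```

With the guard in place the term "alization" is left alone, while a longer
term such as "normalization" is still stripped as before.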

Anyway, I think it's possible other users are in this same
situation, with slow performance, and not even realizing it yet...
Obviously they can fix this if they go and add LengthFilter, but
should we be doing something different?
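
For reference, the workaround amounts to something like the following
Python sketch. It is a toy stand-in for a length-based filter at the end
of the chain, not the LengthFilter API itself; the function name and the
default bounds are assumptions.

```python
def length_filter(terms, min_length=1, max_length=255):
    # Keep only terms whose length falls within [min_length, max_length];
    # with min_length=1 this drops every empty term.
    return [term for term in terms if min_length <= len(term) <= max_length]


# Hypothetical output of an analysis chain that leaked empty terms.
terms = ["foo", "", "bar", ""]
kept = length_filter(terms)
```

Note that this only helps users who know to add it, which is the concern
raised above.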
