lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Preventing "killer" queries
Date Wed, 08 Feb 2006 18:38:01 GMT

: Chris, although I suggested it initially, I'm now a
: little uncomfortable in controlling this issue with a
: static variable in TermQuery because it doesn't let me
: have different settings for different queries, indexes
: or fields.

Oh i totally agree ... it's the kind of thing you'd only want to turn on
as a last resort.  your analogy to BooleanQuery.maxClauseCount was dead on
-- but i think the default usage would probably be reversed.
maxClauseCount is set by default to protect people who may not realize how
big their Wildcard/Prefix/Range queries can get ... more expert users can
disable it once they've made their app smart enough to protect them.
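
for anyone following along, that guard is just a static setter (a minimal
sketch -- the values here are arbitrary examples, not recommendations):

    // by default, BooleanQuery refuses to expand Wildcard/Prefix/Range
    // queries into more than maxClauseCount clauses -- rewrite() throws
    // BooleanQuery.TooManyClauses instead of eating all your memory.
    BooleanQuery.setMaxClauseCount(4096);              // raise the ceiling
    BooleanQuery.setMaxClauseCount(Integer.MAX_VALUE); // or disable the guard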

having a maxDocFreq in TermQuery is the kind of thing you'd probably want
to leave off by default, and only turn on when expert users say "holy
crap, if the term is in more than 75% of the docs, i really don't care
about its scoring."

If you wanted more granular control over it, that could be done by an
external "QueryVisitor" that walks the tree of BooleanQueries (after
rewriting) and, when it finds a TermQuery with a field/term/docCount
triplet it doesn't like, replaces it with a ConstantScore query.
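
in code, that visitor might look something like this (a rough sketch
against the Lucene API of this vintage -- the class name, the simple
global ratio test, and the QueryWrapperFilter wrapping are all just
illustrative choices, not an existing utility):

    import java.io.IOException;

    import org.apache.lucene.index.IndexReader;
    import org.apache.lucene.index.Term;
    import org.apache.lucene.search.BooleanClause;
    import org.apache.lucene.search.BooleanQuery;
    import org.apache.lucene.search.ConstantScoreQuery;
    import org.apache.lucene.search.Query;
    import org.apache.lucene.search.QueryWrapperFilter;
    import org.apache.lucene.search.TermQuery;

    /** walks a (rewritten) query tree, swapping out any TermQuery whose
     *  docFreq is "too big" for a constant scoring version. */
    public class HighDocFreqPruner {
      private final IndexReader reader;
      private final double maxDocFreqRatio; // e.g. 0.75 == 75% of all docs

      public HighDocFreqPruner(IndexReader reader, double maxDocFreqRatio) {
        this.reader = reader;
        this.maxDocFreqRatio = maxDocFreqRatio;
      }

      public Query prune(Query q) throws IOException {
        if (q instanceof TermQuery) {
          Term t = ((TermQuery) q).getTerm();
          // the field/term/docCount test could be arbitrarily smart;
          // a single global ratio is used here for illustration
          if ((double) reader.docFreq(t) / reader.numDocs() > maxDocFreqRatio) {
            // still matches the same docs, but skips tf/idf scoring
            return new ConstantScoreQuery(new QueryWrapperFilter(q));
          }
        } else if (q instanceof BooleanQuery) {
          BooleanQuery in = (BooleanQuery) q;
          BooleanQuery out = new BooleanQuery();
          BooleanClause[] clauses = in.getClauses();
          for (int i = 0; i < clauses.length; i++) {
            out.add(prune(clauses[i].getQuery()), clauses[i].getOccur());
          }
          out.setBoost(in.getBoost());
          return out;
        }
        return q;
      }
    }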

: I like to think of the ideal solution as a control
: which automatically identifies and tunes out what it
: sees as stop words but is controllable on a per index,
: per field and per query basis, if needs be.

see, my only concern with your approach is that you think of these common
words as stop words that can be completely ignored.  if the docCount ==
numDocs then i agree with you (except in the case where the intent was a
MatchAllDocs type query -- but that's a really special case).  I'm
concerned about the idea of treating terms as stop words just because they
are really common.  If I'm indexing a bunch of documents at MegaCorp and
i say i want to treat any word that appears in more than 90% of the docs
as a stop word, so that i don't have to worry about things like people
searching for the word "is" or "MegaCorp" killing performance, i still
have the problem that when people search for "MegaCorp" trying to find
only the docs that mention it (or worse, search for "confidential
-MegaCorp" to find docs MegaCorp shouldn't have access to) the search
doesn't return the results they want at all, because the analyzer has
thrown out the word that's most important to them.

: search time for a range of tested DFs. However, both
: filter and query response times increase in a linear
: fashion with increases in df so I suspect they are
: both ultimately heading for trouble as data volumes

I don't know that you can avoid that -- it has to iterate over the docs
that match to count them, so the more docs there are that match the
longer it takes.
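
to make that concrete: counting the matches for a term boils down to
walking its postings, one step per matching doc, regardless of whether
it's packaged as a Filter or a Query (a sketch against the TermDocs API,
field/term made up):

    // one next() call per matching document -- linear in docFreq
    TermDocs td = reader.termDocs(new Term("country", "france"));
    int count = 0;
    while (td.next()) {
      count++;
    }
    td.close();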

: I did come across a bizarre anomaly I would be
: interested to have explained. A RangeFilter based on a
: single term with 50% df responds in the same time as a
: RangeFilter on a different field for a term with the
: same df.
: When it comes to TermQuerys though, not all fields are
: equal. Using a TermQuery on a "free text" field with
: many values for a single term with 50% df takes half
: the time of a TermQuery on a constrained field
: ("doctype") for a single term with similar df. The
: doctype field only ever has one of 6 possible values.
: Both queries are on the same index, and similar df
: values. The relative performance difference was the
: same for other DFs I tested across the 2 fields.
: What is going on here? If anything, I might have
: expected the open-ended field to be slower.

I only have a hunch -- have you looked at the distribution of docids for
the two terms? ... it's possible that your "doctypes" are evenly
distributed, causing the termDocs iteration to span the entire list of
documents, while the high DF term from your free text field may be
clustered in a particular range of doc ids, letting the termDocs
iteration skip over large chunks (maybe even whole segments).
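
a quick way to test that hunch is to dump the span and density of each
term's postings (a sketch, field/term made up; assumes the term actually
occurs in the index):

    // density near 1.0 means the postings are packed into a narrow band
    // of doc ids; much lower means they're spread across the index.
    TermDocs td = reader.termDocs(new Term("doctype", "invoice"));
    int first = -1, last = -1, df = 0;
    while (td.next()) {
      if (first == -1) first = td.doc();
      last = td.doc();
      df++;
    }
    td.close();
    System.out.println("df=" + df + " span=[" + first + ".." + last +
                       "] density=" + ((double) df / (last - first + 1)));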

I imagine this would be very likely if you indexed your docs in
chronological order and the high df term you were looking at was a
"recent" concept ... i think you mentioned country names in your original
example, right? if you are indexing a list of cities with their
countries, odds are the original data store had them grouped by country,
so if you iterate over that when building your index your doc ids will be
similarly clustered.

like i said ... just a hunch.


-Hoss



