lucene-dev mailing list archives

From Chris Hostetter <hossman_luc...@fucit.org>
Subject Re: Preventing "killer" queries
Date Wed, 08 Feb 2006 01:00:07 GMT

Mark,

I know you've already committed a patch along these lines (LUCENE-494) and
I can see how in a lot of cases that would be a great solution, but I'm
still interested in the original idea you proposed (a 'maxDf' in
TermQuery), because I anticipate situations in which you don't want to
ignore the common term at query time (because you want it to affect the
result set); you just don't want to spend a lot of time calculating its
score contribution since it's so common -- perhaps even if an
optimization can get the time down, you don't want its score included
because it's so common.

If I understand your description of the problem, in your profiling you've
confirmed that when a term is extremely common, the "tf" portion of the
calculation for each doc is expensive because of the underlying call to
TermDocs.read(int[],int[]) ... is that correct?
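(To make the cost concern concrete, here's a standalone toy -- not Lucene code; the class and method names below are invented for illustration -- modeling the scorer loop that walks the parallel docs/freqs arrays filled by TermDocs.read(int[],int[]). The point is just that the per-term scoring work is linear in document frequency:)

```java
import java.util.Arrays;

public class TfCostSketch {
    // Toy stand-in for the scorer: walks parallel docs/freqs arrays
    // (as filled by a bulk postings read) and computes a tf
    // contribution per matching document.
    static long operations = 0;

    static float scoreAll(int[] docs, int[] freqs) {
        float total = 0f;
        for (int i = 0; i < docs.length; i++) {
            operations++;                         // one unit of work per posting
            total += (float) Math.sqrt(freqs[i]); // classic tf = sqrt(freq)
        }
        return total;
    }

    public static void main(String[] args) {
        int docFreq = 1_000_000;   // an extremely common term
        int[] docs = new int[docFreq];
        int[] freqs = new int[docFreq];
        Arrays.fill(freqs, 1);
        operations = 0;
        scoreAll(docs, freqs);
        // Work done is proportional to docFreq -- the cost profiled above.
        System.out.println(operations == docFreq);
    }
}
```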

If that's the case, then it seems like a fairly straightforward and useful
patch would be to add the following (untested) to TermQuery...

    private static int maxDocFreq = Integer.MAX_VALUE;
    private static float maxDocFreqRawScore = 0.0f;
    public static void setMaxDocFreqScore(int df, float rawScore) {
        maxDocFreq = df;
        maxDocFreqRawScore = rawScore;
    }
    public Query rewrite(IndexReader reader) throws IOException {
       if (maxDocFreq < reader.docFreq(term)) {
          // should be a ConstantScoreTermQuery, but that doesn't exist
          Query q = new ConstantScoreRangeQuery(term.field(), term.text(),
                                                term.text(), true, true);
          q.setBoost(maxDocFreqRawScore);
          return q.rewrite(reader);
       }
       return this;
    }
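(For anyone reading outside a Lucene source tree, here's a standalone toy model of the intended rewrite behaviour. None of the nested classes below are real Lucene types -- the Map stands in for IndexReader.docFreq() -- but it shows the shape: a term whose df exceeds the cap collapses to a flat constant-score query, everything else rewrites to itself:)

```java
import java.util.Map;

public class MaxDfRewriteSketch {
    // Minimal invented stand-ins for Query/TermQuery/ConstantScoreQuery.
    interface Query {}
    static class ConstantScoreQuery implements Query {
        final float score;
        ConstantScoreQuery(float score) { this.score = score; }
    }
    static class TermQuery implements Query {
        static int maxDocFreq = Integer.MAX_VALUE;
        static float maxDocFreqRawScore = 0.0f;
        final String term;
        TermQuery(String term) { this.term = term; }

        static void setMaxDocFreqScore(int df, float rawScore) {
            maxDocFreq = df;
            maxDocFreqRawScore = rawScore;
        }

        // docFreqs plays the role of IndexReader.docFreq(term).
        Query rewrite(Map<String, Integer> docFreqs) {
            int df = docFreqs.getOrDefault(term, 0);
            if (maxDocFreq < df) {
                // Too common: skip per-document scoring, use a flat score.
                return new ConstantScoreQuery(maxDocFreqRawScore);
            }
            return this; // rare enough: score normally
        }
    }

    public static void main(String[] args) {
        Map<String, Integer> dfs = Map.of("the", 900_000, "aardvark", 12);
        TermQuery.setMaxDocFreqScore(10_000, 0.1f);
        // Common term collapses; rare term is left alone.
        System.out.println(new TermQuery("the").rewrite(dfs)
                           instanceof ConstantScoreQuery);
        System.out.println(new TermQuery("aardvark").rewrite(dfs)
                           instanceof TermQuery);
    }
}
```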


...the downside compared to your existing approach is that it still
spends some time on the really common terms (building up the filter), so
if you truly want to ignore them the analyzer is a better way to go --
but the upside is that it would still allow those really common terms to
affect the result set.


   thoughts?



: Date: Tue, 07 Feb 2006 20:18:27 +0000
: From: markharw00d <markharw00d@yahoo.co.uk>
: Reply-To: java-dev@lucene.apache.org
: To: java-dev@lucene.apache.org
: Subject: Re: Preventing "killer" queries
:
: [Answering my own question]
:
: I think a reasonable solution is to have a generic analyzer for use at
: query-time that can wrap my application's choice of analyzer and
: automatically filter out what it sees as stop words. It would initialize
: itself from an IndexReader and create a StopFilter for those terms
: greater than a given document frequency.
:
: This approach seems reasonable because:
: a) The stop word filter is automatically adaptive and doesn't need
: manual tuning.
: b) I can live with the disk space overhead of the few "killer" terms
: which will make it into the index.
: c) "Silent" failure (ie removal of terms from query) is probably
: generally preferable to the throw-an-exception approach taken by
: BooleanQuery if clause limits are exceeded.
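(A sketch of the quoted idea, independent of Lucene's Analyzer/StopFilter classes -- the Map again stands in for IndexReader term statistics, and all names here are invented: build the stop set from document frequencies at a threshold, then silently drop those terms from the query tokens rather than throwing:)

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

public class AdaptiveStopSketch {
    // Build a stop set of every term whose document frequency exceeds
    // the threshold; docFreqs stands in for IndexReader.docFreq().
    static Set<String> buildStopSet(Map<String, Integer> docFreqs, int maxDf) {
        Set<String> stop = new HashSet<>();
        for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
            if (e.getValue() > maxDf) stop.add(e.getKey());
        }
        return stop;
    }

    // "Silent" removal: drop stopped terms from the query tokens
    // rather than throwing, mirroring point (c) above.
    static List<String> filter(List<String> tokens, Set<String> stop) {
        List<String> kept = new ArrayList<>();
        for (String t : tokens) {
            if (!stop.contains(t)) kept.add(t);
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> dfs = Map.of("the", 900_000, "lucene", 42);
        Set<String> stop = buildStopSet(dfs, 10_000);
        System.out.println(filter(List.of("the", "lucene"), stop)); // [lucene]
    }
}
```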



-Hoss


---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

