lucene-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Doug Cutting <>
Subject Re: Query optimizer - Cost of Queries
Date Thu, 25 Mar 2004 22:52:31 GMT
Jochen Frey wrote:
> We are in the process of building a query optimizer for Lucene RangeQueries
> (we need that because we run fairly complex Range queries with a few hundred
> terms against large corpuses, and response time needs improvement). We have
> written a framework that allows for traversing queries and rearranging /
> recreating subqueries.
> In a next step, we tried to find criteria to optimize. A Simple one is to
> reduce the total number of terms in the query. 
> Question 1: Is it a good idea to minimize the # of terms.

Yes.  A RangeQuery expands into a BooleanQuery of TermQuerys.

The general cost of executing a TermQuery is something like:

    k_tq + j_tq * IndexReader.docFreq(t)

Where t is the term, and k_tq and j_tq are constants that need to be 
determined.  k_tq mostly represents the cost of looking the term up in 
the dictionary, and j_tq represents the cost of scoring each document.

The cost of a BooleanQuery is something like

    sum ( cost(c) +  k_bq )

Where c is each clause in the query and k_bq is a constant.

You could perform some timing experiments to determine approximate 
values of these constants.

> Some optimization options however leave the choice of which term to reduce.
> In order to make that choice we are using a fairly simple cost estimator for
> queries and terms (currently we only deal with SpanNearQuery, SpanOrQuery
> and SpanTermQuery)
> SpanNearQuery: 10 - #of clauses + total of the cost of all clauses

A worst-case cost function for SpanNearQuery is something like:

   sum (k_snq * cost(c) * log(|q|) )

where |q| is the number of terms in the query.  Things should be better 
than this, however, if the skipTo() optimization fires.  This is 
effective when one or more of the terms occurs less than 1/16th as many 
documents as another, and can provide up-to a 16-fold speedup.

> SpanOrQuery: 10 + total of the cost of all clauses 

   sum ( k_soq * cost(c) * log(|q|) )

But here the skipTo() optimization is not effective.

> SpanTermQuery: 1 over #of characters in the term 

    k_stq + j_stq * sum (TermDocs(t).freq())

> Question 3: How do I get access to Term frequencies (i.e. the number of
> times a given Term appears in the index).


> Question 4: What are good cost estimates assuming that we have term
> frequencies available?

See above.

> And yes, if all of this ends up working we'll make the code available to the
> project.

Great!  I look forward to seeing it!


To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message