lucene-dev mailing list archives

From Ype Kingma
Subject Re: Query optimizer - Cost of Queries
Date Thu, 25 Mar 2004 22:37:08 GMT

I'm afraid I didn't understand your post fully. Nevertheless,
did you consider adding prefix terms (in a separate field) as normal terms
to your index?
E.g., suppose your terms are numbers ranging from 0000 to 9999. You
could then search the range 0250-0302 using these prefixes, indexed as terms:
025 026 027 028 029 0300 0301 0302
That is 8 terms instead of all 53 terms separately, probably saving quite a few
disk head seeks for the range query.
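
A minimal sketch of how such a range could be split into prefix terms,
assuming fixed-width, zero-padded decimal terms. The function name
prefix_terms and its parameters are made up for illustration and are not
part of Lucene. It only computes the query-side terms; the prefixes would
still have to be indexed in their own field.

```python
from typing import List


def prefix_terms(lo: int, hi: int, width: int = 4) -> List[str]:
    """Cover the inclusive range [lo, hi] with as few prefix terms as possible.

    Hypothetical helper: assumes terms are zero-padded decimal strings of
    the given width, and that every prefix is indexed as a term of its own.
    """
    if not 0 <= lo <= hi < 10 ** width:
        raise ValueError("need 0 <= lo <= hi < 10**width")
    terms = []
    cur = lo
    while cur <= hi:
        # Grow the block (10**k consecutive values) while it stays aligned
        # on a power of ten and does not run past hi. Keep at least one
        # digit so that no prefix is empty.
        k = 0
        while (k + 1 < width
               and cur % 10 ** (k + 1) == 0
               and cur + 10 ** (k + 1) - 1 <= hi):
            k += 1
        digits = str(cur).zfill(width)
        terms.append(digits[:width - k])  # drop the k trailing digits
        cur += 10 ** k
    return terms


print(" ".join(prefix_terms(250, 302)))
```

For the range 0250-0302 this gives the 8 terms listed above.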

How are the ranges and the spans related?

Kind regards,

On Thursday 25 March 2004 18:47, Jochen Frey wrote:
> Hi There!
> We are in the process of building a query optimizer for Lucene RangeQueries
> (we need that because we run fairly complex Range queries with a few
> hundred terms against large corpuses, and response time needs improvement).
> We have written a framework that allows for traversing queries and
> rearranging / recreating subqueries.
> In a next step, we tried to find criteria to optimize. A simple one is to
> reduce the total number of terms in the query.
> Question 1: Is it a good idea to minimize the # of terms?
> Some optimization options however leave the choice of which term to reduce.
> In order to make that choice we are using a fairly simple cost estimator
> for queries and terms (currently we only deal with SpanNearQuery,
> SpanOrQuery and SpanTermQuery)
> SpanNearQuery: 10 - #of clauses + total of the cost of all clauses
> SpanOrQuery: 10 + total of the cost of all clauses
> SpanTermQuery: 1 over #of characters in the term
> Question 2: Does anyone have better cost estimates or comments about this?
> This optimization is all happening client side (i.e. as of the writing of
> this, the optimizer does not know the statistics for tokens actually stored
> in the index).
> Question 3: How do I get access to Term frequencies (i.e. the number of
> times a given Term appears in the index)? I assume that the way to go is
> getTermFreqVectors in IndexWriter. This should allow for better choices as
> to which term to eliminate.
> Question 4: What are good cost estimates assuming that we have term
> frequencies available?
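
For reference, the quoted cost estimates can be written down as a small
sketch. The classes Term, Or and Near are made-up stand-ins for
SpanTermQuery, SpanOrQuery and SpanNearQuery, not Lucene classes, and
"10 - #of clauses + total" is read literally as 10 minus the number of
clauses plus the summed clause costs, which is an assumption about what
was meant.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Term:
    """Stand-in for SpanTermQuery (hypothetical, not the Lucene class)."""
    text: str


@dataclass
class Or:
    """Stand-in for SpanOrQuery (hypothetical)."""
    clauses: List[object]


@dataclass
class Near:
    """Stand-in for SpanNearQuery (hypothetical)."""
    clauses: List[object]


def cost(query) -> float:
    """Estimate the cost of a query using the formulas from the quoted post."""
    if isinstance(query, Term):
        # "1 over # of characters in the term"; an empty term is not handled.
        return 1.0 / len(query.text)
    total = sum(cost(clause) for clause in query.clauses)
    if isinstance(query, Or):
        return 10 + total
    if isinstance(query, Near):
        # Literal reading: 10 - (number of clauses) + (sum of clause costs).
        return 10 - len(query.clauses) + total
    raise TypeError("unsupported query type: %r" % (query,))


query = Near([Term("abcd"), Or([Term("ab"), Term("ab")])])
print(cost(query))
```

Note that this estimator uses no index statistics, as in the quoted post.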
