lucene-java-user mailing list archives

From "Uwe Schindler" <...@thetaphi.de>
Subject RE: TermRangeQuery performance oddness
Date Tue, 07 May 2013 07:13:52 GMT
Hi,

The problem is by design: Lucene is an inverted index, so lookups can only be done by single
terms, each of which leads to the documents containing that term. To execute a range, the query
first has to position the terms enum on the first term of the range and then iterate over all
*terms* in the range (terms, not documents) until the last term is reached. If the number of
terms in the field is large (because you have many distinct values), this takes some time. For
every term in the enumeration that matches the range, Lucene has to look up all matching
documents in the posting list and report them as hits (using a bitset). The latter (looking up
the posting lists) involves a lot of work, so ranges with thousands of terms will get slow.
This work is done for the whole range before any hit is returned, so limiting the search to
50 results does not reduce it.
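
The steps described above can be sketched in plain Java. This is a toy model, not Lucene's
actual code: the class ToyRangeQuery, its methods, and the sample terms are all made up for
illustration, and a TreeMap stands in for the term dictionary.

```java
import java.util.BitSet;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

/** Toy model of a term range query over an inverted index (hypothetical, not Lucene code). */
final class ToyRangeQuery {

    /** Builds a tiny term dictionary: term -> posting list of document IDs. */
    static TreeMap<String, int[]> sampleIndex() {
        TreeMap<String, int[]> index = new TreeMap<>();
        index.put("Alpha", new int[] {0, 3});
        index.put("Beta", new int[] {1});
        index.put("Mango", new int[] {2, 5});
        index.put("Zebra", new int[] {4});
        return index;
    }

    /** Collects all documents whose term is >= lower and, if upper is non-null, <= upper. */
    static BitSet rangeQuery(TreeMap<String, int[]> index, String lower, String upper) {
        // Step 1: position on the first term >= lower (the "seek").
        SortedMap<String, int[]> tail = index.tailMap(lower);
        BitSet hits = new BitSet();
        // Step 2: walk the terms in sorted order until the upper bound is passed.
        for (Map.Entry<String, int[]> entry : tail.entrySet()) {
            if (upper != null && entry.getKey().compareTo(upper) > 0) {
                break;
            }
            // Step 3: read the posting list of each matching term into the bitset.
            for (int docId : entry.getValue()) {
                hits.set(docId);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        TreeMap<String, int[]> index = sampleIndex();
        System.out.println(rangeQuery(index, "A", null)); // prints {0, 1, 2, 3, 4, 5}
        System.out.println(rangeQuery(index, "Z", null)); // prints {4}
    }
}
```

Note that the whole bitset is filled before any hit can be reported, and that reading it back
yields document IDs in increasing order.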

So the time depends on how many terms are in your term dictionary between the lower bound and
the upper bound of your range, not really on the size of the index (although the two are quite
often directly related).
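
To make the dependence on the term count concrete, here is a small sketch that counts how many
terms an open-ended range has to visit. The class TermsVisited and the generated titles are
assumptions for illustration only; a TreeSet plays the role of the term dictionary.

```java
import java.util.TreeSet;

/** Counts the terms an open-ended range would enumerate (hypothetical helper). */
final class TermsVisited {

    /** Builds a dictionary with the given number of distinct titles per letter A..Z. */
    static TreeSet<String> buildDictionary(int titlesPerLetter) {
        TreeSet<String> terms = new TreeSet<>();
        for (char letter = 'A'; letter <= 'Z'; letter++) {
            for (int i = 0; i < titlesPerLetter; i++) {
                terms.add(letter + "-title-" + i);
            }
        }
        return terms;
    }

    /** Number of terms >= lower, i.e. the terms a range with no upper bound must walk. */
    static int termsAtOrAfter(TreeSet<String> termDictionary, String lower) {
        return termDictionary.tailSet(lower, true).size();
    }

    public static void main(String[] args) {
        TreeSet<String> dictionary = buildDictionary(100);
        System.out.println(termsAtOrAfter(dictionary, "A")); // prints 2600
        System.out.println(termsAtOrAfter(dictionary, "Z")); // prints 100
    }
}
```

In this made-up dictionary the range starting at "A" walks 26 times as many terms as the one
starting at "Z", which matches the direction of the difference reported in the question.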

If you want faster range queries on numeric values, consider NumericRangeQuery, which has some
optimizations that come at the cost of a larger index. But if you are stuck with text, you may
also look at FieldCacheRangeFilter (which only works for untokenized fields, but I assume from
your example that "title" is not tokenized).
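
The general idea behind a field-cache style range filter can be sketched as follows. This is a
simplified model under stated assumptions, not the FieldCacheRangeFilter source: the class
OrdinalRangeFilter and its arrays are hypothetical. Each document stores the ordinal of its
single (untokenized) value in a sorted term array, the bounds are translated to ordinals once,
and each document is then checked with two integer comparisons.

```java
import java.util.Arrays;
import java.util.BitSet;

/** Toy ordinal-based range filter (hypothetical, for illustration only). */
final class OrdinalRangeFilter {

    /**
     * @param sortedTerms all distinct field values in sorted order
     * @param docToOrd    for each document ID, the index of its value in sortedTerms
     * @param lower       inclusive lower bound
     * @param upper       inclusive upper bound, or null for an open-ended range
     */
    static BitSet filter(String[] sortedTerms, int[] docToOrd, String lower, String upper) {
        // Ordinal of the first term >= lower.
        int lowOrd = Arrays.binarySearch(sortedTerms, lower);
        if (lowOrd < 0) {
            lowOrd = -lowOrd - 1;
        }
        // Ordinal of the last term <= upper (or the last term overall if upper is null).
        int highOrd = sortedTerms.length - 1;
        if (upper != null) {
            highOrd = Arrays.binarySearch(sortedTerms, upper);
            if (highOrd < 0) {
                highOrd = -highOrd - 2;
            }
        }
        // One pass over the documents, no posting lists involved.
        BitSet hits = new BitSet(docToOrd.length);
        for (int doc = 0; doc < docToOrd.length; doc++) {
            int ord = docToOrd[doc];
            if (ord >= lowOrd && ord <= highOrd) {
                hits.set(doc);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        String[] sortedTerms = {"Alpha", "Beta", "Mango", "Zebra"};
        int[] docToOrd = {0, 1, 2, 0, 3, 2};
        System.out.println(filter(sortedTerms, docToOrd, "A", null)); // prints {0, 1, 2, 3, 4, 5}
        System.out.println(filter(sortedTerms, docToOrd, "Z", null)); // prints {4}
    }
}
```

In this model the cost grows with the number of documents rather than with the number of terms
in the range, and the ordinal array has to be held in memory.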

The results of a range query are returned in "index order", because there is no TF-IDF ranking
involved (all hits have the same score of 1). Index order means the order in which the
documents were indexed.

-----
Uwe Schindler
H.-H.-Meier-Allee 63, D-28213 Bremen
http://www.thetaphi.de
eMail: uwe@thetaphi.de


> -----Original Message-----
> From: Aleksey [mailto:bittercold@gmail.com]
> Sent: Tuesday, May 07, 2013 4:15 AM
> To: java-user@lucene.apache.org
> Subject: TermRangeQuery performance oddness
> 
> Hi guys,
> 
> If I run 2 term range queries:
> 
> new TermRangeQuery("title", new BytesRef("A"), null, true, true);
> and
> new TermRangeQuery("title", new BytesRef("Z"), null, true, true);
> 
> The one that starts with "Z" is several times faster (I make 1000 queries in a
> loop to measure). I understand that the first one has a much larger number of hits,
> but if the query is bounded to 50 results, why does that matter?
> At first I thought that it grabs all hits and sorts them, but then it doesn't seem
> to make any difference whether or not I pass a sort by "title" parameter to the
> searcher. Results are either sorted or kind of random, but speed is the same.
> Why is that?
> 
> Thank you in advance,
> 
> Aleksey
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org

