jackrabbit-users mailing list archives

From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Query performances
Date Wed, 28 Mar 2007 09:08:08 GMT
Hi Alessandro,

Alessandro Bologna wrote:
> Now I have found another unusual behavior, and I was hoping you could
> explain this too...
> These queries have been executed in sequence (without restarting):
> Executing query: /jcr:root/load/n10/n33/*[@random>10000]
> Query execution time:10245ms
> Number of nodes:91
> Executing query: /jcr:root/load/n10/n33/*[@random>10000 and 
> @random<10000000]
> Query execution time:20409ms
> Number of nodes:91
> Executing query: /jcr:root/load/n10/n33/*[@random>10000 and
> @random<10000000 and @random<10000001]
> Query execution time:30053ms
> Number of nodes:91
> I think that the execution time on the first query is already quite high
> (an equality query takes just a few milliseconds),

This has already been improved with http://issues.apache.org/jira/browse/JCR-804

> but what I am more
> disconcerted about is that the second query (with two conditions, the second
> being a 'dummy' one since it is true for each of the 91 nodes returned by
> the first query) takes double the time, and the third query (with the third
> condition being basically the same as the first one) takes three times as
> much.
> Typically I would expect an 'and' query to be executed on the results of
> the first one, and therefore to take just a little bit less.
> So the questions are:
> 1. why does it take so long to find 91 nodes in the first query?

this is caused by:
- MultiTermDocs is expensive on large value ranges (-> fixed in JCR-804)
- @random>10000 (probably) selects a great number of nodes, which are later 
excluded again because of the path constraint

> 2. why do the second and third queries take as much time as the first
> multiplied by the number of expressions?

each of the expressions is evaluated independently and in a second step 'and'ed 
together. therefore the predominant cost in your query seems to be the individual 
expressions. because each of the range expressions selects a lot of nodes, lucene 
cannot optimize the execution well. see below for a workaround.
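this cost model can be illustrated with a toy sketch in plain Java (this is not 
Jackrabbit's actual query code, just a model of the behaviour described above): 
each expression is one full pass over the values, and the hit sets are only 
intersected afterwards, so three expressions cost roughly three passes even 
though two of them are redundant.

```java
import java.util.BitSet;
import java.util.Random;
import java.util.function.IntPredicate;

public class AndCost {

    static int scans = 0;

    // evaluate one predicate over every value -- one full pass,
    // standing in for one independently evaluated query expression
    static BitSet evaluate(int[] values, IntPredicate p) {
        BitSet hits = new BitSet(values.length);
        for (int i = 0; i < values.length; i++) {
            scans++;
            if (p.test(values[i])) {
                hits.set(i);
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        int n = 100_000;
        int[] random = new Random(42).ints(n, 0, 20_000_000).toArray();

        // the three conditions from the example query
        BitSet a = evaluate(random, v -> v > 10_000);
        BitSet b = evaluate(random, v -> v < 10_000_000);
        BitSet c = evaluate(random, v -> v < 10_000_001);

        a.and(b);  // the hit sets are only 'and'ed afterwards, so the
        a.and(c);  // redundant conditions do not reduce the work done

        System.out.println("passes over the data: " + scans / n);
        // prints "passes over the data: 3"
    }
}
```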

> 3. is there a workaround to do range queries?

partitioning the random property into multiple properties may help. the basic 
idea is that you split the random number digit by digit, with one property per 
decimal place.

@random = 34045

would become:

@random1 = 5
@random10 = 4
@random100 = 0
@random1000 = 4
@random10000 = 3

later, if you search for all random properties with a value of 12000 or larger, 
you would have the query:
//*[(@random10000 = 1 and @random1000 >= 2) or (@random10000 >= 2)]

because each of the split-up properties has only ten distinct values (0-9), 
lucene can optimize the query execution much better.
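a sketch of this workaround in plain Java (independent of the JCR API; the 
property names random1 ... random10000 follow the scheme above). it splits a 
value into per-place digits and builds the matching XPath predicate by comparing 
the digit properties lexicographically from the highest place down; for 
thresholds whose lower digits are all zero, the result is equivalent to the 
shorter two-branch form shown above.

```java
import java.util.ArrayList;
import java.util.List;

public class DigitPartition {

    // split a value into one digit per decimal place, least significant first:
    // digits(34045, 5) -> {5, 4, 0, 4, 3}, i.e. random1=5, random10=4, ...
    static int[] digits(int value, int places) {
        int[] d = new int[places];
        for (int i = 0; i < places; i++) {
            d[i] = value % 10;
            value /= 10;
        }
        return d;
    }

    // build an XPath predicate matching partitioned values >= threshold:
    // one branch per place, each requiring equality on all higher places
    static String atLeast(int threshold, int places) {
        int[] d = digits(threshold, places);
        List<String> branches = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        long place = 1;
        for (int i = 1; i < places; i++) {
            place *= 10;           // place value of the most significant digit
        }
        for (int i = places - 1; i >= 0; i--, place /= 10) {
            String op = (i == 0) ? " >= " : " > ";
            branches.add("(" + prefix + "@random" + place + op + d[i] + ")");
            prefix.append("@random").append(place)
                  .append(" = ").append(d[i]).append(" and ");
        }
        return String.join(" or ", branches);
    }

    public static void main(String[] args) {
        // each branch touches only digit properties with values 0-9
        System.out.println("//*[" + atLeast(12000, 5) + "]");
    }
}
```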

