jackrabbit-users mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Marcel Reutegger <marcel.reuteg...@gmx.net>
Subject Re: Query performances
Date Mon, 19 Mar 2007 09:39:53 GMT
Hi Alessandro,

Alessandro Bologna wrote:
> We have been incurring in an interesting behavior doing searches on a quite
> large repository (~1,000,000 nodes).
> The test data is made of a tree of nodes of type nt:unstructured, reference
> able, with two numeric properties (a sequential count of the node and a
> random number between 0 and the count). Each node has a reference to the
> parent, and up to 100 child nodes, and is named n<m> where m is the 
> index of
> the node, related to the parent node.
> So, for instance, /load/n0 is the first node, /load/n1 the second to
> /load/n99.
> Then each one of them has 100 children and so on, so that a valid path, for
> instance, is /load/n23/n34/n50.
> One node out of 6 has attached a nt:file node as well, in order to test 
> full
> text searches. If requested, I can provide the code to create the test set.
> The strange behavior that prompted me to write to this mailing list, is the
> following:
> Say that I am searching for a node that contains the word 'beatles' at some
> level under the node /load/n40 and I use the following query:
> */jcr:root/load/n40//*[jcr:contains(.,'beatles')]* the execution time is
> 1672ms
> If I use instead:
> */jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')]*  the execution time
> is 19749ms
> The second query, in theory, could execute faster than the first, because I
> am providing more information (only nodes at the 4th level under /load/n40)
> but takes 10 times longer to execute.
> Is there a reason why?

there are basically two reasons why the second query takes more time to execute:
- the index does not contain depth (level) information of a node. the depth of a 
node is not stable and may change even if the node itself is not changed. if a 
subtree of nodes is moved to another location the depth of all nodes in the 
subtree changes. the query handler would have to re-index the whole subtree.
- multiple child axis with just a * as name test are not optimized. 
/jcr:root/load/n40/*/*/*/*[jcr:contains(.,'beatles')] is translated into 
multiple ChildAxisQuerys each resolves the context nodes and provides a new 
context with the nodes that are the child nodes of the previous context. 
internally the query handler will temporarily have a set of nodes that includes 
all nodes at level 4 under /jcr:root/load/n40. for the query 
/jcr:root/load/n40//*[jcr:contains(.,'beatles')] the index will look up the 
nodes that match the fulltext condition and then filter out the ones that do not 
have /jcr:root/load/n40 as an ancestor. that operation involves less nodes and 
executes faster.

> The other, way more worrisome problem, appears to be the opposite:
> I have executed the following two queries
> /jcr:root/load/n50/n2/* ==> 931ms
> /jcr:root/load/n50/n2/*/* ==> 661ms

that's indeed strange. maybe you get this result because the cache is filled up 
by the first query and the second one can take advantage of the pre-filled 
cache. can you please run those queries a second time just to make sure that 
both run against the same cache state?

> The first is returning all nodes one level below /load/n50/n2 and the 
> second
> two levels below. There are no other nodes under that.
> When I tried the following query, which would return the same nodes in one
> operation, the result was surprising (in a bad way)
> /jcr:root/load/n50/n2//* ==>*353769ms*
> **
> The CPU goes 100%, I see in the jackrabbit logs a lot of entries similar 
> to:
> DocNumberCache: size=1024/1024, #accesses=17039, #hits=167, #misses=16872,
> cacheRatio=1% (DocNumberCache.java, line 155)
> and then finally, *some 5 minutes later*, I get the result.
> Even if I restrict the query, it still takes the same time:
> /jcr:root/load/n50/n2/m96//* and there's maybe only an hundred nodes under
> that.

unfortunately those are queries that are not optimized at all and will result in 
a full index traversal. see below for a workaround.

> I have the exact same behavior if I try with the SQL syntax: select * from
> nt:base where jcr:path like '/load/n50/n2/n96/%'

that's because this query is equivalent to the above XPath statement and if 
that's the case the lucene query, which is executed ultimately, is the same.

> The version of JR is 1.2.2. The backend is Oracle 10g, and I am running the
> application on Tomcat 5.5 with jdk 1.5 and 1GB assigned to the JVM (on
> Windows)
> Does anybody have any idea on why is this happening and if there is a
> workaround?

I thought no one would actually execute such queries and didn't bother to 
optimize them because there's a simple workaround:

Node rootNode = ...
rootNode.getNode("load/n50/n2").accept(new TraversingItemVisitor.Default() {
             protected void entering(Node node, int level)
                                  throws RepositoryException {

If you feel this should be done efficiently by the query handler please file a 
jira issue. Thanks.


View raw message