jackrabbit-dev mailing list archives

From Christoph Kiehl ...@sulu3000.de>
Subject Re: DescendantSelfAxisWeight ChildAxisQuery performance
Date Fri, 30 Nov 2007 08:40:05 GMT
Ard Schrijvers wrote:

> Query q = qm.createQuery("stuff//*[@count]", Query.XPATH);
> if (q instanceof QueryImpl) {
>     // limit the result set
>     ((QueryImpl) q).setLimit(1);
> }
> Since my "stuff//*[@count]" gives me 1.200.000, it makes perfect
> sense to users I think, that even with our patches and a working
> cache, that retaining them all would be slow. But if I set the limit
> to 1 or 10, I would expect to have performance (certainly when you
> have not implemented any AccessManager).
> But, if I set limit to 1, why would we have to check all 1.200.000
> parents whether the path is correct?

I'm not quite sure if this is a valid/common use case. I can't imagine 
doing a query like this without using an "order by" clause. Because 
without an "order by" you will just get a random node. But if you use an 
"order by" you need to get all nodes first anyway.

> If I get a sorted hits by lucene (only on the "//*[@count]" part
> (perhaps with an order by as well), so without the initial path), I
> would want to start with the first one, and check the parent, then
> the second, etc., until I have a hit that is correct according to its
> path. If I have a limit of 10, we would need to get 10 successes.
> Obviously, in the worst case scenario, we would still have to check
> every hit for its parents, but this would be rather exceptional I
> think.

Ok, I see. You would like to check parent-child relations lazily? Well,
this has two drawbacks, I think:

1) The total result size will be very inaccurate until you have fetched the
whole result set. Even now it might be inaccurate because of 
AccessManager checks but doing lazy parent-child relation check will 
make it almost unusable.
2) DescendantSelfAxisQueries and ChildAxisQueries are not only used as a 
final selector but can also be used inside a query like this:

	stuff//*[@bar='text' and @foo/count]

You probably can't calculate @foo/count lazily.
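
To make the lazy idea and drawback 1) concrete, here is a minimal sketch in plain Java. It is not Jackrabbit code: the class LazyPathFilter, the method firstMatches and the use of path strings as hits are all assumptions made for illustration. It walks an already-sorted list of hits, checks the ancestor path only for the hits it actually visits, and stops once the limit is reached.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;

// Hypothetical illustration only; none of these names exist in Jackrabbit.
public class LazyPathFilter {

    /** Number of candidate hits inspected by the last call (kept for illustration). */
    static int inspected = 0;

    /**
     * Returns at most 'limit' hits that are descendants of 'ancestor',
     * checking the path of each hit only when it is reached.
     */
    static List<String> firstMatches(Iterator<String> hits, String ancestor, int limit) {
        List<String> result = new ArrayList<>();
        inspected = 0;
        // Assumption: a hit is a descendant if its path starts with "<ancestor>/".
        String prefix = ancestor.endsWith("/") ? ancestor : ancestor + "/";
        while (hits.hasNext() && result.size() < limit) {
            String path = hits.next();
            inspected++;
            if (path.startsWith(prefix)) {
                result.add(path);
            }
        }
        return result;
    }

    public static void main(String[] args) {
        // Stand-in for the hits Lucene would return for "//*[@count]".
        List<String> hits = Arrays.asList(
                "/other/a", "/stuff/b", "/stuff/c/d", "/other/e", "/stuff/f");
        List<String> top = firstMatches(hits.iterator(), "/stuff", 2);
        System.out.println(top + " after inspecting " + inspected
                + " of " + hits.size() + " hits");
    }
}
```

With a limit of 2 the loop stops after three hits, so it never learns that a third match exists further down the list. That is the inaccurate total result size from point 1); the sketch does not address point 2) at all.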

> and I have > 1.000.000 hits, and I have to wait, even in the cached
> version, a few seconds, but changing "stuff//*[@count]" into
> "//*[@count]" reduces it to a couple of ms, that does not make sense.

I know what you are talking about. That's why I don't use any 
hierarchical queries at all. My queries all look like:

	//element(*, nt:specific-node-type)[@count]

So I'm distinguishing my nodes only by node type or sometimes mixins 
instead of by paths.
I would really love to optimize Jackrabbit's search to make the two
searches you mentioned above perform equally. You would even expect the
first one to be faster because it already reduces the number of
potentially matching nodes.
But I don't think the "lazy" solution will work. WDYT?

