jackrabbit-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject RE: Re: DescendantSelfAxisWeight ChildAxisQuery performance
Date Fri, 30 Nov 2007 09:24:41 GMT

> Ard Schrijvers wrote:
> 
> > Query q = qm.createQuery("stuff//*[@count]", Query.XPATH); if (q 
> > instanceof QueryImpl) {
> >     // limit the result set
> >     ((QueryImpl) q).setLimit(1);
> > }
> > 
> > Since my "stuff//*[@count]" gives me 1.200.000, it makes 
> perfect sense 
> > to users I think, that even with our patches and a working 
> cache, that 
> > retaining them all would be slow. But if I set the limit to 
> 1 or 10, I 
> > would expect to have performance (certainly when you have not 
> > implemented any AccessManager).
> > 
> > But, if I set limit to 1, why would we have to check all 1.200.000 
> > parents wether the path is correct?
> 
> I'm not quite sure if this is a valid/common use case. I 
> can't imagine doing a query like this without using an "order 
> by" clause. Because without an "order by" you will just get a 
> random node. But if you use an "order by" you need to get all 
> nodes first anyway.

This is not my point. Wether you have an order by or not, lucene will
compute the score of all hits anyway. So, no order by, does not mean
that lucene does not order: it orders on score (but ofcourse you already
know that :-) )
So, my thing holds with and without order by. 

> 
> > If I get a sorted hits by lucene (only on the "//*[@count]" part 
> > (perhaps with an order by as well), so without the initial path), I 
> > would want to start with the first one, and check the 
> parent, then the 
> > second, etc, untill I have a hit that is correct according 
> its path. 
> > If I have a limit of 10, we would need to get 10 successes.
> > Obviously, in the worst case scenario, we would still have to check 
> > every hit for its parents, but this would be rather exceptional i 
> > think.
> 
> Ok, I see. You would like to check parent-child relations 
> lazily? Well this has to drawbacks I think:

Yes, I certainly was aware of this already :-) (already implied the
difficulties)
 
> 1) The total result size will be very inaccurate until you 
> fetched the whole result set. Even now it might be inaccurate 
> because of AccessManager checks but doing lazy parent-child 
> relation check will make it almost unusable.

You might warn that fetching a total result size is slow. Without having
to know the total, it should not have to be slow.

> 2) DescendantSelfAxisQueries and ChildAxisQueries are not 
> only used as a final selector but can also be used inside a 
> query like this:
> 
> 	stuff//*[@bar='text' and @foo/count]
> 
> You probably can't calculate @foo/count lazyily.

@foo/count should probably be foo/@count isn't? I haven't yet used
DescendantSelfAxisQueries and ChildAxisQueries in these kind of queries,
but I see your point

> 
> > and I have > 1.000.000 hits, and I have to wait, even in the cached 
> > version, a few seconds, but changing "stuff//*[@count]" into 
> > "//*[@count]" reduces it to a couple of ms, that does not 
> make sense.
> 
> I know what you are talking about. That's why I don't use any 
> hierarchical queries at all. My queries all look like:
> 
> 	//element(*, nt:specific-node-type)[@count]
> 
> So I'm distinguishing my nodes only by node type or sometimes 
> mixins instead of by paths.

I already understood that (aware) people are using it like this (but
what about the unaware people). But, suppose I have articles in
different languages, with different initials paths, and I want the 10
lastmodified from some language. It doesn't make sense that I need to
make articles for every language a different nodetype, because of
DescendantSelfAxisQueries and ChildAxisQueries.

> I would really love to optimize Jackrabbits search to make 
> the two searches you mentioned above perform equally. You 
> would even expect the
>   second one to be faster because it already reduces the 
> number of potentially matching nodes.

Well, lucene would indeed be faster with the second, if it had 'paths'
in its index, but we really don't want that because we cannot easily
move nodes anymore. 

> But I don't think the "lazy" solution will work. WDOT?

I also have the idea that it will at least be extremely hard, *but*, I
also wanted to emphasize that if we just look at the problem from a
birds eye view, we must agree that checking all parent paths doesn't
really make sense in some cases (certainly when the number of hits is
very large)

Anyway, perhaps we just have to think a little harder. Not everything
has to be simple :-) 

Regards Ard

> 
> Cheers,
> Christoph
> 
> 

Mime
View raw message