jackrabbit-dev mailing list archives

From "Ard Schrijvers" <a.schrijv...@hippo.nl>
Subject DescendantSelfAxisWeight ChildAxisQuery performance
Date Thu, 29 Nov 2007 09:03:51 GMT


Jukka pointed out that generic discussions belong on the dev list rather than in JIRA issues,
so I'll repeat my last comment from JCR-1213 here. JCR-1196 is related as well [Queries for
DescendantSelfAxisWeight/ChildAxisQuery are currently very heavy and become slow pretty quickly].

Background: Marcel, Christoph and I have been working to fix the DescendantSelfAxisWeight/ChildAxisQuery
cache for the hierarchy resolver. This cache seems to be fixed now (not yet in trunk), but during
tests I noticed some things that seem illogical to me (what follows is copied from JCR-1213).

During the tests (with the fixed hierarchy cache) on a repository with 1.200.000 nodes,
I realized we are still doing something 'irrational'. I don't think it will be easy to fix,
because it also depends on whether people have implemented an AccessManager, but take the
following test:

Query q = qm.createQuery("stuff//*[@count]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}

Since "stuff//*[@count]" matches 1.200.000 nodes, I think users will accept that retrieving
them all is slow, even with our patches and a working cache. But if I set the limit to 1 or 10,
I would expect it to be fast (certainly when no AccessManager has been implemented).

But if I set the limit to 1, why do we have to check the parents of all 1.200.000 hits to
see whether the path is correct?

Suppose Lucene gives me sorted hits for the "//*[@count]" part only (perhaps with an order by
as well), so without the initial path. I would then want to start with the first hit and check
its parents, then the second, and so on, until I have a hit whose path is correct.
With a limit of 10, we would need 10 such successes. Obviously, in the worst case
we would still have to check the parents of every hit, but I think that would be rather
exceptional.
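
A minimal sketch of that lazy check in plain Java, to make the idea concrete. It is not
Jackrabbit code: LazyPathFilter, firstMatches and hasAncestor are hypothetical names, and the
parentOf map is only an assumed stand-in for whatever the hierarchy resolver provides. Hits
are taken in the order Lucene returned them, and parents are only walked until the limit is
reached.

```java
import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;
import java.util.Map;

// Hypothetical illustration only; none of these names exist in Jackrabbit.
class LazyPathFilter {

    // Walks up the parent chain of 'node' and reports whether 'ancestor' is on it.
    // 'parentOf' maps a node id to its parent id; the root has no entry.
    static boolean hasAncestor(String node, String ancestor, Map<String, String> parentOf) {
        String current = parentOf.get(node);
        while (current != null) {
            if (current.equals(ancestor)) {
                return true;
            }
            current = parentOf.get(current);
        }
        return false;
    }

    // Consumes hits in their given order and stops as soon as 'limit' hits
    // with the required ancestor have been found. Remaining hits are never checked.
    static List<String> firstMatches(Iterator<String> hits, String ancestor,
                                     Map<String, String> parentOf, int limit) {
        List<String> result = new ArrayList<>();
        while (result.size() < limit && hits.hasNext()) {
            String hit = hits.next();
            if (hasAncestor(hit, ancestor, parentOf)) {
                result.add(hit);
            }
        }
        return result;
    }
}
```

With a limit of 1, the loop stops at the first hit that lies below the given ancestor. Only
when few or no hits lie below it does every hit get checked, which is the worst case described
above.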

Of course, when people have a custom AccessManager implementation, you only know after the
access manager has run whether a hit is a real hit. But when I have

Query q = qm.createQuery("stuff//*[@count]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}

and there are > 1.000.000 hits, I have to wait a few seconds, even in the cached version,
while changing "stuff//*[@count]" into "//*[@count]" brings it down to a couple of ms. That
does not make sense.

I think we should consider whether we could implement DescendantSelfAxisQuery and ChildAxisQuery
as some sort of lazy filter. When users also want the total number of hits for "stuff//*[@count]",
we obviously still face a slow query. WDYT? This thought might well belong in a
new JIRA issue, or in the existing one about DescendantSelfAxisQuery and ChildAxisQuery.



Regards Ard
