jackrabbit-dev mailing list archives

From "Ard Schrijvers (JIRA)" <j...@apache.org>
Subject [jira] Commented: (JCR-1213) UUIDDocId cache does not work properly because of weakReferences in combination with new instance for combined indexreader
Date Wed, 28 Nov 2007 23:37:43 GMT

    [ https://issues.apache.org/jira/browse/JCR-1213?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12546483 ]

Ard Schrijvers commented on JCR-1213:
-------------------------------------

I have tested all patches. All three patches seem to have similar performance improvements
(compared to the former code they are all *very* much faster in the cached version). ATM I
am reshuffling the indexes to have more parent lookups in other indexes, and then I will really
see whether all of them have the same performance. It might boil down to which patch has the nicest solution.
Mine with the WeakReferences is obviously outdated and can be discarded.

I think you have to decide which one you want to choose. I think Christoph's patch uses a
little more memory because it keeps MultiIndexReaderDoc objects in memory, but, instead of
keeping the MultiIndexReaderDoc in memory, you could store the int and long in the DocUUID
(just like we store msb and lsb instead of a UUID instance). Though, we are talking only about
the object overhead (so 1 million MultiIndexReaderDoc objects would imply some 12 MB extra memory,
not really shocking).
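
Something along these lines maybe (just a sketch, the field names are made up and this is not the real Jackrabbit code):

// Sketch only: a cache entry that flattens the values into primitives instead
// of holding a MultiIndexReaderDoc instance, just like the UUID is already
// stored as its msb/lsb longs. Field names are illustrative placeholders.
final class CachedParent {
    private final long uuidMsb;    // most significant bits of the parent UUID
    private final long uuidLsb;    // least significant bits of the parent UUID
    private final int docNumber;   // the int that would otherwise live in MultiIndexReaderDoc
    private final long readerTick; // the long that would otherwise live in MultiIndexReaderDoc

    CachedParent(long uuidMsb, long uuidLsb, int docNumber, long readerTick) {
        this.uuidMsb = uuidMsb;
        this.uuidLsb = uuidLsb;
        this.docNumber = docNumber;
        this.readerTick = readerTick;
    }
}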


But still... during the test, with 1,200,000 nodes in the repository, I realized we are still
doing something 'irrational'. It won't be easy to implement, I think, because it also depends
on whether people have implemented an AccessManager, but if I have the following test:

Query q = qm.createQuery("stuff//*[@count]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}

Since my "stuff//*[@count]" gives me 1,200,000 hits, I think it makes perfect sense to users that,
even with our patches and a working cache, retrieving them all would be slow. But if I
set the limit to 1 or 10, I would expect good performance (certainly when you have not
implemented any AccessManager).

But, if I set the limit to 1, why would we have to check all 1,200,000 parents for whether the path
is correct?

If I get sorted hits from Lucene, I would want to start with the first one and check the
parent, then the second, etc., until I have a hit that is correct according to its path. If I
have a limit of 10, we would need to get 10 successes. Obviously, in the worst-case scenario,
we would still have to check every hit for its parents, but this would be rather exceptional
I think.
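
Roughly what I mean, just as a sketch (Hit and the parent check are placeholders here, not the real Jackrabbit classes):

import java.util.ArrayList;
import java.util.Iterator;
import java.util.List;

// Sketch only: walk the hits in score order and stop as soon as 'limit' hits
// have passed the parent/path check.
final class LazyLimitFilter {

    interface Hit { }

    interface ParentCheck {
        // e.g. the DescendantSelfAxisQuery / ChildAxisQuery constraint
        boolean isParentPathCorrect(Hit hit);
    }

    static List<Hit> firstMatching(Iterator<Hit> sortedHits, ParentCheck check, int limit) {
        List<Hit> result = new ArrayList<Hit>();
        while (sortedHits.hasNext() && result.size() < limit) {
            Hit hit = sortedHits.next();
            if (check.isParentPathCorrect(hit)) {
                result.add(hit);
            }
        }
        // worst case we still touched every hit, but usually we stop much earlier
        return result;
    }
}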

Of course, when people have a custom AccessManager implementation, you only know after the access manager
has run whether the hit was a real hit. But when I have

Query q = qm.createQuery("stuff//*[@count]", Query.XPATH);
if (q instanceof QueryImpl) {
    // limit the result set
    ((QueryImpl) q).setLimit(1);
}

and I get > 1,000,000 hits, then it does not make sense that I have to wait a few seconds even in
the cached version, while changing "stuff//*[@count]" into "//*[@count]" reduces it to a couple
of ms.

I think we should consider whether we could do the DescendantSelfAxisQuery or ChildAxisQuery
as some sort of lazy filter. In the end, when users also want to have the total number of hits for
"stuff//*[@count]", we are obviously still facing a slow query. WDOT? This might belong in a new
JIRA issue, though, or in the existing one about DescendantSelfAxisQuery and ChildAxisQuery performance.


> UUIDDocId cache does not work properly because of weakReferences in combination with new instance for combined indexreader
> ---------------------------------------------------------------------------------------------------------------------------
>
>                 Key: JCR-1213
>                 URL: https://issues.apache.org/jira/browse/JCR-1213
>             Project: Jackrabbit
>          Issue Type: Improvement
>          Components: query
>    Affects Versions: 1.3.3
>            Reporter: Ard Schrijvers
>             Fix For: 1.4
>
>         Attachments:  JCR-1213-ckiehl.txt, JCR-1213-mreutegg.patch, JCR-1213.patch, JCR1213Test.java
>
>
> Queries that use ChildAxisQuery or DescendantSelfAxisQuery make use of getParent() functions to know whether the parents are correct and the result is allowed. The getParent() is called recursively for every hit, and can become very expensive. Hence, in DocId.UUIDDocId, the parents are cached.
> Currently, DocId.UUIDDocId's are cached by having a WeakReference to the CombinedIndexReader, but this CombinedIndexReader is recreated all the time, implying that a gc() is allowed to remove the 'expensive' cache.
> A much better solution is to not have a weakReference to the CombinedIndexReader, but to a reference to each indexreader segment. This means that in getParent(int n) in SearchIndex, the return statement return id.getDocumentNumber(this) needs to be replaced by return id.getDocumentNumber(subReaders[i]); and something similar in CachingMultiReader.
> That is all. Obviously, when a node/property is added/removed/changed, some parts of the cached DocId.UUIDDocId will be invalid, but mainly small indexes are updated frequently, which obviously are less expensive to recompute.
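
A minimal sketch of the per-segment delegation described above (everything except the id.getDocumentNumber(...) idea is a made-up placeholder, not the actual SearchIndex/CachingMultiReader code):

// Sketch only: resolve the parent through the segment reader that actually
// contains the document, so a cached entry can be keyed to the long-lived
// segment instead of the combined reader that is recreated all the time.
// SegmentReader, DocId and the starts[] bookkeeping are placeholders.
final class CombinedReaderSketch {

    interface SegmentReader { }

    interface DocId {
        // resolves (and may cache) the document number relative to 'reader'
        int getDocumentNumber(SegmentReader reader);
    }

    private final SegmentReader[] subReaders;
    private final int[] starts; // first (combined) document number of each segment

    CombinedReaderSketch(SegmentReader[] subReaders, int[] starts) {
        this.subReaders = subReaders;
        this.starts = starts;
    }

    int getParentDocNumber(DocId parentId, int n) {
        int i = readerIndex(n);
        // delegate to subReaders[i] instead of 'this': the segment reader
        // survives rebuilds of the combined reader, so a weakly referenced
        // cache entry keyed on it stays alive
        return starts[i] + parentId.getDocumentNumber(subReaders[i]);
    }

    private int readerIndex(int n) {
        // find the segment whose document range contains n
        int i = 0;
        while (i + 1 < starts.length && starts[i + 1] <= n) {
            i++;
        }
        return i;
    }
}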

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

