jackrabbit-dev mailing list archives

From "Unico Hommes (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (JCR-3263) Consistency checker performance improvements
Date Tue, 27 Mar 2012 07:41:32 GMT

    [ https://issues.apache.org/jira/browse/JCR-3263?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239284#comment-13239284 ]

Unico Hommes commented on JCR-3263:
-----------------------------------

IMO the method you propose would not make much sense for the consistency checker. To use
it, you would have to do something like:

Collection<NodeId> nodeIds = pm.getAllNodeIds(after, maxcount);
Map<NodeId, NodeInfo> nodeInfos = pm.getNodeInfos(nodeIds);

Underneath, this would result in two database calls:
SELECT NODE_ID FROM WS_BUNDLE WHERE NODE_ID > ${after} etc.
and 
SELECT NODE_ID, BUNDLE_DATA FROM WS_BUNDLE WHERE NODE_ID IN (?,?,? etc.

At least for the consistency checker, I don't think it makes sense to issue two queries
where a single query returning both the ids and the bundle data would do.
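
To make the comparison concrete, here is a minimal sketch of the two access patterns. It is not Jackrabbit code: FakeBundleStore is a hypothetical in-memory stand-in for the bundle table, node ids are plain strings, the parent id stands in for the bundle data, and the "queries" counter only simulates database round trips. The paged getAllNodeInfos(after, maxCount) signature is likewise an assumption made for illustration.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.NavigableMap;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for a bundle table; not Jackrabbit code. */
final class FakeBundleStore {
    // node id -> parent id (stands in for the structural data of a bundle)
    private final NavigableMap<String, String> rows = new TreeMap<>();
    int queries = 0; // counts simulated database round trips

    void put(String nodeId, String parentId) {
        rows.put(nodeId, parentId);
    }

    // Rows with an id strictly greater than 'after'; all rows if 'after' is null.
    private NavigableMap<String, String> tail(String after) {
        return after == null ? rows : rows.tailMap(after, false);
    }

    // Roughly: SELECT NODE_ID ... WHERE NODE_ID > after
    List<String> getAllNodeIds(String after, int maxCount) {
        queries++;
        List<String> ids = new ArrayList<>();
        for (String id : tail(after).keySet()) {
            if (ids.size() == maxCount) break;
            ids.add(id);
        }
        return ids;
    }

    // Roughly: SELECT NODE_ID, BUNDLE_DATA ... WHERE NODE_ID IN (...)
    Map<String, String> getNodeInfos(Collection<String> ids) {
        queries++;
        Map<String, String> result = new LinkedHashMap<>();
        for (String id : ids) {
            if (rows.containsKey(id)) result.put(id, rows.get(id));
        }
        return result;
    }

    // Assumed single-query variant: one range scan returns ids and data together.
    Map<String, String> getAllNodeInfos(String after, int maxCount) {
        queries++;
        Map<String, String> result = new LinkedHashMap<>();
        for (Map.Entry<String, String> e : tail(after).entrySet()) {
            if (result.size() == maxCount) break;
            result.put(e.getKey(), e.getValue());
        }
        return result;
    }
}

final class TwoCallsDemo {
    public static void main(String[] args) {
        FakeBundleStore pm = new FakeBundleStore();
        pm.put("a", null);
        pm.put("b", "a");
        pm.put("c", "a");

        // Two-step usage: ids first, then a lookup by id.
        Map<String, String> twoStep = pm.getNodeInfos(pm.getAllNodeIds(null, 2));
        System.out.println(twoStep.keySet() + " after " + pm.queries + " queries");

        // Single-step usage: same batch, one more query.
        Map<String, String> oneStep = pm.getAllNodeInfos(null, 2);
        System.out.println(oneStep.keySet() + " after " + pm.queries + " queries in total");
    }
}
```

Both paths return the same batch; the only difference in this sketch is the number of simulated round trips per batch.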

The reason the patch is so large is that it includes patches for 4 other issues. It will be
some work to create a patch for this issue alone but if you need it I would gladly provide
it.

                
> Consistency checker performance improvements
> --------------------------------------------
>
>                 Key: JCR-3263
>                 URL: https://issues.apache.org/jira/browse/JCR-3263
>             Project: Jackrabbit Content Repository
>          Issue Type: Improvement
>            Reporter: Unico Hommes
>         Attachments: checkerperformance.patch
>
>
> Currently the consistency checker loads a batch of node ids and, for each node id, fetches
> the corresponding bundle, its child bundles, and its parent bundle separately. This makes the
> consistency checker perform less than optimally; it may take hours (days?) to complete for
> large repositories.
> I've been able to make the checker execute about 20 times faster on my local machine
> by loading batches of node prop bundles at once. For 17000 nodes in the workspace the current
> implementation ran for about 23 seconds, whereas with the enhancements I made it finished in
> 1.2 seconds.
> The problem is that loading node prop bundles in batches may require a lot of memory, and
> how much memory a given batch size needs is hard to predict because the sizes of the
> individual bundles are unpredictable.
> Also, the node prop bundle contains much more information than is needed for a consistency
> check.
> What would be ideal in this situation is to introduce a new type - call it NodeInfo -
> that contains only the structural information the checker needs to do its work: the
> node id, the parent id and the child ids. To allow for a possible future referential
> integrity check, perhaps also its reference-type properties.
> The IterablePersistenceManager interface would then get an additional method:
> Map<NodeId, NodeInfo> getAllNodeInfos();
> If this is an acceptable proposal I would like to work on this and contribute a patch.
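
As an illustration of the proposal quoted above, a minimal sketch of what such a NodeInfo type and a structural check over a Map of them might look like. Everything here is an assumption for illustration: ids are plain strings rather than NodeId, the class layout and the StructureCheck helper are hypothetical, and the rules checked (parent exists, child exists, child points back to its parent) are a simplification, not Jackrabbit's actual checker logic.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical structural summary of a node: ids only, no property data. */
final class NodeInfo {
    final String id;
    final String parentId;       // null for the root node
    final List<String> childIds;

    NodeInfo(String id, String parentId, String... childIds) {
        this.id = id;
        this.parentId = parentId;
        this.childIds = Arrays.asList(childIds);
    }
}

final class StructureCheck {
    /** Returns one human-readable message per structural problem found. */
    static List<String> check(Map<String, NodeInfo> infos) {
        List<String> problems = new ArrayList<>();
        for (NodeInfo info : infos.values()) {
            // A non-root node must have a parent that exists.
            if (info.parentId != null && !infos.containsKey(info.parentId)) {
                problems.add(info.id + ": missing parent " + info.parentId);
            }
            for (String childId : info.childIds) {
                NodeInfo child = infos.get(childId);
                if (child == null) {
                    // Listed as a child entry but no such node exists.
                    problems.add(info.id + ": missing child " + childId);
                } else if (!info.id.equals(child.parentId)) {
                    // The child exists but names a different parent.
                    problems.add(info.id + ": child " + childId
                            + " has parent " + child.parentId);
                }
            }
        }
        return problems;
    }

    public static void main(String[] args) {
        Map<String, NodeInfo> infos = new LinkedHashMap<>();
        infos.put("root", new NodeInfo("root", null, "a", "b"));
        infos.put("a", new NodeInfo("a", "root"));
        // "b" is listed as a child of root but is absent from the map.
        System.out.println(check(infos));
    }
}
```

Because the check only touches ids, the memory needed per batch depends on the number of nodes and child entries rather than on the size of the property data.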

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        
