Mailing-List: contact dev-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@jackrabbit.apache.org
Date: Thu, 4 Apr 2013 10:15:15 +0000 (UTC)
From: "Lukas Eder (JIRA)" <jira@apache.org>
To: dev@jackrabbit.apache.org
Message-ID: <JIRA.12640734.1365068428016.106011.1365070515235@arcas>
In-Reply-To: <JIRA.12640734.1365068428016@arcas>
References: <JIRA.12640734.1365068428016@arcas>
Subject: [jira] [Commented] (JCR-3557) Inefficient deletion of ChangeLog
 objects by AbstractBundlePersistenceManager
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/JCR-3557?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13621996#comment-13621996 ] 

Lukas Eder commented on JCR-3557:
---------------------------------

... note, I'm aware of some implementations (including the one from the screenshot) using NODE_ID_HI and NODE_ID_LO, which makes using the "IN predicate" a bit more tricky. Some databases support tuples, e.g.

    (NODE_ID_HI, NODE_ID_LO) IN ((?, ?), (?, ?), (..., ...), ...)

Others would have to resort to "expanding" the tuples:

    (NODE_ID_HI = ? AND NODE_ID_LO = ?) OR (NODE_ID_HI = ? AND NODE_ID_LO = ?) OR (...)

Besides, there are lots of other limits to consider:

- Oracle's IN predicate can only contain 1000 elements
- SQL Server and Sybase have a limit of around 2000 bind variables per query

See also:
http://stackoverflow.com/q/11266609/521799
                
> Inefficient deletion of ChangeLog objects by AbstractBundlePersistenceManager
> -----------------------------------------------------------------------------
>
>                 Key: JCR-3557
>                 URL: https://issues.apache.org/jira/browse/JCR-3557
>             Project: Jackrabbit Content Repository
>          Issue Type: Bug
>          Components: jackrabbit-core
>    Affects Versions: 2.6
>            Reporter: Lukas Eder
>            Priority: Minor
>         Attachments: Yourkit_Screenshot_SQL.png, Yourkit_Screenshot_Stacktrace.png
>
>
> Deleting a tree of nodes in Jackrabbit results in a prohibitive amount of SQL queries being executed when using the BundleDbPersistenceManager. Please consider the attached Screenshots from Yourkit Profiler, which illustrate how saving the deletion of a single node through the JCR API results in 34 SQL DELETE statements for my test case:
> - Yourkit_Screenshot_Stacktrace.png: The stack trace from a single deletion. Please disregard the absolute values of the time column. It is biased by the profiling session.
> - Yourkit_Screenshot_SQL.png: An extract of the 34 SQL statements issued by the selected stack subtree
> While other operations are probably not very optimal either, this one is a very low hanging fruit, in my opinion. Consider the following piece of code in AbstractBundlePersistenceManager.storeInternal():
> {code}
> for (ItemState state : changeLog.deletedStates()) {
>     if (state.isNode()) {
>         NodePropBundle bundle = getBundle((NodeId) state.getId());
>         if (bundle == null) {
>             throw new NoSuchItemStateException(state.getId().toString());
>         }
>         deleteBundle(bundle);
>         deleted.add(state.getId());
>     }
> }
> {code}
> Now, instead of iterating over deletedStates and loading / deleting them one by one, they should be bulk-loaded / bulk-deleted as such:
> {code}
> List<NodeId> nodeIds = new ArrayList<NodeId>();
> for (ItemState state : changeLog.deletedStates()) {
>     if (state.isNode()) {
>         nodeIds.add((NodeId) state.getId());
>     }
> }
> List<NodePropBundle> bundles = new ArrayList<NodePropBundle>();
> bundles.addAll(getBundles(nodeIds)); // Can throw NoSuchItemStateException
> deleteBundles(bundles);
> deleted.addAll(nodeIds);
> {code}
> For backwards-compatibility and convenience, AbstractBundlePersistenceManager would provide trivial default implementations for getBundles() and deleteBundles():
> {code}
> protected List<NodePropBundle> getBundles(Collection<? extends NodeId> nodeIds) throws ItemStateException {
>     List<NodePropBundle> result = new ArrayList<NodePropBundle>();
>     for (NodeId nodeId : nodeIds) {
>         result.add(getBundle(nodeId));
>     }
>     return result;
> }
> protected void deleteBundles(Collection<? extends NodePropBundle> bundles) throws ItemStateException {
>     for (NodePropBundle bundle : bundles) {
>         deleteBundle(bundle);
>     }
> }
> {code}
> But the BundleDbPersistenceManager could override this default behaviour by executing something like
> {code}
> -- getBundles()
> select NODE_ID, BUNDLE_DATA from BUNDLE where NODE_ID in (?, ?, ...)
> -- deleteBundles()
> delete from BUNDLE where NODE_ID in (?, ?, ...)
> {code}
> The same also applies for deleting from the REFS table.
> If this seems like a viable improvement and if I didn't overlook something subtle where one-by-one selection / deletion of bundle data is important, I could go ahead and provide a patch for this issue. Once such an improvement is in place, other bundle persistence managers might take advantage of the new algorithm as well. Also, we could review adding / updating bundles in another JIRA ticket.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira