Return-Path: X-Original-To: apmail-jackrabbit-dev-archive@www.apache.org Delivered-To: apmail-jackrabbit-dev-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 8C1049749 for ; Mon, 6 Feb 2012 13:47:13 +0000 (UTC) Received: (qmail 58768 invoked by uid 500); 6 Feb 2012 13:47:13 -0000 Delivered-To: apmail-jackrabbit-dev-archive@jackrabbit.apache.org Received: (qmail 58731 invoked by uid 500); 6 Feb 2012 13:47:12 -0000 Mailing-List: contact dev-help@jackrabbit.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@jackrabbit.apache.org Delivered-To: mailing list dev@jackrabbit.apache.org Received: (qmail 58724 invoked by uid 99); 6 Feb 2012 13:47:12 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 06 Feb 2012 13:47:12 +0000 X-ASF-Spam-Status: No, hits=-0.7 required=5.0 tests=RCVD_IN_DNSWL_LOW,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of stefan.guggisberg@gmail.com designates 209.85.210.170 as permitted sender) Received: from [209.85.210.170] (HELO mail-iy0-f170.google.com) (209.85.210.170) by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 06 Feb 2012 13:47:07 +0000 Received: by iakk32 with SMTP id k32so17613482iak.1 for ; Mon, 06 Feb 2012 05:46:46 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; bh=2JfMnLMfdAT+okiMxrZtJYg3wNmJZFG0a0m+eVySemg=; b=qVcXCIeQIdYC5R7lQtxf8ynk2TNvui6Odyg/IbY8hpMyEm7U7zoLM6Ik0aRk/tCVSO 2LmiocYr68ME2bVv3X7NizvnpMvp/ULnElelTQaCkDhNO18bGW9bNWgjx32VaX+qCs9a fBETXFykrRFZYFuScamhHzcKNR/ev1B892gzM= MIME-Version: 1.0 Received: by 10.50.180.233 with SMTP id dr9mr10454313igc.11.1328536006843; Mon, 06 Feb 2012 05:46:46 -0800 (PST) Received: by 10.42.147.197 with HTTP; Mon, 6 Feb 2012 05:46:46 -0800 (PST) In-Reply-To: <4F2EBBA3.2080802@apache.org> References: <4F2C2B0E.7050508@apache.org> <4F2EBBA3.2080802@apache.org> Date: Mon, 6 Feb 2012 14:46:46 +0100 Message-ID: Subject: Re: [jr3 microkernel] Change log consolidation From: Stefan Guggisberg To: dev@jackrabbit.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable On Sun, Feb 5, 2012 at 6:25 PM, Michael D=FCrig wrote: > > > On 4.2.12 20:07, Stefan Guggisberg wrote: >> >> On Fri, Feb 3, 2012 at 7:44 PM, Michael D=FCrig =A0w= rote: >>> >>> >>> Hi, >>> >>> While working on a new transient space implementation [1] I discovered >>> that >>> recreating a list of changes from transient modifications to the >>> hierarchy >>> is like opening a can of worms. Rethinking things it turned out, that i= t >>> is >>> much simpler to directly consolidate the change log. This approach is >>> much >>> more flexible and versatile since knowledge of the persisted hierarchy = is >>> not necessary at all. Thus this approach should work equally well with >>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI. >>> >>> The algorithm is capable of creating minimal change logs from any valid >>> change log (that is from any change log that can be applied to some >>> hierarchy). >>> >>> I extensively tested the algorithm on randomly created change logs [3]. >>> This >>> resulted in some interesting numbers: consolidation results in an >>> decrease >>> of about 45% of these randomly generated change logs. Furthermore the >>> resulting size of the H2 database files *decreased up to 200%*! >>> >>> The implementation [2] is currently part of the jackrabbit-microkernel >>> module in the sandbox. However it does not have any crucial dependencie= s >>> so >>> it should be easy to adapt for other usages. >>> >>> The gory details of the algorithm are based on some algebraic propertie= s >>> of >>> move operations on paths as detailed in the following paragraphs. >>> >>> The change log consolidation algorithm implemented in the reduce method >>> [2] >>> is based on algebraic properties of move operations on paths. The other >>> operations (add node, remove node and set property) are generalized to >>> move >>> operations. Consolidation relies on reduction and commutation rules of >>> move >>> operations. >>> >>> A move operation resembles a map on a hierarchy (of nodes and >>> properties). A >>> change log consisting of k move operations m_1 to m_k is thus the >>> composition of the individual moves: m_1 *, ..., * m_k (using * for >>> function >>> composition: f(g(x)) =3D (f * g)(x)). >>> >>> Some definitions, notation and propositions: >>> >>> * Let NIL denote a path which never occurs in any hierarchy. >>> >>> * Order on paths: let p, q be paths. >>> =A0- p< =A0q iff p !=3D NIL and q !=3D NIL and p is an ancestor of q. >>> =A0- p<=3D q iff p< =A0q or p =3D=3D q !=3D NIL >>> >>> * Conflict of paths: let p, q be paths. >>> =A0- p ~ q (p conflicts with q) iff either p<=3D q or q<=3D p >>> >>> * Substitution in paths: let p, q, r be paths. >>> =A0- [p -> =A0q]r =3D r if p is not an ancestor of r and >>> =A0- [p -> =A0q]r =3D s where s is the path resulting from replacing >>> =A0 =A0the ancestor p in r with q otherwise. >>> >>> * Let p, q be paths. Then p:q denotes a move operation where the >>> =A0node at p is moved to a new node at q. >>> >>> * Valid moves: leq p, q be paths. >>> =A0- p:q is valid iff p !~ q or p =3D q >>> =A0- if p:q is not valid, it is invalid >>> >>> Invalid moves are exactly those which would result in a node being move= d >>> to >>> an ancestor/descendant of itself. >>> >>> * Identity on moves: let p, q be paths. >>> =A0- p:q =3D ID iff p =3D q. >>> >>> * Conflict on moves: let p, q, r, s be paths. >>> =A0- p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s >>> =A0 =A0or q ~ r or q ~ s. >>> >>> * Strict commutativity of moves: let m, n be moves. >>> =A0- m * n =3D n * m iff m !~ n >>> >>> * Substitutions in moves: let p, q, r, s be paths. >>> =A0- [p -> =A0q]r:s =3D [p -> =A0q]r:[p -> =A0q]s >>> >>> * Let p be a path and let +p denote an add node operation and let -p >>> =A0denote a remove node operation for a node at path p. >>> =A0- +p =3D NIL:p That is, adding a node is represented by a move from = a >>> =A0 =A0unknown source. >>> =A0- p =3D p:NIL. That is, removing a node is represented by a move to = an >>> =A0 =A0unknown sink. >>> >>> >>> Let m =3D p:q, n =3D r:s with p, q, r, s !=3D NIL be valid moves with m= !=3D ID >>> and >>> n !=3D ID. Then the following reduction and commutation rules apply: >>> >>> 1. =A0p!~ r: =A0m * n =3D n * m >>> 2. =A0p< =A0r: =A0illegal (since this implies q<=3D r which implies p ~= q and >>> =A0 =A0 =A0 =A0 =A0 =A0thus m invalid) >>> 3. =A0p =3D r: =A0illegal (since this implies q<=3D r which implies p ~= q and >>> =A0 =A0 =A0 =A0 =A0 =A0this m invalid) >>> 4. =A0p> =A0r: =A0does not commute if q< =A0s. Otherwise m * n =3D n * = [r -> =A0s]m >>> 5. =A0p!~ s: =A0m * n =3D n * m >>> 6. =A0p< =A0s: =A0illegal (since this implies p ~ q and thus m invalid) >>> 7. =A0p =3D s: =A0does not commute >>> 8. =A0p> =A0s: =A0illegal (since p> =A0s implies there is an s already = which >>> =A0 =A0 =A0 =A0 =A0 =A0will conflict with r:s) >>> 9. =A0q!~ r: =A0m * n =3D n * m >>> 10. q< =A0r: =A0m * n =3D [q -> =A0p]n * m >>> 11. q =3D r: =A0m * n =3D p:s (transitivity of moves) >>> 12. q> =A0r: =A0m * n =3D n * [r -> =A0s]m >>> 13. q!~ s: =A0m * n =3D n * m >>> 14. q< =A0s: =A0does not commute if p> =A0r. Otherwise m * n =3D [q -> = =A0p]n * m >>> 15. q =3D s: =A0illegal (since s conflicts with r:s) >>> 16. q> =A0s: =A0illegal (since s conflicts with r:s) >>> >>> Allowing add node and remove node operations the following additional >>> conditions apply: >>> >>> Let m =3D p:q, n =3D r:s be valid moves with m !=3D ID and n !=3D ID. T= hen the >>> reduction and commutations rules 1. to 16. apply with extra conditions = on >>> 4., 10., 12. and 14.: >>> >>> 4'. =A0if s =3D NIL and q =3D NIL then m * n =3D -r. Otherwise if s =3D= NIL then >>> =A0 =A0 m, n do not commute. >>> 10'. illegal if p =3D NIL >>> 12'. if s =3D NIL then m * n =3D -r * -p >>> 14'. illegal if p =3D NIL >> >> >> sounds very complicated ;) i admit that i don't >> fully understand it (ok, i didn't try very hard >> and gave up somewhere in the middle). > > > It is actually much simpler than it looks. Have a look at the code and th= e > inline comments for additional hints and my explanations below. > > >> could you please provide a simple example? > > > What I basically did is identify all cases of how moves, adds and removes > interact. This turned out to be much simpler when conceptually handling a= dds > as moves from 'nowhere' and deletes as moves to 'nowhere' (aka Trash) and > then specialising interaction rules for adds and deletes from there. > > One kind of interaction is reductions like > >>/a:/b, >/b:/c =3D >/a:/c > or > -/a, -/a/b =3D -/a > ... > > The other kind of interaction is commutation like > >>/a:/b, >/c:/d =3D >/c:/d, >/a:/b > or >> /a:/x, +/x/y:{} =3D +/a/y:{}, >/a:/x > ... > > With a complete list of such interactions (see 1. - 16. in my initial pos= t) > it is possible to reduce a change log to its minimal equivalent by > subsequent application of reductions and commutations: Given a reduced > change list and a new operation inserted at index k, try to reduce it wit= h > the operation at k - 1. If this fails commute it with the operation at k = - 1 > and repeat at k - 1 until either a reduction has been found or commutatio= n > fails. In the latter case do the same going forward (to k + 1). If still = no > reduction could be found, we are done and the change list is minimized. > Otherwise, recursively apply the reduction algorithm treating the result = of > the reduction as the new operation inserted into the rest of the change > list. > This is exactly what is implemented in reduce(List, int index)= . > See > http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/sr= c/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=3Dmarkup > > Example: >>/a:/x +/x/y:{} >/x:/b =3D =A0(commutation of 1 and 2, rule 14.) > +/a/y:{} >/a:/x >/x:/b =3D =A0(reduction of 2 and 3, rule 11.) > +/a/y:{} >/a:/b > > Some key observations which make reduce() work: > - The interaction rules do not depend on the persisted hierarchy: If a > change log can be applied to a hierarchy, its minimized equivalent can be > applied to the same hierarchy. More over the observable effect on that > hierarchy will be the same. > > - The reduce algorithm is minimizing since whether a reduction can occur = or > not does not depend on 'where it occurs'. Above example could as well red= uce > as follows: > >>/a:/x +/x/y:{} >/x:/b =3D =A0(commutation of 2 and 3, rule 12.) >>/a:/x >/x:/b +/b/y =3D =A0 =A0 (reduction of 1 and 2, rule 11.) >>/a:/b +/b/y:{} > > >> what is the benefit of this approach compared >> to the imlementation in jackrabbit-core >> (which does consolidate transient changes)? > > > - Versatility: it is not bound to any implementation, nor does it need th= e > actual hierarchy (i.e. it is stateless). It is basically a realisation of= a > map from change logs to change logs. ok, agreed. > > - Self contained: everything in one single quite small class. No > dependencies, no clutter. > > - Much simpler. erm, i beg to differ ... ;) > > - Provably sound and complete. > > - Provably minimizing on the change logs. > > - Portable since it doesn't depend on a specific underlying implementatio= n. > > - Reduction of up to 45% of the change log communicated from the upper > layers to the Microkernel. great! however, to be fair, you're comparing 2 different change log-based transient space implementations in the sandbox. transient space implementations don't necessarily need to be change log-bas= ed (see e.g. jackrabbit-core). i assume you would see little (if not no) reduc= tion in the number of persisted changes per save operation when comparing to the jackrabbit-core transient space implementation. OTOH, i am not saying that the jackrabbit-core approach is suitable when using a microkernel-base= d backend and i am not questioning the usefulness of consolidated change logs in general. > > - In the case of H2, reduction of up to 200% of database size. see my previous comment > > Apart from the benefits, it is a plain necessity for transient space > implementations on top of the Microkernel: without proper consolidation > users could end up not being able to save their (valid) transient > modifications. Consider a first user doing > >>/a:/t/x +/t/x/y:{} >/t/x:/b > > if in the meanwhile another user removes x the save operation for the fir= st > user would fail. how could another user remove /t/x? it only exists in the first user's transient space. cheers stefan > > After consolidation, above change log would look like > > +/a/y:{} >/a:/b > > and can be saved. > > Michael > > >> >> cheers >> stefan >> >>> >>> Michael >>> >>> >>> [1] http://markmail.org/message/qkkcvtmtapas2cx4 >>> [2] >>> >>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/= src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=3Dmarkup >>> [3] >>> >>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/= src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=3Dmar= kup