On Sun, Feb 5, 2012 at 6:25 PM, Michael Dürig <mduerig@apache.org> wrote:
>
>
> On 4.2.12 20:07, Stefan Guggisberg wrote:
>>
>> On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<mduerig@apache.org> wrote:
>>>
>>>
>>> Hi,
>>>
>>> While working on a new transient space implementation [1] I discovered
>>> that
>>> recreating a list of changes from transient modifications to the
>>> hierarchy
>>> is like opening a can of worms. Rethinking things it turned out, that it
>>> is
>>> much simpler to directly consolidate the change log. This approach is
>>> much
>>> more flexible and versatile since knowledge of the persisted hierarchy is
>>> not necessary at all. Thus this approach should work equally well with
>>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.
>>>
>>> The algorithm is capable of creating minimal change logs from any valid
>>> change log (that is from any change log that can be applied to some
>>> hierarchy).
>>>
>>> I extensively tested the algorithm on randomly created change logs [3].
>>> This
>>> resulted in some interesting numbers: consolidation results in an
>>> decrease
>>> of about 45% of these randomly generated change logs. Furthermore the
>>> resulting size of the H2 database files *decreased up to 200%*!
>>>
>>> The implementation [2] is currently part of the jackrabbitmicrokernel
>>> module in the sandbox. However it does not have any crucial dependencies
>>> so
>>> it should be easy to adapt for other usages.
>>>
>>> The gory details of the algorithm are based on some algebraic properties
>>> of
>>> move operations on paths as detailed in the following paragraphs.
>>>
>>> The change log consolidation algorithm implemented in the reduce method
>>> [2]
>>> is based on algebraic properties of move operations on paths. The other
>>> operations (add node, remove node and set property) are generalized to
>>> move
>>> operations. Consolidation relies on reduction and commutation rules of
>>> move
>>> operations.
>>>
>>> A move operation resembles a map on a hierarchy (of nodes and
>>> properties). A
>>> change log consisting of k move operations m_1 to m_k is thus the
>>> composition of the individual moves: m_1 *, ..., * m_k (using * for
>>> function
>>> composition: f(g(x)) = (f * g)(x)).
>>>
>>> Some definitions, notation and propositions:
>>>
>>> * Let NIL denote a path which never occurs in any hierarchy.
>>>
>>> * Order on paths: let p, q be paths.
>>>  p< q iff p != NIL and q != NIL and p is an ancestor of q.
>>>  p<= q iff p< q or p == q != NIL
>>>
>>> * Conflict of paths: let p, q be paths.
>>>  p ~ q (p conflicts with q) iff either p<= q or q<= p
>>>
>>> * Substitution in paths: let p, q, r be paths.
>>>  [p > q]r = r if p is not an ancestor of r and
>>>  [p > q]r = s where s is the path resulting from replacing
>>> the ancestor p in r with q otherwise.
>>>
>>> * Let p, q be paths. Then p:q denotes a move operation where the
>>> node at p is moved to a new node at q.
>>>
>>> * Valid moves: leq p, q be paths.
>>>  p:q is valid iff p !~ q or p = q
>>>  if p:q is not valid, it is invalid
>>>
>>> Invalid moves are exactly those which would result in a node being moved
>>> to
>>> an ancestor/descendant of itself.
>>>
>>> * Identity on moves: let p, q be paths.
>>>  p:q = ID iff p = q.
>>>
>>> * Conflict on moves: let p, q, r, s be paths.
>>>  p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>>> or q ~ r or q ~ s.
>>>
>>> * Strict commutativity of moves: let m, n be moves.
>>>  m * n = n * m iff m !~ n
>>>
>>> * Substitutions in moves: let p, q, r, s be paths.
>>>  [p > q]r:s = [p > q]r:[p > q]s
>>>
>>> * Let p be a path and let +p denote an add node operation and let p
>>> denote a remove node operation for a node at path p.
>>>  +p = NIL:p That is, adding a node is represented by a move from a
>>> unknown source.
>>>  p = p:NIL. That is, removing a node is represented by a move to an
>>> unknown sink.
>>>
>>>
>>> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID
>>> and
>>> n != ID. Then the following reduction and commutation rules apply:
>>>
>>> 1. p!~ r: m * n = n * m
>>> 2. p< r: illegal (since this implies q<= r which implies p ~ q and
>>> thus m invalid)
>>> 3. p = r: illegal (since this implies q<= r which implies p ~ q and
>>> this m invalid)
>>> 4. p> r: does not commute if q< s. Otherwise m * n = n * [r >
s]m
>>> 5. p!~ s: m * n = n * m
>>> 6. p< s: illegal (since this implies p ~ q and thus m invalid)
>>> 7. p = s: does not commute
>>> 8. p> s: illegal (since p> s implies there is an s already which
>>> will conflict with r:s)
>>> 9. q!~ r: m * n = n * m
>>> 10. q< r: m * n = [q > p]n * m
>>> 11. q = r: m * n = p:s (transitivity of moves)
>>> 12. q> r: m * n = n * [r > s]m
>>> 13. q!~ s: m * n = n * m
>>> 14. q< s: does not commute if p> r. Otherwise m * n = [q > p]n
* m
>>> 15. q = s: illegal (since s conflicts with r:s)
>>> 16. q> s: illegal (since s conflicts with r:s)
>>>
>>> Allowing add node and remove node operations the following additional
>>> conditions apply:
>>>
>>> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
>>> reduction and commutations rules 1. to 16. apply with extra conditions on
>>> 4., 10., 12. and 14.:
>>>
>>> 4'. if s = NIL and q = NIL then m * n = r. Otherwise if s = NIL then
>>> m, n do not commute.
>>> 10'. illegal if p = NIL
>>> 12'. if s = NIL then m * n = r * p
>>> 14'. illegal if p = NIL
>>
>>
>> sounds very complicated ;) i admit that i don't
>> fully understand it (ok, i didn't try very hard
>> and gave up somewhere in the middle).
>
>
> It is actually much simpler than it looks. Have a look at the code and the
> inline comments for additional hints and my explanations below.
>
>
>> could you please provide a simple example?
>
>
> What I basically did is identify all cases of how moves, adds and removes
> interact. This turned out to be much simpler when conceptually handling adds
> as moves from 'nowhere' and deletes as moves to 'nowhere' (aka Trash) and
> then specialising interaction rules for adds and deletes from there.
>
> One kind of interaction is reductions like
>
>>/a:/b, >/b:/c = >/a:/c
> or
> /a, /a/b = /a
> ...
>
> The other kind of interaction is commutation like
>
>>/a:/b, >/c:/d = >/c:/d, >/a:/b
> or
>> /a:/x, +/x/y:{} = +/a/y:{}, >/a:/x
> ...
>
> With a complete list of such interactions (see 1.  16. in my initial post)
> it is possible to reduce a change log to its minimal equivalent by
> subsequent application of reductions and commutations: Given a reduced
> change list and a new operation inserted at index k, try to reduce it with
> the operation at k  1. If this fails commute it with the operation at k  1
> and repeat at k  1 until either a reduction has been found or commutation
> fails. In the latter case do the same going forward (to k + 1). If still no
> reduction could be found, we are done and the change list is minimized.
> Otherwise, recursively apply the reduction algorithm treating the result of
> the reduction as the new operation inserted into the rest of the change
> list.
> This is exactly what is implemented in reduce(List<Operation>, int index).
> See
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbitmicrokernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>
> Example:
>>/a:/x +/x/y:{} >/x:/b = (commutation of 1 and 2, rule 14.)
> +/a/y:{} >/a:/x >/x:/b = (reduction of 2 and 3, rule 11.)
> +/a/y:{} >/a:/b
>
> Some key observations which make reduce() work:
>  The interaction rules do not depend on the persisted hierarchy: If a
> change log can be applied to a hierarchy, its minimized equivalent can be
> applied to the same hierarchy. More over the observable effect on that
> hierarchy will be the same.
>
>  The reduce algorithm is minimizing since whether a reduction can occur or
> not does not depend on 'where it occurs'. Above example could as well reduce
> as follows:
>
>>/a:/x +/x/y:{} >/x:/b = (commutation of 2 and 3, rule 12.)
>>/a:/x >/x:/b +/b/y = (reduction of 1 and 2, rule 11.)
>>/a:/b +/b/y:{}
>
>
>> what is the benefit of this approach compared
>> to the imlementation in jackrabbitcore
>> (which does consolidate transient changes)?
>
>
>  Versatility: it is not bound to any implementation, nor does it need the
> actual hierarchy (i.e. it is stateless). It is basically a realisation of a
> map from change logs to change logs.
ok, agreed.
>
>  Self contained: everything in one single quite small class. No
> dependencies, no clutter.
>
>  Much simpler.
erm, i beg to differ ... ;)
>
>  Provably sound and complete.
>
>  Provably minimizing on the change logs.
>
>  Portable since it doesn't depend on a specific underlying implementation.
>
>  Reduction of up to 45% of the change log communicated from the upper
> layers to the Microkernel.
great! however, to be fair, you're comparing 2 different change logbased
transient space implementations in the sandbox.
transient space implementations don't necessarily need to be change logbased
(see e.g. jackrabbitcore). i assume you would see little (if not no) reduction
in the number of persisted changes per save operation when comparing to the
jackrabbitcore transient space implementation. OTOH, i am not saying
that the jackrabbitcore approach is suitable when using a microkernelbased
backend and i am not questioning the usefulness of consolidated change logs
in general.
>
>  In the case of H2, reduction of up to 200% of database size.
see my previous comment
>
> Apart from the benefits, it is a plain necessity for transient space
> implementations on top of the Microkernel: without proper consolidation
> users could end up not being able to save their (valid) transient
> modifications. Consider a first user doing
>
>>/a:/t/x +/t/x/y:{} >/t/x:/b
>
> if in the meanwhile another user removes x the save operation for the first
> user would fail.
how could another user remove /t/x? it only exists in the first user's
transient space.
cheers
stefan
>
> After consolidation, above change log would look like
>
> +/a/y:{} >/a:/b
>
> and can be saved.
>
> Michael
>
>
>>
>> cheers
>> stefan
>>
>>>
>>> Michael
>>>
>>>
>>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>>> [2]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbitmicrokernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>>> [3]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbitmicrokernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup
