On 4.2.12 20:07, Stefan Guggisberg wrote:
> On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<mduerig@apache.org> wrote:
>>
>> Hi,
>>
>> While working on a new transient space implementation [1] I discovered that
>> recreating a list of changes from transient modifications to the hierarchy
>> is like opening a can of worms. Rethinking things it turned out, that it is
>> much simpler to directly consolidate the change log. This approach is much
>> more flexible and versatile since knowledge of the persisted hierarchy is
>> not necessary at all. Thus this approach should work equally well with
>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.
>>
>> The algorithm is capable of creating minimal change logs from any valid
>> change log (that is from any change log that can be applied to some
>> hierarchy).
>>
>> I extensively tested the algorithm on randomly created change logs [3]. This
>> resulted in some interesting numbers: consolidation results in an decrease
>> of about 45% of these randomly generated change logs. Furthermore the
>> resulting size of the H2 database files *decreased up to 200%*!
>>
>> The implementation [2] is currently part of the jackrabbitmicrokernel
>> module in the sandbox. However it does not have any crucial dependencies so
>> it should be easy to adapt for other usages.
>>
>> The gory details of the algorithm are based on some algebraic properties of
>> move operations on paths as detailed in the following paragraphs.
>>
>> The change log consolidation algorithm implemented in the reduce method [2]
>> is based on algebraic properties of move operations on paths. The other
>> operations (add node, remove node and set property) are generalized to move
>> operations. Consolidation relies on reduction and commutation rules of move
>> operations.
>>
>> A move operation resembles a map on a hierarchy (of nodes and properties). A
>> change log consisting of k move operations m_1 to m_k is thus the
>> composition of the individual moves: m_1 *, ..., * m_k (using * for function
>> composition: f(g(x)) = (f * g)(x)).
>>
>> Some definitions, notation and propositions:
>>
>> * Let NIL denote a path which never occurs in any hierarchy.
>>
>> * Order on paths: let p, q be paths.
>>  p< q iff p != NIL and q != NIL and p is an ancestor of q.
>>  p<= q iff p< q or p == q != NIL
>>
>> * Conflict of paths: let p, q be paths.
>>  p ~ q (p conflicts with q) iff either p<= q or q<= p
>>
>> * Substitution in paths: let p, q, r be paths.
>>  [p > q]r = r if p is not an ancestor of r and
>>  [p > q]r = s where s is the path resulting from replacing
>> the ancestor p in r with q otherwise.
>>
>> * Let p, q be paths. Then p:q denotes a move operation where the
>> node at p is moved to a new node at q.
>>
>> * Valid moves: leq p, q be paths.
>>  p:q is valid iff p !~ q or p = q
>>  if p:q is not valid, it is invalid
>>
>> Invalid moves are exactly those which would result in a node being moved to
>> an ancestor/descendant of itself.
>>
>> * Identity on moves: let p, q be paths.
>>  p:q = ID iff p = q.
>>
>> * Conflict on moves: let p, q, r, s be paths.
>>  p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>> or q ~ r or q ~ s.
>>
>> * Strict commutativity of moves: let m, n be moves.
>>  m * n = n * m iff m !~ n
>>
>> * Substitutions in moves: let p, q, r, s be paths.
>>  [p > q]r:s = [p > q]r:[p > q]s
>>
>> * Let p be a path and let +p denote an add node operation and let p
>> denote a remove node operation for a node at path p.
>>  +p = NIL:p That is, adding a node is represented by a move from a
>> unknown source.
>>  p = p:NIL. That is, removing a node is represented by a move to an
>> unknown sink.
>>
>>
>> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID and
>> n != ID. Then the following reduction and commutation rules apply:
>>
>> 1. p!~ r: m * n = n * m
>> 2. p< r: illegal (since this implies q<= r which implies p ~ q and
>> thus m invalid)
>> 3. p = r: illegal (since this implies q<= r which implies p ~ q and
>> this m invalid)
>> 4. p> r: does not commute if q< s. Otherwise m * n = n * [r > s]m
>> 5. p!~ s: m * n = n * m
>> 6. p< s: illegal (since this implies p ~ q and thus m invalid)
>> 7. p = s: does not commute
>> 8. p> s: illegal (since p> s implies there is an s already which
>> will conflict with r:s)
>> 9. q!~ r: m * n = n * m
>> 10. q< r: m * n = [q > p]n * m
>> 11. q = r: m * n = p:s (transitivity of moves)
>> 12. q> r: m * n = n * [r > s]m
>> 13. q!~ s: m * n = n * m
>> 14. q< s: does not commute if p> r. Otherwise m * n = [q > p]n * m
>> 15. q = s: illegal (since s conflicts with r:s)
>> 16. q> s: illegal (since s conflicts with r:s)
>>
>> Allowing add node and remove node operations the following additional
>> conditions apply:
>>
>> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
>> reduction and commutations rules 1. to 16. apply with extra conditions on
>> 4., 10., 12. and 14.:
>>
>> 4'. if s = NIL and q = NIL then m * n = r. Otherwise if s = NIL then
>> m, n do not commute.
>> 10'. illegal if p = NIL
>> 12'. if s = NIL then m * n = r * p
>> 14'. illegal if p = NIL
>
> sounds very complicated ;) i admit that i don't
> fully understand it (ok, i didn't try very hard
> and gave up somewhere in the middle).
It is actually much simpler than it looks. Have a look at the code and
the inline comments for additional hints and my explanations below.
> could you please provide a simple example?
What I basically did is identify all cases of how moves, adds and
removes interact. This turned out to be much simpler when conceptually
handling adds as moves from 'nowhere' and deletes as moves to 'nowhere'
(aka Trash) and then specialising interaction rules for adds and deletes
from there.
One kind of interaction is reductions like
>/a:/b, >/b:/c = >/a:/c
or
/a, /a/b = /a
...
The other kind of interaction is commutation like
>/a:/b, >/c:/d = >/c:/d, >/a:/b
or
> /a:/x, +/x/y:{} = +/a/y:{}, >/a:/x
...
With a complete list of such interactions (see 1.  16. in my initial
post) it is possible to reduce a change log to its minimal equivalent by
subsequent application of reductions and commutations: Given a reduced
change list and a new operation inserted at index k, try to reduce it
with the operation at k  1. If this fails commute it with the operation
at k  1 and repeat at k  1 until either a reduction has been found or
commutation fails. In the latter case do the same going forward (to k +
1). If still no reduction could be found, we are done and the change
list is minimized. Otherwise, recursively apply the reduction algorithm
treating the result of the reduction as the new operation inserted into
the rest of the change list.
This is exactly what is implemented in reduce(List<Operation>, int
index). See
http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbitmicrokernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
Example:
>/a:/x +/x/y:{} >/x:/b = (commutation of 1 and 2, rule 14.)
+/a/y:{} >/a:/x >/x:/b = (reduction of 2 and 3, rule 11.)
+/a/y:{} >/a:/b
Some key observations which make reduce() work:
 The interaction rules do not depend on the persisted hierarchy: If a
change log can be applied to a hierarchy, its minimized equivalent can
be applied to the same hierarchy. More over the observable effect on
that hierarchy will be the same.
 The reduce algorithm is minimizing since whether a reduction can occur
or not does not depend on 'where it occurs'. Above example could as well
reduce as follows:
>/a:/x +/x/y:{} >/x:/b = (commutation of 2 and 3, rule 12.)
>/a:/x >/x:/b +/b/y = (reduction of 1 and 2, rule 11.)
>/a:/b +/b/y:{}
> what is the benefit of this approach compared
> to the imlementation in jackrabbitcore
> (which does consolidate transient changes)?
 Versatility: it is not bound to any implementation, nor does it need
the actual hierarchy (i.e. it is stateless). It is basically a
realisation of a map from change logs to change logs.
 Self contained: everything in one single quite small class. No
dependencies, no clutter.
 Much simpler.
 Provably sound and complete.
 Provably minimizing on the change logs.
 Portable since it doesn't depend on a specific underlying implementation.
 Reduction of up to 45% of the change log communicated from the upper
layers to the Microkernel.
 In the case of H2, reduction of up to 200% of database size.
Apart from the benefits, it is a plain necessity for transient space
implementations on top of the Microkernel: without proper consolidation
users could end up not being able to save their (valid) transient
modifications. Consider a first user doing
>/a:/t/x +/t/x/y:{} >/t/x:/b
if in the meanwhile another user removes x the save operation for the
first user would fail.
After consolidation, above change log would look like
+/a/y:{} >/a:/b
and can be saved.
Michael
>
> cheers
> stefan
>
>>
>> Michael
>>
>>
>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>> [2]
>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbitmicrokernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>> [3]
>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbitmicrokernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup
