jackrabbit-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefan Guggisberg <stefan.guggisb...@gmail.com>
Subject Re: [jr3 microkernel] Change log consolidation
Date Mon, 06 Feb 2012 13:46:46 GMT
On Sun, Feb 5, 2012 at 6:25 PM, Michael Dürig <mduerig@apache.org> wrote:
>
>
> On 4.2.12 20:07, Stefan Guggisberg wrote:
>>
>> On Fri, Feb 3, 2012 at 7:44 PM, Michael Dürig<mduerig@apache.org>  wrote:
>>>
>>>
>>> Hi,
>>>
>>> While working on a new transient space implementation [1] I discovered
>>> that
>>> recreating a list of changes from transient modifications to the
>>> hierarchy
>>> is like opening a can of worms. Rethinking things it turned out, that it
>>> is
>>> much simpler to directly consolidate the change log. This approach is
>>> much
>>> more flexible and versatile since knowledge of the persisted hierarchy is
>>> not necessary at all. Thus this approach should work equally well with
>>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.
>>>
>>> The algorithm is capable of creating minimal change logs from any valid
>>> change log (that is from any change log that can be applied to some
>>> hierarchy).
>>>
>>> I extensively tested the algorithm on randomly created change logs [3].
>>> This
>>> resulted in some interesting numbers: consolidation results in an
>>> decrease
>>> of about 45% of these randomly generated change logs. Furthermore the
>>> resulting size of the H2 database files *decreased up to 200%*!
>>>
>>> The implementation [2] is currently part of the jackrabbit-microkernel
>>> module in the sandbox. However it does not have any crucial dependencies
>>> so
>>> it should be easy to adapt for other usages.
>>>
>>> The gory details of the algorithm are based on some algebraic properties
>>> of
>>> move operations on paths as detailed in the following paragraphs.
>>>
>>> The change log consolidation algorithm implemented in the reduce method
>>> [2]
>>> is based on algebraic properties of move operations on paths. The other
>>> operations (add node, remove node and set property) are generalized to
>>> move
>>> operations. Consolidation relies on reduction and commutation rules of
>>> move
>>> operations.
>>>
>>> A move operation resembles a map on a hierarchy (of nodes and
>>> properties). A
>>> change log consisting of k move operations m_1 to m_k is thus the
>>> composition of the individual moves: m_1 *, ..., * m_k (using * for
>>> function
>>> composition: f(g(x)) = (f * g)(x)).
>>>
>>> Some definitions, notation and propositions:
>>>
>>> * Let NIL denote a path which never occurs in any hierarchy.
>>>
>>> * Order on paths: let p, q be paths.
>>>  - p<  q iff p != NIL and q != NIL and p is an ancestor of q.
>>>  - p<= q iff p<  q or p == q != NIL
>>>
>>> * Conflict of paths: let p, q be paths.
>>>  - p ~ q (p conflicts with q) iff either p<= q or q<= p
>>>
>>> * Substitution in paths: let p, q, r be paths.
>>>  - [p ->  q]r = r if p is not an ancestor of r and
>>>  - [p ->  q]r = s where s is the path resulting from replacing
>>>    the ancestor p in r with q otherwise.
>>>
>>> * Let p, q be paths. Then p:q denotes a move operation where the
>>>  node at p is moved to a new node at q.
>>>
>>> * Valid moves: leq p, q be paths.
>>>  - p:q is valid iff p !~ q or p = q
>>>  - if p:q is not valid, it is invalid
>>>
>>> Invalid moves are exactly those which would result in a node being moved
>>> to
>>> an ancestor/descendant of itself.
>>>
>>> * Identity on moves: let p, q be paths.
>>>  - p:q = ID iff p = q.
>>>
>>> * Conflict on moves: let p, q, r, s be paths.
>>>  - p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>>>    or q ~ r or q ~ s.
>>>
>>> * Strict commutativity of moves: let m, n be moves.
>>>  - m * n = n * m iff m !~ n
>>>
>>> * Substitutions in moves: let p, q, r, s be paths.
>>>  - [p ->  q]r:s = [p ->  q]r:[p ->  q]s
>>>
>>> * Let p be a path and let +p denote an add node operation and let -p
>>>  denote a remove node operation for a node at path p.
>>>  - +p = NIL:p That is, adding a node is represented by a move from a
>>>    unknown source.
>>>  - p = p:NIL. That is, removing a node is represented by a move to an
>>>    unknown sink.
>>>
>>>
>>> Let m = p:q, n = r:s with p, q, r, s != NIL be valid moves with m != ID
>>> and
>>> n != ID. Then the following reduction and commutation rules apply:
>>>
>>> 1.  p!~ r:  m * n = n * m
>>> 2.  p<  r:  illegal (since this implies q<= r which implies p ~ q and
>>>            thus m invalid)
>>> 3.  p = r:  illegal (since this implies q<= r which implies p ~ q and
>>>            this m invalid)
>>> 4.  p>  r:  does not commute if q<  s. Otherwise m * n = n * [r ->
 s]m
>>> 5.  p!~ s:  m * n = n * m
>>> 6.  p<  s:  illegal (since this implies p ~ q and thus m invalid)
>>> 7.  p = s:  does not commute
>>> 8.  p>  s:  illegal (since p>  s implies there is an s already which
>>>            will conflict with r:s)
>>> 9.  q!~ r:  m * n = n * m
>>> 10. q<  r:  m * n = [q ->  p]n * m
>>> 11. q = r:  m * n = p:s (transitivity of moves)
>>> 12. q>  r:  m * n = n * [r ->  s]m
>>> 13. q!~ s:  m * n = n * m
>>> 14. q<  s:  does not commute if p>  r. Otherwise m * n = [q ->  p]n
* m
>>> 15. q = s:  illegal (since s conflicts with r:s)
>>> 16. q>  s:  illegal (since s conflicts with r:s)
>>>
>>> Allowing add node and remove node operations the following additional
>>> conditions apply:
>>>
>>> Let m = p:q, n = r:s be valid moves with m != ID and n != ID. Then the
>>> reduction and commutations rules 1. to 16. apply with extra conditions on
>>> 4., 10., 12. and 14.:
>>>
>>> 4'.  if s = NIL and q = NIL then m * n = -r. Otherwise if s = NIL then
>>>     m, n do not commute.
>>> 10'. illegal if p = NIL
>>> 12'. if s = NIL then m * n = -r * -p
>>> 14'. illegal if p = NIL
>>
>>
>> sounds very complicated ;) i admit that i don't
>> fully understand it (ok, i didn't try very hard
>> and gave up somewhere in the middle).
>
>
> It is actually much simpler than it looks. Have a look at the code and the
> inline comments for additional hints and my explanations below.
>
>
>> could you please provide a simple example?
>
>
> What I basically did is identify all cases of how moves, adds and removes
> interact. This turned out to be much simpler when conceptually handling adds
> as moves from 'nowhere' and deletes as moves to 'nowhere' (aka Trash) and
> then specialising interaction rules for adds and deletes from there.
>
> One kind of interaction is reductions like
>
>>/a:/b, >/b:/c = >/a:/c
> or
> -/a, -/a/b = -/a
> ...
>
> The other kind of interaction is commutation like
>
>>/a:/b, >/c:/d = >/c:/d, >/a:/b
> or
>> /a:/x, +/x/y:{} = +/a/y:{}, >/a:/x
> ...
>
> With a complete list of such interactions (see 1. - 16. in my initial post)
> it is possible to reduce a change log to its minimal equivalent by
> subsequent application of reductions and commutations: Given a reduced
> change list and a new operation inserted at index k, try to reduce it with
> the operation at k - 1. If this fails commute it with the operation at k - 1
> and repeat at k - 1 until either a reduction has been found or commutation
> fails. In the latter case do the same going forward (to k + 1). If still no
> reduction could be found, we are done and the change list is minimized.
> Otherwise, recursively apply the reduction algorithm treating the result of
> the reduction as the new operation inserted into the rest of the change
> list.
> This is exactly what is implemented in reduce(List<Operation>, int index).
> See
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>
> Example:
>>/a:/x +/x/y:{} >/x:/b =  (commutation of 1 and 2, rule 14.)
> +/a/y:{} >/a:/x >/x:/b =  (reduction of 2 and 3, rule 11.)
> +/a/y:{} >/a:/b
>
> Some key observations which make reduce() work:
> - The interaction rules do not depend on the persisted hierarchy: If a
> change log can be applied to a hierarchy, its minimized equivalent can be
> applied to the same hierarchy. More over the observable effect on that
> hierarchy will be the same.
>
> - The reduce algorithm is minimizing since whether a reduction can occur or
> not does not depend on 'where it occurs'. Above example could as well reduce
> as follows:
>
>>/a:/x +/x/y:{} >/x:/b =  (commutation of 2 and 3, rule 12.)
>>/a:/x >/x:/b +/b/y =     (reduction of 1 and 2, rule 11.)
>>/a:/b +/b/y:{}
>
>
>> what is the benefit of this approach compared
>> to the imlementation in jackrabbit-core
>> (which does consolidate transient changes)?
>
>
> - Versatility: it is not bound to any implementation, nor does it need the
> actual hierarchy (i.e. it is stateless). It is basically a realisation of a
> map from change logs to change logs.

ok, agreed.

>
> - Self contained: everything in one single quite small class. No
> dependencies, no clutter.
>
> - Much simpler.

erm, i beg to differ ... ;)

>
> - Provably sound and complete.
>
> - Provably minimizing on the change logs.
>
> - Portable since it doesn't depend on a specific underlying implementation.
>
> - Reduction of up to 45% of the change log communicated from the upper
> layers to the Microkernel.

great! however, to be fair, you're comparing 2 different change log-based
transient space implementations in the sandbox.

transient space implementations don't necessarily need to be change log-based
(see e.g. jackrabbit-core). i assume you would see little (if not no) reduction
in the number of persisted changes per save operation when comparing to the
jackrabbit-core transient space implementation. OTOH, i am not saying
that the jackrabbit-core approach is suitable when using a microkernel-based
backend and i am not questioning the usefulness of consolidated change logs
in general.

>
> - In the case of H2, reduction of up to 200% of database size.

see my previous comment

>
> Apart from the benefits, it is a plain necessity for transient space
> implementations on top of the Microkernel: without proper consolidation
> users could end up not being able to save their (valid) transient
> modifications. Consider a first user doing
>
>>/a:/t/x +/t/x/y:{} >/t/x:/b
>
> if in the meanwhile another user removes x the save operation for the first
> user would fail.

how could another user remove /t/x? it only exists in the first user's
transient space.

cheers
stefan

>
> After consolidation, above change log would look like
>
> +/a/y:{} >/a:/b
>
> and can be saved.
>
> Michael
>
>
>>
>> cheers
>> stefan
>>
>>>
>>> Michael
>>>
>>>
>>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>>> [2]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=markup
>>> [3]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=markup

Mime
View raw message