From dev-return-34062-apmail-jackrabbit-dev-archive=jackrabbit.apache.org@jackrabbit.apache.org Mon Feb 6 13:47:13 2012
Return-Path:
X-Original-To: apmail-jackrabbit-dev-archive@www.apache.org
Delivered-To: apmail-jackrabbit-dev-archive@www.apache.org
Received: from mail.apache.org (hermes.apache.org [140.211.11.3])
by minotaur.apache.org (Postfix) with SMTP id 8C1049749
for ; Mon, 6 Feb 2012 13:47:13 +0000 (UTC)
Received: (qmail 58768 invoked by uid 500); 6 Feb 2012 13:47:13 -0000
Delivered-To: apmail-jackrabbit-dev-archive@jackrabbit.apache.org
Received: (qmail 58731 invoked by uid 500); 6 Feb 2012 13:47:12 -0000
Mailing-List: contact dev-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@jackrabbit.apache.org
Delivered-To: mailing list dev@jackrabbit.apache.org
Received: (qmail 58724 invoked by uid 99); 6 Feb 2012 13:47:12 -0000
Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136)
by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 06 Feb 2012 13:47:12 +0000
X-ASF-Spam-Status: No, hits=-0.7 required=5.0
tests=RCVD_IN_DNSWL_LOW,SPF_PASS
X-Spam-Check-By: apache.org
Received-SPF: pass (athena.apache.org: domain of stefan.guggisberg@gmail.com designates 209.85.210.170 as permitted sender)
Received: from [209.85.210.170] (HELO mail-iy0-f170.google.com) (209.85.210.170)
by apache.org (qpsmtpd/0.29) with ESMTP; Mon, 06 Feb 2012 13:47:07 +0000
Received: by iakk32 with SMTP id k32so17613482iak.1
for ; Mon, 06 Feb 2012 05:46:46 -0800 (PST)
DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed;
d=gmail.com; s=gamma;
h=mime-version:in-reply-to:references:date:message-id:subject:from:to
:content-type:content-transfer-encoding;
bh=2JfMnLMfdAT+okiMxrZtJYg3wNmJZFG0a0m+eVySemg=;
b=qVcXCIeQIdYC5R7lQtxf8ynk2TNvui6Odyg/IbY8hpMyEm7U7zoLM6Ik0aRk/tCVSO
2LmiocYr68ME2bVv3X7NizvnpMvp/ULnElelTQaCkDhNO18bGW9bNWgjx32VaX+qCs9a
fBETXFykrRFZYFuScamhHzcKNR/ev1B892gzM=
MIME-Version: 1.0
Received: by 10.50.180.233 with SMTP id dr9mr10454313igc.11.1328536006843;
Mon, 06 Feb 2012 05:46:46 -0800 (PST)
Received: by 10.42.147.197 with HTTP; Mon, 6 Feb 2012 05:46:46 -0800 (PST)
In-Reply-To: <4F2EBBA3.2080802@apache.org>
References: <4F2C2B0E.7050508@apache.org>
<4F2EBBA3.2080802@apache.org>
Date: Mon, 6 Feb 2012 14:46:46 +0100
Message-ID:
Subject: Re: [jr3 microkernel] Change log consolidation
From: Stefan Guggisberg
To: dev@jackrabbit.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable
On Sun, Feb 5, 2012 at 6:25 PM, Michael D=FCrig wrote:
>
>
> On 4.2.12 20:07, Stefan Guggisberg wrote:
>>
>> On Fri, Feb 3, 2012 at 7:44 PM, Michael D=FCrig =A0w=
rote:
>>>
>>>
>>> Hi,
>>>
>>> While working on a new transient space implementation [1] I discovered
>>> that
>>> recreating a list of changes from transient modifications to the
>>> hierarchy
>>> is like opening a can of worms. Rethinking things it turned out, that i=
t
>>> is
>>> much simpler to directly consolidate the change log. This approach is
>>> much
>>> more flexible and versatile since knowledge of the persisted hierarchy =
is
>>> not necessary at all. Thus this approach should work equally well with
>>> change logs for jr3, the Microkernel, Jackrabbit core, JCR and SPI.
>>>
>>> The algorithm is capable of creating minimal change logs from any valid
>>> change log (that is from any change log that can be applied to some
>>> hierarchy).
>>>
>>> I extensively tested the algorithm on randomly created change logs [3].
>>> This
>>> resulted in some interesting numbers: consolidation results in an
>>> decrease
>>> of about 45% of these randomly generated change logs. Furthermore the
>>> resulting size of the H2 database files *decreased up to 200%*!
>>>
>>> The implementation [2] is currently part of the jackrabbit-microkernel
>>> module in the sandbox. However it does not have any crucial dependencie=
s
>>> so
>>> it should be easy to adapt for other usages.
>>>
>>> The gory details of the algorithm are based on some algebraic propertie=
s
>>> of
>>> move operations on paths as detailed in the following paragraphs.
>>>
>>> The change log consolidation algorithm implemented in the reduce method
>>> [2]
>>> is based on algebraic properties of move operations on paths. The other
>>> operations (add node, remove node and set property) are generalized to
>>> move
>>> operations. Consolidation relies on reduction and commutation rules of
>>> move
>>> operations.
>>>
>>> A move operation resembles a map on a hierarchy (of nodes and
>>> properties). A
>>> change log consisting of k move operations m_1 to m_k is thus the
>>> composition of the individual moves: m_1 *, ..., * m_k (using * for
>>> function
>>> composition: f(g(x)) =3D (f * g)(x)).
>>>
>>> Some definitions, notation and propositions:
>>>
>>> * Let NIL denote a path which never occurs in any hierarchy.
>>>
>>> * Order on paths: let p, q be paths.
>>> =A0- p< =A0q iff p !=3D NIL and q !=3D NIL and p is an ancestor of q.
>>> =A0- p<=3D q iff p< =A0q or p =3D=3D q !=3D NIL
>>>
>>> * Conflict of paths: let p, q be paths.
>>> =A0- p ~ q (p conflicts with q) iff either p<=3D q or q<=3D p
>>>
>>> * Substitution in paths: let p, q, r be paths.
>>> =A0- [p -> =A0q]r =3D r if p is not an ancestor of r and
>>> =A0- [p -> =A0q]r =3D s where s is the path resulting from replacing
>>> =A0 =A0the ancestor p in r with q otherwise.
>>>
>>> * Let p, q be paths. Then p:q denotes a move operation where the
>>> =A0node at p is moved to a new node at q.
>>>
>>> * Valid moves: leq p, q be paths.
>>> =A0- p:q is valid iff p !~ q or p =3D q
>>> =A0- if p:q is not valid, it is invalid
>>>
>>> Invalid moves are exactly those which would result in a node being move=
d
>>> to
>>> an ancestor/descendant of itself.
>>>
>>> * Identity on moves: let p, q be paths.
>>> =A0- p:q =3D ID iff p =3D q.
>>>
>>> * Conflict on moves: let p, q, r, s be paths.
>>> =A0- p:q ~ r:s (p:q conflicts with r:s) iff either p ~ r or p ~ s
>>> =A0 =A0or q ~ r or q ~ s.
>>>
>>> * Strict commutativity of moves: let m, n be moves.
>>> =A0- m * n =3D n * m iff m !~ n
>>>
>>> * Substitutions in moves: let p, q, r, s be paths.
>>> =A0- [p -> =A0q]r:s =3D [p -> =A0q]r:[p -> =A0q]s
>>>
>>> * Let p be a path and let +p denote an add node operation and let -p
>>> =A0denote a remove node operation for a node at path p.
>>> =A0- +p =3D NIL:p That is, adding a node is represented by a move from =
a
>>> =A0 =A0unknown source.
>>> =A0- p =3D p:NIL. That is, removing a node is represented by a move to =
an
>>> =A0 =A0unknown sink.
>>>
>>>
>>> Let m =3D p:q, n =3D r:s with p, q, r, s !=3D NIL be valid moves with m=
!=3D ID
>>> and
>>> n !=3D ID. Then the following reduction and commutation rules apply:
>>>
>>> 1. =A0p!~ r: =A0m * n =3D n * m
>>> 2. =A0p< =A0r: =A0illegal (since this implies q<=3D r which implies p ~=
q and
>>> =A0 =A0 =A0 =A0 =A0 =A0thus m invalid)
>>> 3. =A0p =3D r: =A0illegal (since this implies q<=3D r which implies p ~=
q and
>>> =A0 =A0 =A0 =A0 =A0 =A0this m invalid)
>>> 4. =A0p> =A0r: =A0does not commute if q< =A0s. Otherwise m * n =3D n * =
[r -> =A0s]m
>>> 5. =A0p!~ s: =A0m * n =3D n * m
>>> 6. =A0p< =A0s: =A0illegal (since this implies p ~ q and thus m invalid)
>>> 7. =A0p =3D s: =A0does not commute
>>> 8. =A0p> =A0s: =A0illegal (since p> =A0s implies there is an s already =
which
>>> =A0 =A0 =A0 =A0 =A0 =A0will conflict with r:s)
>>> 9. =A0q!~ r: =A0m * n =3D n * m
>>> 10. q< =A0r: =A0m * n =3D [q -> =A0p]n * m
>>> 11. q =3D r: =A0m * n =3D p:s (transitivity of moves)
>>> 12. q> =A0r: =A0m * n =3D n * [r -> =A0s]m
>>> 13. q!~ s: =A0m * n =3D n * m
>>> 14. q< =A0s: =A0does not commute if p> =A0r. Otherwise m * n =3D [q -> =
=A0p]n * m
>>> 15. q =3D s: =A0illegal (since s conflicts with r:s)
>>> 16. q> =A0s: =A0illegal (since s conflicts with r:s)
>>>
>>> Allowing add node and remove node operations the following additional
>>> conditions apply:
>>>
>>> Let m =3D p:q, n =3D r:s be valid moves with m !=3D ID and n !=3D ID. T=
hen the
>>> reduction and commutations rules 1. to 16. apply with extra conditions =
on
>>> 4., 10., 12. and 14.:
>>>
>>> 4'. =A0if s =3D NIL and q =3D NIL then m * n =3D -r. Otherwise if s =3D=
NIL then
>>> =A0 =A0 m, n do not commute.
>>> 10'. illegal if p =3D NIL
>>> 12'. if s =3D NIL then m * n =3D -r * -p
>>> 14'. illegal if p =3D NIL
>>
>>
>> sounds very complicated ;) i admit that i don't
>> fully understand it (ok, i didn't try very hard
>> and gave up somewhere in the middle).
>
>
> It is actually much simpler than it looks. Have a look at the code and th=
e
> inline comments for additional hints and my explanations below.
>
>
>> could you please provide a simple example?
>
>
> What I basically did is identify all cases of how moves, adds and removes
> interact. This turned out to be much simpler when conceptually handling a=
dds
> as moves from 'nowhere' and deletes as moves to 'nowhere' (aka Trash) and
> then specialising interaction rules for adds and deletes from there.
>
> One kind of interaction is reductions like
>
>>/a:/b, >/b:/c =3D >/a:/c
> or
> -/a, -/a/b =3D -/a
> ...
>
> The other kind of interaction is commutation like
>
>>/a:/b, >/c:/d =3D >/c:/d, >/a:/b
> or
>> /a:/x, +/x/y:{} =3D +/a/y:{}, >/a:/x
> ...
>
> With a complete list of such interactions (see 1. - 16. in my initial pos=
t)
> it is possible to reduce a change log to its minimal equivalent by
> subsequent application of reductions and commutations: Given a reduced
> change list and a new operation inserted at index k, try to reduce it wit=
h
> the operation at k - 1. If this fails commute it with the operation at k =
- 1
> and repeat at k - 1 until either a reduction has been found or commutatio=
n
> fails. In the latter case do the same going forward (to k + 1). If still =
no
> reduction could be found, we are done and the change list is minimized.
> Otherwise, recursively apply the reduction algorithm treating the result =
of
> the reduction as the new operation inserted into the rest of the change
> list.
> This is exactly what is implemented in reduce(List, int index)=
.
> See
> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/sr=
c/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=3Dmarkup
>
> Example:
>>/a:/x +/x/y:{} >/x:/b =3D =A0(commutation of 1 and 2, rule 14.)
> +/a/y:{} >/a:/x >/x:/b =3D =A0(reduction of 2 and 3, rule 11.)
> +/a/y:{} >/a:/b
>
> Some key observations which make reduce() work:
> - The interaction rules do not depend on the persisted hierarchy: If a
> change log can be applied to a hierarchy, its minimized equivalent can be
> applied to the same hierarchy. More over the observable effect on that
> hierarchy will be the same.
>
> - The reduce algorithm is minimizing since whether a reduction can occur =
or
> not does not depend on 'where it occurs'. Above example could as well red=
uce
> as follows:
>
>>/a:/x +/x/y:{} >/x:/b =3D =A0(commutation of 2 and 3, rule 12.)
>>/a:/x >/x:/b +/b/y =3D =A0 =A0 (reduction of 1 and 2, rule 11.)
>>/a:/b +/b/y:{}
>
>
>> what is the benefit of this approach compared
>> to the imlementation in jackrabbit-core
>> (which does consolidate transient changes)?
>
>
> - Versatility: it is not bound to any implementation, nor does it need th=
e
> actual hierarchy (i.e. it is stateless). It is basically a realisation of=
a
> map from change logs to change logs.
ok, agreed.
>
> - Self contained: everything in one single quite small class. No
> dependencies, no clutter.
>
> - Much simpler.
erm, i beg to differ ... ;)
>
> - Provably sound and complete.
>
> - Provably minimizing on the change logs.
>
> - Portable since it doesn't depend on a specific underlying implementatio=
n.
>
> - Reduction of up to 45% of the change log communicated from the upper
> layers to the Microkernel.
great! however, to be fair, you're comparing 2 different change log-based
transient space implementations in the sandbox.
transient space implementations don't necessarily need to be change log-bas=
ed
(see e.g. jackrabbit-core). i assume you would see little (if not no) reduc=
tion
in the number of persisted changes per save operation when comparing to the
jackrabbit-core transient space implementation. OTOH, i am not saying
that the jackrabbit-core approach is suitable when using a microkernel-base=
d
backend and i am not questioning the usefulness of consolidated change logs
in general.
>
> - In the case of H2, reduction of up to 200% of database size.
see my previous comment
>
> Apart from the benefits, it is a plain necessity for transient space
> implementations on top of the Microkernel: without proper consolidation
> users could end up not being able to save their (valid) transient
> modifications. Consider a first user doing
>
>>/a:/t/x +/t/x/y:{} >/t/x:/b
>
> if in the meanwhile another user removes x the save operation for the fir=
st
> user would fail.
how could another user remove /t/x? it only exists in the first user's
transient space.
cheers
stefan
>
> After consolidation, above change log would look like
>
> +/a/y:{} >/a:/b
>
> and can be saved.
>
> Michael
>
>
>>
>> cheers
>> stefan
>>
>>>
>>> Michael
>>>
>>>
>>> [1] http://markmail.org/message/qkkcvtmtapas2cx4
>>> [2]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/=
src/main/java/org/apache/jackrabbit/state/ChangeLog.java?view=3Dmarkup
>>> [3]
>>>
>>> http://svn.apache.org/viewvc/jackrabbit/sandbox/jackrabbit-microkernel/=
src/test/java/org/apache/jackrabbit/state/ChangeLogFuzzTest.java?view=3Dmar=
kup