Mailing-List: contact dev-help@jackrabbit.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@jackrabbit.apache.org
Received-SPF: pass (athena.apache.org: domain of mueller@adobe.com designates
 64.18.1.21 as permitted sender)
From: Thomas Mueller <mueller@adobe.com>
To: "dev@jackrabbit.apache.org" <dev@jackrabbit.apache.org>
Date: Wed, 29 Feb 2012 07:55:19 +0000
Subject: Re: [jr3] Tree model
Thread-Topic: [jr3] Tree model
Thread-Index: Acz2t32A8YULW7ZESj+/7qdsr8YV3Q==
Message-ID: <CB7394C9.26755%mueller@adobe.com>
In-Reply-To: <4F4D79B2.1050009@apache.org>
Accept-Language: en-US
Content-Language: en-US
user-agent: Microsoft-MacOutlook/14.14.0.111121
acceptlanguage: en-US
Content-Type: text/plain; charset="us-ascii"
Content-Transfer-Encoding: quoted-printable
MIME-Version: 1.0

Hi,

I think we all have more or less the same basic idea. Well, I would call
it "Node" and not "Tree" because the term node more closely matches what
we do. But first about what we seem to agree:

* A persistent and immutable data structure has many benefits
  (for example concurrency).

* Mutable data strucutures also have benefits:
  it avoids creating lots of garbage and a new root node
  for each and every modification.

I think we should use a structure that is immutable when reading, but
mutable when writing. Write methods should return a new object (for
example cloneAndAddChildNode), but only when necessary, that is, if the
revision doesn't match. When the revision matches, the internal state is
changed instead, and the cloneAndAddChildNode method returns 'this'. A
first implementation is org.apache.jackrabbit.mk.simple.NodeImpl.

> everyone has their own
> MicroKernel implementation.

The reasons why we have two implementations is, Stefan and I could not
agree on a few implementation details. My view is:

* Stefan assumes it's important to use the content hash as the node id,
while I think this hurts performance too much (as the randomly distributed
node ids hurt storage performance in Jackrabbit 2). With the copy-on-write
model it is even a bigger performance problem because each time a node is
changed it gets a new node id (I didn't think about this first). That
means even small repositories are problematic if there are many changes.

* Stefan implemented a different way to represent large child node lists.
He used a h-tree (the hash code of the child node name), which means child
nodes are pseudo-randomly distributed. I think this hurts performance too
much (same reason as above: a stored index on randomly distributed keys is
required).

 =20
* Stefan believes concurrent write operations are very important while I
believe this will not increase write throughput as much as it complicates
the implementation (Stefan implementations merges changes while my
implementation does not).

 =20
* Stefans implementation re-constructs the journal by diffing the node
tree. I believe this is troublesome because it is not always possible to
re-construct the journal with regards to re-order and move operations.
Also I believe this adds more complexity that the possible benefit
(because the journal doesn't need to be stored). In my implementation the
journal is stored.=20

There is quite a lot of common code, for example the data store, parsing
and constructing JSON / JSOP. My implementation consists of the classes in
org.apache.jackrabbit.mk.simple. The classes NodeImpl, NodeList,
NodeListSmall are used again in multiple different modules (the index and
the wrapper implementations for example). So what is really unique in my
"storage" implementation is the 3 NodeMap implementations, NodeListTrie to
support large child node lists, the Revision class, and SimpleKernelImpl.

Regards,
Thomas