From: Jukka Zitting <jukka.zitting@gmail.com>
Date: Tue, 20 Nov 2012 18:24:32 +0200
Message-ID: 
 <CAOFYJNbQtGKss6PFFJuT+H2KU5fHXut7J37o0rey9paT32CH+g@mail.gmail.com>
Subject: Identifier- or hash-based access in the MicroKernel
To: Oak devs <oak-dev@jackrabbit.apache.org>
Content-Type: text/plain; charset=ISO-8859-1

Hi,

A lot of functionality in Oak (node states, the diff and hook
mechanisms, etc.) is based on walking down the tree hierarchy one
level at a time. To do this, for example to access changes below
/a/b/c, oak-core will currently request paths /a, /a/b, /a/b/c and so
on from the underlying MK implementation.

This works reasonably well with MK implementations that are
essentially big hash tables that map the full path (and revision) to
the content at that location. Even then there's some space overhead as
even tiny nodes (think of an ACL entry) get paired with the full path
(and revision) of the node. The current MongoMK with its path keys
works like this, though even there a secondary index is needed for the
path lookups.

The approach is less ideal for MK implementations (like the default
H2-based one) that have to traverse the path when some content is
accessed. For example, with the above oak-core access pattern, the
sequence of accessed nodes would be [ a, a, b, a, b, c ], where
ideally just [ a, b, c ] would suffice. The KernelNodeStore cache in
oak-core prevents this from being too big an issue, but ideally we'd
be able to avoid such extra levels of caching.
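
To make the access pattern concrete, here is a rough sketch that counts
the nodes a path-traversing MK would visit. The class and method names
are hypothetical and not part of oak-core or any MK implementation; the
root node is left out of the count, as in the sequence above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration; not part of oak-core or any MK implementation.
class PathTraversalCost {

    // Nodes a traversing MK visits to resolve one absolute path,
    // e.g. "/a/b" -> [a, b] (the root itself is not counted).
    static List<String> resolve(String path) {
        List<String> visited = new ArrayList<>();
        for (String name : path.split("/")) {
            if (!name.isEmpty()) {
                visited.add(name);
            }
        }
        return visited;
    }

    // Nodes visited when the client walks down one level at a time,
    // requesting every ancestor path of the target in turn.
    static List<String> walkDown(String target) {
        List<String> visited = new ArrayList<>();
        StringBuilder prefix = new StringBuilder();
        for (String name : resolve(target)) {
            prefix.append('/').append(name);
            visited.addAll(resolve(prefix.toString()));
        }
        return visited;
    }

    public static void main(String[] args) {
        System.out.println(walkDown("/a/b/c")); // [a, a, b, a, b, c]
        System.out.println(resolve("/a/b/c"));  // [a, b, c]
    }
}
```

The number of visited nodes grows quadratically with the depth of the
target when every level is requested by full path.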

To solve that mismatch without impacting the overall architecture too
much I'd like to propose the following:

* When requested using the filter argument, the getNodes() call may
(but is not required to) return special ":hash" or ":id" properties as
parts of the (possibly otherwise empty) child node objects included in
the JSON response.

* When returned by getNodes(), those values can be used by the client
instead of the normal path argument when requesting the content of
such child nodes using other getNodes() calls. The MK implementation
is expected to automatically detect whether a given string argument is
a path, a hash or an identifier, possibly as simply as looking at
whether it starts with a slash.

* Both ":hash" and ":id" values are expected to uniquely identify a
specific immutable state of a node. The only difference is that the
inequality of two hashes implies the inequality of the referenced
nodes (which can be used by oak-core to optimize some operations),
whereas it's possible for two different ids to refer to nodes with the
exact same content.
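
The last two points could be sketched roughly as follows. All names
here are hypothetical and not part of the MicroKernel API, and the
sketch assumes that hashes and identifiers never start with a slash.
How an implementation tells a hash from an identifier is left open.

```java
// Hypothetical sketch; these names are not part of the MicroKernel API.
class NodeReference {

    enum Kind { PATH, HASH_OR_ID }

    enum Comparison { SAME, DIFFERENT, UNKNOWN }

    // Paths are absolute, so a leading slash tells them apart from
    // hashes and identifiers (assuming those never start with a slash).
    static Kind kindOf(String ref) {
        return ref.startsWith("/") ? Kind.PATH : Kind.HASH_OR_ID;
    }

    // Hashes: equal hashes mean the same node state, and different
    // hashes mean different node states.
    static Comparison compareHashes(String h1, String h2) {
        return h1.equals(h2) ? Comparison.SAME : Comparison.DIFFERENT;
    }

    // Ids: equal ids mean the same node state, but different ids may
    // still refer to nodes with the exact same content.
    static Comparison compareIds(String id1, String id2) {
        return id1.equals(id2) ? Comparison.SAME : Comparison.UNKNOWN;
    }

    public static void main(String[] args) {
        System.out.println(kindOf("/a/b/c"));        // PATH
        System.out.println(kindOf("x"));             // HASH_OR_ID
        System.out.println(compareHashes("p", "q")); // DIFFERENT
        System.out.println(compareIds("x", "y"));    // UNKNOWN
    }
}
```

With ids, a client that sees two different values would still have to
compare the content; with hashes it could skip that comparison.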

Such a solution would allow the following sequence

   getNodes("/") => { "a": {} }
   getNodes("/a") => { "b": {} }
   getNodes("/a/b") => { "c": {} }
   getNodes("/a/b/c") => {}

to become something like

   getNodes("/") => { "a": { ":id": "x" } }
   getNodes("x") => { "b": { ":id": "y" } }
   getNodes("y") => { "c": { ":id": "z" } }
   getNodes("z") => {}

with x, y and z being some implementation-specific identifiers, like
ObjectIDs in MongoDB.
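
A toy model of the two sequences, just to illustrate the saving. This
is not a real MicroKernel: the ToyKernel class, the root id "r" and the
lookup counter are assumptions made for the example, and responses are
maps from child name to the child's ":id" value instead of JSON.

```java
import java.util.HashMap;
import java.util.Map;

// Toy model, not a real MicroKernel: a response maps each child name
// to the child's ":id" value instead of being a JSON string.
class ToyKernel {

    // id -> (child name -> child id); "r" is a made-up id for the root.
    private final Map<String, Map<String, String>> nodesById = new HashMap<>();

    // Number of non-root node lookups performed so far.
    int lookups = 0;

    ToyKernel() {
        nodesById.put("r", Map.of("a", "x"));
        nodesById.put("x", Map.of("b", "y"));
        nodesById.put("y", Map.of("c", "z"));
        nodesById.put("z", Map.of());
    }

    // Accepts either an absolute path or an id returned by an earlier call.
    Map<String, String> getNodes(String ref) {
        if (!ref.startsWith("/")) {
            Map<String, String> node = nodesById.get(ref);
            if (node == null) {
                throw new IllegalArgumentException("unknown id: " + ref);
            }
            lookups++; // direct access: one lookup
            return node;
        }
        String id = "r";
        for (String name : ref.split("/")) {
            if (name.isEmpty()) {
                continue;
            }
            id = nodesById.get(id).get(name);
            if (id == null) {
                throw new IllegalArgumentException("no such path: " + ref);
            }
            lookups++; // path access: one lookup per path segment
        }
        return nodesById.get(id);
    }

    // Walks down the first child at each level and returns the lookup count.
    static int walk(ToyKernel mk, boolean useIds) {
        String path = "";
        Map<String, String> children = mk.getNodes("/");
        while (!children.isEmpty()) {
            Map.Entry<String, String> child = children.entrySet().iterator().next();
            path = path + "/" + child.getKey();
            children = mk.getNodes(useIds ? child.getValue() : path);
        }
        return mk.lookups;
    }

    public static void main(String[] args) {
        System.out.println(walk(new ToyKernel(), false)); // 6
        System.out.println(walk(new ToyKernel(), true));  // 3
    }
}
```

The path-based walk costs six lookups and the id-based walk three,
matching [ a, a, b, a, b, c ] versus [ a, b, c ].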

In any case the MK implementation would still be required to support
access by full path.

BR,

Jukka Zitting