jackrabbit-oak-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stefan Guggisberg <stefan.guggisb...@gmail.com>
Subject Re: Question: wide hierachies
Date Wed, 17 Oct 2012 15:46:09 GMT
hi ian,

On Sun, Oct 14, 2012 at 2:49 AM, Ian Boston <ieb@tfd.co.uk> wrote:
> Hi,
> IIUC Oak has the potential to handle wide hierarchies where there are
> millions of child nodes to a parent and hence support some of the use
> cases commonly found in Social Media and cloud applications. This
> question is about the implementation to support that.
> IIUC from reading the code child nodes are accessed from the parent
> node, and to avoid the situation in JR2 where child nodes were
> represented as an array of objects stored with the parent node, child
> nodes in Oak are represented in a internal tree structure where the
> width of the tree is controlled to ensure concurrency and performance
> is maintained. (did I understand that correctly ?).

in JR2 the list of child node entries (name-id pairs) is stored within the
parent node. while this is very efficient for small to medium sized child
node lists (up to a couple of k entries) read and write performance suffers
when the list grows very large.

in the current MicroKernel implementation we use an htree based intermediate
tree structure for storing large child node collections.

> In choosing this approach, was a possible alternative considered where
> the child node contains a single pointer to its parent, and the child
> node is retrieved by hashing it's path rather ie  hash-f(path) ===
> child-node-storage-id, (eg sha1(path) == child-node-storage-id )
> rather than navigating from the parent.

the data model of the current MicroKernel implementation is essentially
just a DAG of node objects. the versioning model is git-like, i.e.
every change results in a new immutable tree.

i don't see how child->parent relationships could be implemented with
this model. IIUC with path-based hashing you'd sacrifice MVCC
or am i missing something?

> In my exposure to use cases for social media and cloud like systems I
> have observed several things. Listing child nodes from a parent
> becomes rarer the closer the application moves towards social media.
> Direct pointer based access derived from a hierarchical url is the
> main use case. For non hierarchal pointer, searching becomes dominant.
> Occasionally listing is required but only as a variant of search.
> Scans or sorted scans as nearly always avoided as being impractical.
> In social media content, the cardinality of all parents is low making
> finding children of parents amenable to an inverted index approach. Ie
> the parent becomes a search keyword term.
> (the closer to Enterprise Content Management the application is the
> inverse of the above is true)
> If I understood all of that correctly, my question:
> In circumstances where the use cases are more to the social media end
> of the spectrum, will it be possible to invert the child-parent
> relationship within Oak to be a pointer from the child with an
> optional, additional inverted index should finding children of a
> specific parent be required ?

WRT the MicroKernel implementation:

IMO that's unfortunately not possible with the current git-like
approach (content-based identifiers, DAG of immutable versions).


> Apologies if this is already covered and I missed it. The archives at
> Markmail didn't turn anything up.
> Thanks
> Ian

View raw message