jackrabbit-dev mailing list archives

From Thomas Mueller <muel...@adobe.com>
Subject Re: [jr3] Tree model
Date Wed, 29 Feb 2012 15:59:56 GMT

While I agree that MVCC and clustering are important, I came to the
conclusion that they do not require a content-addressable storage.

>svn and git.

My implementation is modeled after relational and NoSQL databases.
Databases are optimized for fine grained content (rows), which I believe
matches quite well with what we do on a node level. An important exception
is binary data, where we use a content addressable storage (the data
store), which I think is appropriate.

>my assumption was that our clustering implementation could leverage the
>content-addressed model

While I agree the content hash can be used to efficiently sync remote
(sub)trees, I believe it is not required to use the content hash as the
node id. Instead, the content hash can be stored as a property, or as a
*part* of the node id (not the only part of the node id).
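A minimal sketch of this idea (names and the id layout are hypothetical, not an actual Jackrabbit API): a node id that combines a stable sequence number with a content hash, so the hash is only a *part* of the id and the node keeps a stable identity when its content changes.

```java
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

// Hypothetical sketch: a node id made of a stable sequence number plus a
// content hash. Only the hash part changes when the content changes, so
// the id is not purely content-addressed.
public class NodeId {
    final long sequence;      // stable, assigned once per node
    final String contentHash; // derived from the node's content

    NodeId(long sequence, byte[] content) {
        this.sequence = sequence;
        this.contentHash = sha1Hex(content);
    }

    static String sha1Hex(byte[] data) {
        try {
            MessageDigest md = MessageDigest.getInstance("SHA-1");
            StringBuilder sb = new StringBuilder();
            for (byte b : md.digest(data)) {
                sb.append(String.format("%02x", b));
            }
            return sb.toString();
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException(e);
        }
    }

    @Override
    public String toString() {
        return sequence + ":" + contentHash;
    }

    public static void main(String[] args) {
        NodeId a = new NodeId(42, "hello".getBytes(StandardCharsets.UTF_8));
        NodeId b = new NodeId(42, "world".getBytes(StandardCharsets.UTF_8));
        // Same node (same sequence part), different content (different
        // hash part): the node stays identifiable across content changes.
        System.out.println(a.sequence == b.sequence);            // true
        System.out.println(a.contentHash.equals(b.contentHash)); // false
    }
}
```

The same effect could be had by storing the hash as an ordinary property instead of embedding it in the id; the point is only that the hash need not be the address.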

>as long as we don't have a clear idea of how to support clustering i am
>reluctant to already give up on the content-addressable model.

I will not ask you to give up on your model, but please don't ask me to
give up my model :-)

My view on clustering: I believe we should have a look at how other
solutions work, especially NoSQL databases. So far I am not aware of a
NoSQL database that uses a content-addressable storage (unless you view
git and svn as NoSQL databases). I believe we should build clustering
on two mechanisms:

* Virtual repository to distribute (shard) the data. Please note the
content hash will not help here.

* For data that is stored in multiple repositories, use a synchronization
mechanism. This can be achieved using the journal, or using the content
hash, or both.

Neither mechanism requires a content-addressable storage.
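To illustrate the first mechanism, here is a minimal, hypothetical sketch of a "virtual repository" that shards by path. Routing depends on where a node lives in the tree, which is exactly the information a content hash does not carry.

```java
import java.util.List;

// Hypothetical sketch: shard nodes across cluster members by path.
// Routing by the top-level path segment keeps whole subtrees together;
// a content hash tells you nothing about where a node belongs.
public class PathSharding {

    static int shardFor(String path, int shardCount) {
        // Paths look like "/content/a"; segment [1] is the top level.
        String[] segments = path.split("/");
        String top = segments.length > 1 ? segments[1] : "";
        // abs() after the modulo avoids the Integer.MIN_VALUE corner case.
        return Math.abs(top.hashCode() % shardCount);
    }

    public static void main(String[] args) {
        List<String> paths = List.of("/content/a", "/content/b", "/etc/x");
        for (String p : paths) {
            System.out.println(p + " -> shard " + shardFor(p, 3));
        }
        // "/content/a" and "/content/b" land on the same shard, since
        // routing only looks at the top-level segment.
    }
}
```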

>supporting flat hierarchies.
>since we can't assume that child node names do follow a specific
>pattern (e.g. n1-n99999999) i don't follow your performance-related

If people care about performance, they will use naming patterns; if
performance matters, patterns are required. After experimenting with
Bloom filters, I came to the conclusion that there is simply no efficient
way to index randomly distributed data on disk.
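As a concrete (hypothetical) illustration of why patterns help: if child names follow a scheme like n1..n99999999, a flat node can be split into fixed-size buckets and a child located by arithmetic alone, with no index over randomly distributed names.

```java
// Hypothetical sketch: locating a child of a flat node by arithmetic,
// assuming names follow the pattern "n<number>" (e.g. n1..n99999999).
// With random names, a lookup structure over all names would be needed;
// with a pattern, the bucket is computed directly.
public class BucketedChildren {
    static final int BUCKET_SIZE = 1000; // children per bucket (assumed)

    static String bucketFor(String name) {
        long k = Long.parseLong(name.substring(1)); // strips the "n" prefix
        long bucket = (k - 1) / BUCKET_SIZE;        // n1..n1000 -> bucket-0
        return "bucket-" + bucket;
    }

    public static void main(String[] args) {
        System.out.println(bucketFor("n1"));     // bucket-0
        System.out.println(bucketFor("n1000"));  // bucket-0
        System.out.println(bucketFor("n1001"));  // bucket-1
        System.out.println(bucketFor("n12345")); // bucket-12
    }
}
```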

>i've considered storing the diff of a commit in the revision a while ago.
>while it would be relatively easy to implement i currently don't see an
>immediate benefit compared to diffing which OTOH is very efficient
>thanks to the content-addressable model.

While efficient diffing requires a content hash, it does not require a
content addressable storage.
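A minimal sketch of that distinction (the Node class and hash layout are hypothetical): the diff walks two trees and prunes any subtree whose stored hashes are equal. The hash is just data carried on the node; nodes are not addressed by it.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Objects;

// Hypothetical sketch: diff two trees using a content hash stored on
// each node (e.g. as a property). Equal hashes let us skip whole
// unchanged subtrees without a content-addressable storage.
public class TreeDiff {
    static class Node {
        final String hash; // stored hash covering this node's subtree
        final Map<String, Node> children = new LinkedHashMap<>();
        Node(String hash) { this.hash = hash; }
    }

    static void diff(String path, Node a, Node b, List<String> changed) {
        if (a != null && b != null && Objects.equals(a.hash, b.hash)) {
            return; // identical subtree: prune without descending
        }
        changed.add(path.isEmpty() ? "/" : path);
        Map<String, Node> names = new LinkedHashMap<>();
        if (a != null) names.putAll(a.children);
        if (b != null) names.putAll(b.children);
        for (String name : names.keySet()) {
            diff(path + "/" + name,
                 a == null ? null : a.children.get(name),
                 b == null ? null : b.children.get(name),
                 changed);
        }
    }

    public static void main(String[] args) {
        Node r1 = new Node("h-root-1");
        r1.children.put("app", new Node("h-app"));
        r1.children.put("docs", new Node("h-docs-1"));

        Node r2 = new Node("h-root-2");
        r2.children.put("app", new Node("h-app"));     // unchanged, pruned
        r2.children.put("docs", new Node("h-docs-2")); // changed

        List<String> changed = new ArrayList<>();
        diff("", r1, r2, changed);
        System.out.println(changed); // [/, /docs]
    }
}
```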

The main advantage of storing the commit is that the implementation is
simpler. Diffing has advantages and disadvantages: the advantage is that
the journal doesn't need to be stored on disk; the disadvantages are that
the reconstructed journal is not always identical to the original, and
that the implementation is more complex.

>i imagine that in a clustering scenario there's a need to compute
>the changes between 2 non-consecutive revisions, potentially
>spanning a large number of intermediate revisions. just diffing
>2 revisions is IMO probably more efficient than reconstructing
>the changes from a large number of revisions.

That is true. I think it would be an advantage to implement the diffing
at a higher level, that is, above the MicroKernel API. That would also
allow synchronizing repositories through the MicroKernel API.

