jackrabbit-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "Clustering the Microkernel" by MichaelDürig
Date Fri, 25 Nov 2011 17:50:45 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "Clustering the Microkernel" page has been changed by MichaelDürig:

New page:
This page summarizes results from various online/offline and f2f discussions regarding clustering
of the microkernel. It tries to fix some terms, sketches approaches which came up during the
discussions and lists some of the open questions. 

=== Introduction ===
Clustering is the common term for partitioning and replicating data. Partitioning means splitting
data up among different storage locations such that each data item exists at a single location
only. Partitioning enhances throughput through load balancing. Replication means a data item
is kept in multiple storage locations at the same time. Replication enhances reliability through
The [[http://en.wikipedia.org/wiki/CAP_theorem | CAP]] theorem states that it is impossible
for a distributed computer system to provide consistency, availability and partition tolerance
at the same time. (Note there is a slight clash of terminology: in the context of the CAP
theorem partition tolerance means tolerance to the network becoming partitioned.)

The clustering mechanism for the microkernel will trade consistency for availability in the
face of network partition and for reduced latency under normal operation. The goal is to implement
some form of [[http://en.wikipedia.org/wiki/Eventual_consistency | eventual consistency]].

=== Cluster topology ===
On a first level, a cluster splits the JCR tree into various partitions. On a second level
each partition may then be replicated to various cluster nodes. 

=== Partitioning ===
A challenge for all partitioning schemes is enforcing atomicity of commits across different
partitions. When atomicity is desired, some form of [[http://en.wikipedia.org/wiki/Atomic_commit
| atomic commitment protocol]] (ACP) has to be in place. The [[http://en.wikipedia.org/wiki/Two-phase_commit_protocol
| two-phase commit protocol]] seems to be standard here. Note however, that there is no ACP
which is completely tolerant against network failures (See [1] proposition 7.1 and 7.2). That
is, every ACP may cause cluster nodes to become blocked in the face of network failures. Furthermore
no ACP can guarantee independent recovery of failed clusters. 

Move operations across partitions pose another challenge: with a naive implementation a whole
subtree would have to be physically moved.

One approach for partitioning the JCR tree is splitting it up by path: each cluster node contains
the nodes pertaining to a direct child node of the root node. In this scenario the root node
resembles a set of mount points onto which the various cluster nodes are mounted. Optimally
the root node itself is stateless. An open question with this approach is how to aggregate
the individual revision histories of the partitions into a consistent revision history of
the root.

Another approach for partitioning the JCR tree is splitting by subtree: in this scheme each
cluster node contains some subtree of the entire JCR tree. The leave (JCR) nodes in each cluster
node contain pointers to the cluster nodes which contain their actual child (JCR) nodes. One
specific form of this scheme would dedicate a cluster node to the root (JCR) node and a couple
of cluster nodes for the subtrees below. This scheme very much resembles the splitting by
path scheme from above only that now the root is not state less. A disadvantage is that the
root might become a bottleneck. However, since there is a dedicated cluster node for the root,
it could be optimally tuned. On the other hand this scheme does not have the problem of aggregating
the revision histories since these are effectively stored in the root node. Furthermore this
scheme lends itself nicely to [[http://en.wikipedia.org/wiki/MapReduce | map reduce]] approaches.

=== Replication ===
Replicating the same data items to various cluster nodes poses consistency concerns. A eventually
consistent system guarantees, that after a finite time of quiescence, all cluster nodes in
a replication set are in the same state. A possible approach to combine JCR with eventually
consistent replication transparently, is to make cluster synchronization appear as normal
session operations to clients. That is, when some items on a cluster node change due to a
synchronization activity, the cluster node will generate JCR observation events for all sessions
currently open on this cluster node. 

An open question is how to best implement cluster node synchronization: how (topological)
should changes in cluster propagate to other clusters in order to ensure convergence. That
is, in order not to arrive at an unstable system which propagates changes forth and back between
cluster nodes. One approach which resembles branching and merging is proposed in [2]. However
the branch operation doesn't seem viable. Another approach which might be more flexible is
to use [[http://en.wikipedia.org/wiki/Vector_clock | vector clocks]] for establishing a happens
before relation on transactions from different cluster nodes. 

Another open question is how to synchronize conflicting transactions from different cluster
nodes. Merging can be done to a certain degree. Other cases might have to rollback transactions
on some cluster nodes. Consider for example the case where a node ''x'' is moved such that
it becomes a child of node ''y'' on one cluster node and at them same time node ''y'' is moved
such that it becomes a child of node ''x'' on another cluster node. The situation is inherently
symmetric and neither move can follow the other. On cluster synchronization there seems to
be no better way than to undo one (or both?) of the transactions which contained these move

[1] [[http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx | Concurrency Control
and Recovery in Database Systems]]. Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman.

[2] [[http://research.microsoft.com/apps/pubs/default.aspx?id=155638 | Eventually Consistent
Transactions]]. Sebastian Burckhardt, Manuel Fahndrich, Daan Leijen, and Mooly Sagiv. October

View raw message