jackrabbit-commits mailing list archives

From Apache Wiki <wikidi...@apache.org>
Subject [Jackrabbit Wiki] Update of "Clustering the Microkernel" by MarcelReutegger
Date Mon, 05 Mar 2012 14:51:03 GMT
Dear Wiki user,

You have subscribed to a wiki page or wiki category on "Jackrabbit Wiki" for change notification.

The "Clustering the Microkernel" page has been changed by MarcelReutegger:
http://wiki.apache.org/jackrabbit/Clustering%20the%20Microkernel?action=diff&rev1=3&rev2=4

This page summarizes results from various online/offline and face-to-face discussions regarding clustering of the microkernel. It tries to pin down some terminology, sketches approaches which came up during the discussions, and lists some of the open questions.
  
<<TableOfContents>>
  
=== Introduction ===
Clustering is the common term for partitioning and replicating data. Partitioning means splitting data up among different storage locations such that each data item exists at a single location only. Partitioning enhances throughput through load balancing. Replication means keeping a data item in multiple storage locations at the same time. Replication enhances reliability through redundancy.
  
The [[http://en.wikipedia.org/wiki/CAP_theorem|CAP]] theorem states that it is impossible for a distributed computer system to simultaneously provide consistency, availability, and partition tolerance. (Note there is a slight clash of terminology: in the context of the CAP theorem, partition tolerance means tolerance to the network becoming partitioned.)

The clustering mechanism for the microkernel will trade consistency for availability in the face of network partitions and for reduced latency under normal operation. The goal is to implement some form of [[http://en.wikipedia.org/wiki/Eventual_consistency|eventual consistency]].
  
=== Cluster topology ===
On a first level, a cluster splits the JCR tree into various partitions. On a second level, each partition may then be replicated to various cluster nodes.
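
A minimal sketch of this two-level structure as a plain data structure (all names are illustrative, not existing microkernel code):

{{{
import java.util.List;
import java.util.Map;

// First level: the JCR tree is split into partitions.
// Second level: each partition is replicated to a set of cluster nodes.
class ClusterTopology {
    // e.g. "/content" -> ["node-1", "node-2"], "/etc" -> ["node-3"]
    private final Map<String, List<String>> replicasByPartition;

    ClusterTopology(Map<String, List<String>> replicasByPartition) {
        this.replicasByPartition = replicasByPartition;
    }

    List<String> replicasOf(String partition) {
        return replicasByPartition.get(partition);
    }
}
}}}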
  
=== Partitioning ===
A challenge for all partitioning schemes is enforcing atomicity of commits that span different partitions. When atomicity is desired, some form of [[http://en.wikipedia.org/wiki/Atomic_commit|atomic commitment protocol]] (ACP) has to be in place. The [[http://en.wikipedia.org/wiki/Two-phase_commit_protocol|two-phase commit protocol]] seems to be the standard here. Note, however, that there is no ACP which is completely tolerant of network failures (see [1], propositions 7.1 and 7.2). That is, every ACP may cause cluster nodes to become blocked in the face of network failures. Furthermore, no ACP can guarantee independent recovery of failed cluster nodes.
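
A minimal sketch of the two-phase commit idea (hypothetical interfaces, not part of the microkernel; error and timeout handling omitted):

{{{
import java.util.List;

// One participant per partition touched by the commit.
interface Participant {
    boolean prepare(String txId); // phase 1: vote on the outcome
    void commit(String txId);     // phase 2: make the changes durable
    void abort(String txId);      // phase 2: discard the changes
}

class TwoPhaseCommitCoordinator {
    boolean execute(String txId, List<Participant> participants) {
        // Phase 1: collect votes from all participants.
        for (Participant p : participants) {
            if (!p.prepare(txId)) {
                // At least one partition voted "no": abort everywhere.
                for (Participant q : participants) {
                    q.abort(txId);
                }
                return false;
            }
        }
        // Phase 2: unanimous "yes", so commit everywhere. A participant
        // that voted "yes" and then loses contact with the coordinator
        // is blocked until it learns the outcome (the blocking noted above).
        for (Participant p : participants) {
            p.commit(txId);
        }
        return true;
    }
}
}}}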
  
Move operations across partitions pose another challenge: with a naive implementation, a whole subtree would have to be physically moved.
  
One approach for partitioning the JCR tree is splitting it up by path: each cluster node contains the subtree rooted at a direct child node of the root node. In this scenario the root node resembles a set of mount points onto which the various cluster nodes are mounted. Ideally the root node itself is stateless. An open question with this approach is how to aggregate the individual revision histories of the partitions into a consistent revision history of the root.
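
A minimal sketch of the mount-point idea, routing requests by the first path segment (the resolver and its names are hypothetical):

{{{
import java.util.Map;

// The first path segment selects the cluster node that stores the
// corresponding subtree; the root acts as a stateless set of mounts.
class MountResolver {
    // e.g. "content" -> "node-1", "etc" -> "node-2"
    private final Map<String, String> mounts;

    MountResolver(Map<String, String> mounts) {
        this.mounts = mounts;
    }

    String clusterNodeFor(String absPath) {
        if (absPath.equals("/")) {
            throw new IllegalArgumentException("the root is not mounted");
        }
        String firstSegment = absPath.substring(1).split("/")[0];
        return mounts.get(firstSegment);
    }
}
}}}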
  
Another approach for partitioning the JCR tree is splitting by subtree: in this scheme each cluster node contains some subtree of the entire JCR tree. The leaf (JCR) nodes in each cluster node contain pointers to the cluster nodes which contain their actual child (JCR) nodes. One specific form of this scheme would dedicate a cluster node to the root (JCR) node and a couple of cluster nodes for the subtrees below. This scheme very much resembles the splitting-by-path scheme from above, except that the root is no longer stateless. A disadvantage is that the root might become a bottleneck. However, since there is a dedicated cluster node for the root, it could be optimally tuned. On the other hand, this scheme does not have the problem of aggregating the revision histories, since these are effectively stored in the root node. Furthermore, this scheme lends itself nicely to [[http://en.wikipedia.org/wiki/MapReduce|map reduce]] approaches.
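
A sketch of how a path could be resolved in this scheme by following leaf pointers from the dedicated root cluster node downwards (all types and names are illustrative):

{{{
import java.util.Map;

// State of a (JCR) node as stored in a cluster node. On a leaf of the
// local subtree, remoteClusterId points to the cluster node that
// stores this node's children.
class NodeRecord {
    String remoteClusterId; // null unless this is a boundary node
}

interface ClusterNode {
    NodeRecord lookup(String absPath);
}

class SubtreeResolver {
    private final Map<String, ClusterNode> cluster;

    SubtreeResolver(Map<String, ClusterNode> cluster) {
        this.cluster = cluster;
    }

    NodeRecord resolve(String absPath) {
        ClusterNode current = cluster.get("root"); // dedicated root node
        NodeRecord record = current.lookup("/");
        if (absPath.equals("/")) {
            return record;
        }
        String resolved = "";
        for (String segment : absPath.substring(1).split("/")) {
            if (record.remoteClusterId != null) {
                // Crossed a partition boundary: hop to the next cluster node.
                current = cluster.get(record.remoteClusterId);
            }
            resolved = resolved + "/" + segment;
            record = current.lookup(resolved);
        }
        return record;
    }
}
}}}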
  
=== Replication ===
Replicating the same data items to various cluster nodes poses consistency concerns. An eventually consistent system guarantees that after a finite period of quiescence, all cluster nodes in a replication set are in the same state. A possible approach to combine JCR with eventually consistent replication transparently is to make cluster synchronization appear as normal session operations to clients. That is, when some items on a cluster node change due to a synchronization activity, the cluster node will generate JCR observation events for all sessions currently open on this cluster node.
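
From a client's point of view, changes arriving through cluster synchronization would then be indistinguishable from changes made by another local session. A plain listener registered through the standard JCR observation API would receive both kinds of events:

{{{
import javax.jcr.Session;
import javax.jcr.observation.Event;
import javax.jcr.observation.EventIterator;
import javax.jcr.observation.EventListener;
import javax.jcr.observation.ObservationManager;

class SyncAwareClient {

    void register(Session session) throws Exception {
        ObservationManager om =
                session.getWorkspace().getObservationManager();
        EventListener listener = new EventListener() {
            public void onEvent(EventIterator events) {
                while (events.hasNext()) {
                    // Events caused by cluster synchronization arrive
                    // here just like events from other local sessions.
                    Event event = events.nextEvent();
                    // handle event.getType(), event.getPath(), ...
                }
            }
        };
        om.addEventListener(listener,
                Event.NODE_ADDED | Event.NODE_REMOVED | Event.PROPERTY_CHANGED,
                "/", true,  // whole workspace, deep
                null, null, // no uuid or node type filter
                false);     // also deliver this session's own changes
    }
}
}}}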
  
An open question is how best to implement cluster node synchronization: along which topology should changes on one cluster node propagate to the other cluster nodes in order to ensure convergence, that is, in order not to arrive at an unstable system which propagates changes back and forth between cluster nodes. One approach, which resembles branching and merging, is proposed in [2]. However, the branch operation doesn't seem viable. Another approach, which might be more flexible, is to use [[http://en.wikipedia.org/wiki/Vector_clock|vector clocks]] for establishing a happens-before relation on transactions from different cluster nodes.
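
A minimal vector clock sketch, assuming every cluster node has a stable identifier (illustrative only, not existing microkernel code):

{{{
import java.util.HashMap;
import java.util.Map;

// One logical counter per cluster node identifier.
class VectorClock {

    private final Map<String, Long> ticks = new HashMap<String, Long>();

    // Called when the local cluster node commits a transaction.
    void tick(String localNodeId) {
        Long t = ticks.get(localNodeId);
        ticks.put(localNodeId, t == null ? 1L : t + 1L);
    }

    // Called when changes from another cluster node are applied.
    void merge(VectorClock other) {
        for (Map.Entry<String, Long> e : other.ticks.entrySet()) {
            Long mine = ticks.get(e.getKey());
            if (mine == null || mine < e.getValue()) {
                ticks.put(e.getKey(), e.getValue());
            }
        }
    }

    // true iff this clock happened before 'other': every counter is <=
    // the corresponding counter in 'other' and the clocks are not equal.
    boolean happenedBefore(VectorClock other) {
        for (Map.Entry<String, Long> e : ticks.entrySet()) {
            Long theirs = other.ticks.get(e.getKey());
            if (theirs == null || theirs < e.getValue()) {
                return false;
            }
        }
        return !ticks.equals(other.ticks);
    }
}
}}}

If neither of two clocks happened before the other, the corresponding transactions are concurrent, which leads to the conflict handling discussed below.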
  
Another open question is how to synchronize conflicting transactions from different cluster nodes. Merging can be done to a certain degree. Other cases might require rolling back transactions on some cluster nodes. Consider, for example, the case where a node ''x'' is moved such that it becomes a child of node ''y'' on one cluster node, and at the same time node ''y'' is moved such that it becomes a child of node ''x'' on another cluster node. The situation is inherently symmetric and neither move can follow the other. On cluster synchronization there seems to be no better way than to undo one (or both?) of the transactions which contained these move operations.
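
The conflict can be detected mechanically: applying both moves to the same parent map makes a node its own ancestor. A small sketch (illustrative names, not microkernel code):

{{{
import java.util.HashMap;
import java.util.Map;

class MoveConflictDetector {

    // node name -> parent node name, after tentatively applying all
    // moves from both transactions
    private final Map<String, String> parentOf = new HashMap<String, String>();

    void applyMove(String node, String newParent) {
        parentOf.put(node, newParent);
    }

    // true iff 'node' became its own ancestor, i.e. the combined moves
    // cannot be ordered and at least one transaction must be undone.
    boolean createsCycle(String node) {
        String ancestor = parentOf.get(node);
        while (ancestor != null) {
            if (ancestor.equals(node)) {
                return true;
            }
            ancestor = parentOf.get(ancestor);
        }
        return false;
    }
}
}}}

For the example above, applyMove("x", "y") followed by applyMove("y", "x") makes createsCycle("x") return true.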

=== Distributed B+Tree ===
This approach was first mentioned on the [[http://markmail.org/message/jl66nqzct7s6ruju|dev list]]. The basic idea is to store the MK nodes in a B+Tree, using the paths of the nodes as keys. To support efficient enumeration of child nodes, the linearized MK tree is ordered according to a breadth-first traversal.
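
The breadth-first order can be expressed as a key comparator that sorts by path depth first and lexicographically within one depth, so that the children of a node end up in one contiguous key range (a sketch of the idea, not the actual key encoding):

{{{
import java.util.Comparator;

// Shallower paths sort first; within one level, paths sort
// lexicographically, so all children of the same parent are adjacent
// (they share the parent path as prefix).
class BreadthFirstPathComparator implements Comparator<String> {

    public int compare(String p1, String p2) {
        int d1 = depth(p1);
        int d2 = depth(p2);
        if (d1 != d2) {
            return d1 - d2;
        }
        return p1.compareTo(p2);
    }

    private static int depth(String path) {
        if (path.equals("/")) {
            return 0;
        }
        int depth = 0;
        for (int i = 0; i < path.length(); i++) {
            if (path.charAt(i) == '/') {
                depth++;
            }
        }
        return depth;
    }
}
}}}

Enumerating the children of a node is then a single range scan over the keys one level deeper that start with the node's path.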

The B+Tree could be implemented as described in the paper 'A practical scalable distributed B-tree' [3]. That B-tree implementation is built on top of Sinfonia [4]. Sinfonia is a distributed, block-oriented storage mechanism which supports simple transactions. Sinfonia works with optimistic concurrency control but still uses locks for a short amount of time to commit changes. Transactions that fail because of a conflict are simply retried until they succeed.
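
The retry behaviour amounts to an optimistic loop like the following (a hypothetical interface, only loosely modelled on the description of Sinfonia in [4]):

{{{
// Hypothetical optimistic transaction; names are illustrative.
interface Minitransaction {
    void stage();        // read items and buffer writes optimistically
    boolean tryCommit(); // validate and commit under short-lived locks;
                         // returns false if a conflicting commit won
}

class OptimisticExecutor {

    // Conflicting transactions are not blocked: they fail validation
    // and are simply retried until they succeed.
    void execute(Minitransaction tx) {
        do {
            tx.stage();
        } while (!tx.tryCommit());
    }
}
}}}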
  
=== Presentations ===
[[https://docs.google.com/presentation/pub?id=131sVx5s58jAKE2FSVBfUZVQSl1W820_syyzLYRHGH6E&start=false&loop=false&delayms=3000|Clustering for Jackrabbit 3]]
  
=== References ===
[1] [[http://research.microsoft.com/en-us/people/philbe/ccontrol.aspx|Concurrency Control and Recovery in Database Systems]]. Philip A. Bernstein, Vassos Hadzilacos, Nathan Goodman. 1987

[2] [[http://research.microsoft.com/apps/pubs/default.aspx?id=155638|Eventually Consistent Transactions]]. Sebastian Burckhardt, Manuel Fahndrich, Daan Leijen, Mooly Sagiv. October 2011

[3] [[http://research.microsoft.com/en-us/people/aguilera/distributed-btree-vldb2008.pdf|A practical scalable distributed B-tree]]. Marcos K. Aguilera, Wojciech Golab, Mehul A. Shah. 2008

[4] [[http://cs.ucla.edu/~kohler/class/08w-dsi/aguilera07sinfonia.pdf|Sinfonia: a new paradigm for building scalable distributed systems]]. Marcos K. Aguilera, Arif Merchant, Mehul Shah, Alistair Veitch, Christos Karamanolis. 2007
