[ https://issues.apache.org/jira/browse/CASSANDRA193?page=com.atlassian.jira.plugin.system.issuetabpanels:commenttabpanel&focusedCommentId=12722054#action_12722054
]
Stu Hood edited comment on CASSANDRA193 at 6/20/09 9:16 PM:

The current plan is for an AntiEntropyService per table to maintain a Merkle Tree per column
family.
The tree will be implemented as a full, randomized binary tree in memory (a ''Treap'': http://en.wikipedia.org/wiki/Treap
), where every item in the tree represents a range bounded by the dht.Tokens of its left and
right neighbors. By placing a bound on the total number of nodes in the tree, we can limit
the memory usage. We can compact or split ranges in the tree by removing or adding Tokens.
The algorithm for deciding which ranges to compact/split will be described below.
When a write comes in for a given table, we will place 'invalidation' operations in a queue
for all affected column families. The ExecutorService for the table will read from the queue
and perform the 'invalidations' as fast as it can. For a given Key/Token, if any column family
tree is marked as 'invalid', the entire row needs to be read from disk and repaired (at some
point in the future).
An 'invalidation' operation does a binary search in the Merkle Tree and marks the matching
range as 'invalid', deleting its hash. We will also take advantage of this step to optimize
the tree: A ''Treap'' stores a random priority (P) on each node, and by generating a random
P' and replacing P for a node iff P' < P as we invalidate it, more frequently invalidated
ranges will shift to the top of the tree.
The AEService maintaining the tree for a table will occasionally need to exchange portions
of the tree with other nodes. In order to do this, subtrees that both nodes are interested
in from all CF trees will have to be locked long enough to recalculate all 'invalid' children,
and then the locks can flow down the tree as progressively smaller ranges are exhanged. Doing
this locking efficiently is going to be interesting (aka: I haven't thought about it).
Implementing the exchange between nodes is blocked by CASSANDRA242 because in order to align
the subtrees on different nodes, we need to be able to deterministically split two ranges.
In order to fill in 'invalid' ranges in the tree, the MerkleTree will provide an operation
that builds a list of invalid ranges to be fetched from disk. During this step, we can also
compact/split ranges. Because of our Treap maintenance, frequently invalidated ranges will
be nearer to the top of the tree, and stable ranges will be closer to the bottom. By compacting
the deepest N leaves and expanding the shallowest N, we can minimize the size of the ranges
that are affected by invalidations in the future.
Given the list of 'invalid' ranges (and pointers directly to the tree nodes), the AEService
will fetch the ranges from the current MemTable and SSTables for the CF, hash them, and store
the hashes into the relevant nodes. After this operation, we can recursively calculate hashes
for inner nodes.
was (Author: stuhood):
The current plan is for an AntiEntropyService per table to maintain a Merkle Tree per
column family.
The tree will be a implemented as a randomized binary tree in memory (a ''Treap'': http://en.wikipedia.org/wiki/Treap),
where every item in the tree represents a range bounded by the dht.Tokens of its left and
right neighbors. By placing a bound on the total number of nodes in the tree, we can limit
the memory usage. We can compact or split ranges in the tree by removing or adding Tokens.
The algorithm for deciding which ranges to compact/split will be described below.
When a write comes in for a given table, we will place 'invalidation' operations in a queue
for all affected column families. The ExecutorService for the table will read from the queue
and perform the 'invalidations' as fast as it can. For a given Key/Token, if any column family
tree is marked as 'invalid', the entire row needs to be read from disk and repaired (at some
point in the future).
An 'invalidation' operation does a binary search in the Merkle Tree and marks the matching
range as 'invalid', deleting its hash. We will also take advantage of this step to optimize
the tree: A ''Treap'' stores a random priority (P) on each node, and by generating a random
P' and replacing P for a node iff P' < P as we invalidate it, more frequently invalidated
ranges will shift to the top of the tree.
The AEService maintaining the tree for a table will occasionally need to exchange portions
of the tree with other nodes. In order to do this, subtrees that both nodes are interested
in from all CF trees will have to be locked long enough to recalculate all 'invalid' children,
and then the locks can flow down the tree as progressively smaller ranges are exhanged. Doing
this locking efficiently is going to be interesting (aka: I haven't thought about it).
Implementing the exchange between nodes is blocked by CASSANDRA242 because in order to align
the subtrees on different nodes, we need to be able to deterministically split two ranges.
In order to fill in 'invalid' ranges in the tree, the MerkleTree will provide an operation
that builds a list of invalid ranges to be fetched from disk. During this step, we can also
compact/split ranges. Because of our Treap maintenance, frequently invalidated ranges will
be near the top of the tree, and stable ranges will be closer to the leaves. By compacting
the deepest N nodes and expanding the shallowest N, we can minimizing the size of the ranges
that are affected by invalidations in the future.
Given the list of 'invalid' ranges (and pointers directly to the tree nodes), the AEService
will fetch the ranges from the current MemTable and SSTables for the CF, hash them, and store
the hashes into the relevant nodes. After this operation, we can recursively calculate hashes
for inner nodes.
> Proactive repair
> 
>
> Key: CASSANDRA193
> URL: https://issues.apache.org/jira/browse/CASSANDRA193
> Project: Cassandra
> Issue Type: New Feature
> Reporter: Jonathan Ellis
> Assignee: Stu Hood
> Fix For: 0.5
>
>
> Currently cassandra supports "read repair," i.e., lazy repair when a read is done. This
is better than nothing but is not sufficient for some cases (e.g. catastrophic node failure
where you need to rebuild all of a node's data on a new machine).
> Dynamo uses merkle trees here. This is harder for Cassandra given the CF data model
but I suppose we could just hash the serialized CF value.

This message is automatically generated by JIRA.

You can reply to this email to add a comment to the issue online.
