cassandra-commits mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Stu Hood (JIRA)" <>
Subject [jira] Commented: (CASSANDRA-193) Proactive repair
Date Wed, 11 Nov 2009 19:38:39 GMT


Stu Hood commented on CASSANDRA-193:

Thanks a lot for the review Jun! I'll respond to #1 in a separate comment, since it is a design
issue that still needs a lot of discussion.

> 2
> 2.1 In MerkleTree.Node.insert, why do you increment the depth of the left child even
when the node doesn't split?
Node.insert() is only used during split operations (perhaps it is misnamed... but this is
not quite a traditional B-Tree). The child to the left is the node that contains the Token
we were splitting on, and whenever we split a range we increment its depth to indicate how
far it is from being a complete (0,0] range. I'll add a comment to this effect.

> 2.2 In the same function, if the node does split, where is the code to shrink the children
list in the splitted node to half?
Node.insert() uses List.subList() and List.clear() inline to take half of its neighbor's children.

> 2.3 In the same function, do you have to keep calling invalidate during insertion?
The original design assumed that the tree was going to live for a while in memory, and be
maintained between repair sessions, so the split operation is intended to be used on a tree
that might be partially valid. We might be able to have the initial building of the tree skip
this check somehow, but I don't think the 'hash == null? hash = null;' check is too intensive.

> 3
> 3.1 In Validator.add, there is comment about generating a new range. However, no code
does that.
The "private MerkleTree.TreeRangeIterator ranges" variable is an iterator generated by the
MerkleTree: it iterates in order over all of the ranges in the tree that have null hashes.

> 3.2 In TreeRange.validateHelper [...] Why do you have to compute multiple hash values
The reasoning here is that a MerkleTree is supposed to be a sparse representation of a 'perfect/complete'
binary tree. Each leaf of the perfect tree represents a hashed range, and each inner node
represents a binary hash of its two children. The perfect tree is of depth "hashdepth", so
when validateHelper() reaches the maximum/hashdepth, it is in one of the leaves of the perfect
tree, and rows are hashed sequentially there. If a single call to validate() starts in a TreeRange
that is at depth == maxdepth, then what is stored in the MkT.Node is the value of a perfect
leaf, otherwise, the MkT.Node is storing the value of a perfect inner node.

I'll probably copy and past this exact explanation into the next revision =x

> 4. I need some text description to really follow the Differencer code.
I'll make sure this gets in the next revision, but basically:
 1. It recurses using midpoint as long as both trees have the resolution to continue, and
are not equal.
 2a. If it finds a range that has only one invalid child, it adds that child range, since
that is the smallest possible invalid range contained in the parent.
 2b. Otherwise, if both children are invalid (and since it can't recurse deeper), the parent
range is entirely invalid, and recursion keeps rolling up until 2a is met.

> 5. The Hashable class is confusing.
You're right: I definitely should have worried more about overloading the word "hash": I'll
see what I can do about this one.

> 6. The repair logic is missing in Differencer.
The repair logic is the one FIXME for this ticket: CASSANDRA-520 deals with implementing the
actual repair logic.

Thanks again for taking time out to look at this! 

> Proactive repair
> ----------------
>                 Key: CASSANDRA-193
>                 URL:
>             Project: Cassandra
>          Issue Type: New Feature
>          Components: Core
>            Reporter: Jonathan Ellis
>            Assignee: Stu Hood
>             Fix For: 0.5
>         Attachments: 193-1-tree-preparation.diff, 193-2-tree.diff, 193-3-aes-preparation.diff,
> Currently cassandra supports "read repair," i.e., lazy repair when a read is done.  This
is better than nothing but is not sufficient for some cases (e.g. catastrophic node failure
where you need to rebuild all of a node's data on a new machine).
> Dynamo uses merkle trees here.  This is harder for Cassandra given the CF data model
but I suppose we could just hash the serialized CF value.

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.

View raw message