From "Sylvain Lebresne (JIRA)" <>
Subject [jira] [Commented] (CASSANDRA-5972) Reduce the amount of data to be transferred during repair
Date Tue, 03 Sep 2013 10:20:51 GMT


Sylvain Lebresne commented on CASSANDRA-5972:

To be clear, this is not exactly the same idea as CASSANDRA-3200 if you read the descriptions.
However, I don't see how we can make the idea of this ticket work without requiring the same
complexity as in CASSANDRA-3200, at which point I think both solutions are basically equivalent.

Let me explain what I mean. Consider, for instance, 3 nodes A, B and C and some token range
and consider 2 sub-ranges R1 and R2 (by sub-range I mean a MerkleTree hash) with the following values:
  A : R1=0, R2=0
  B : R1=1, R2=0
  C : R1=1, R2=1
and suppose that the up-to-date value for R1 and R2 is 1 (so C is fully up to date).  Now,
building the merkle tree doesn't tell us who is more up to date, it only gives us the sub-ranges
on which 2 nodes differ, so R1 for (A,B), R2 for (B,C) and R1,R2 for (A,C). So if we were
to do A->B->C, transferring only the minimum ranges that differ between each pair of
nodes, A would transfer R1 to B and B would transfer R2 to C, which would change nothing.
Then we would do C->B->A: C would transfer R2 to B and B would transfer R1 to A. At
the end, while B would be fully repaired, A wouldn't: it would still have R2=0.
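
The walk-through above can be replayed with a small simulation. This is only a sketch under assumptions that are not part of Cassandra: each node is modelled as a dict of sub-range versions where a higher number is newer, the differing sub-ranges are computed once from the initial state, and the names `INITIAL`, `differing` and `run_chains` are made up for illustration.

```python
# Hypothetical model: a node is a dict {sub-range name: version}, higher = newer.
INITIAL = {
    "A": {"R1": 0, "R2": 0},
    "B": {"R1": 1, "R2": 0},
    "C": {"R1": 1, "R2": 1},
}

def differing(x, y):
    """Sub-ranges on which two nodes differ (all a Merkle tree comparison reports)."""
    return {r for r in x if x[r] != y[r]}

def run_chains(nodes, chains):
    """Apply each chain in order, sending only the initially differing sub-ranges."""
    # Differences are computed once, from the trees built before any transfer.
    diffs = {(a, b): differing(nodes[a], nodes[b])
             for a in nodes for b in nodes if a != b}
    state = {name: dict(values) for name, values in nodes.items()}
    for chain in chains:
        for src, dst in zip(chain, chain[1:]):
            for r in diffs[(src, dst)]:
                # The receiver reconciles and keeps the newer value.
                state[dst][r] = max(state[dst][r], state[src][r])
    return state

result = run_chains(INITIAL, ["ABC", "CBA"])
for name in sorted(result):
    print(name, result[name])
```

Under this model B and C end with both sub-ranges at 1, while A ends with R1=1 but R2 still at 0, matching the description above.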

Of course, if we were to order the chain differently and do A->C->B and
B->C->A, then we would have all nodes in sync, but I don't think we can decide that in
general without doing low-level comparison between all the trees at the sub-range level, and
doing so is exactly the difficulty of CASSANDRA-3200.
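
To illustrate how much the outcome depends on the ordering, here is a sketch that tries every ordering of the 3 nodes, runs the chain forward and then in reverse, and reports which orderings leave everyone in sync. The model is an assumption for illustration (versions as integers, higher is newer, differences taken from the initial state only), and `repaired_after` is a hypothetical helper. Note that the check at the end uses knowledge of the newest values, which the Merkle trees alone do not give us.

```python
from itertools import permutations

# Hypothetical model: a node is a dict {sub-range name: version}, higher = newer.
INITIAL = {
    "A": {"R1": 0, "R2": 0},
    "B": {"R1": 1, "R2": 0},
    "C": {"R1": 1, "R2": 1},
}

def repaired_after(order, nodes):
    """Run the chain in `order`, then reversed, sending only initially differing sub-ranges."""
    state = {name: dict(values) for name, values in nodes.items()}
    back = order[::-1]
    hops = list(zip(order, order[1:])) + list(zip(back, back[1:]))
    for src, dst in hops:
        for r in nodes[src]:
            if nodes[src][r] != nodes[dst][r]:  # differed in the initial trees
                state[dst][r] = max(state[dst][r], state[src][r])
    # Only the simulation knows the newest value of each sub-range.
    ranges = next(iter(nodes.values()))
    newest = {r: max(v[r] for v in nodes.values()) for r in ranges}
    return all(s == newest for s in state.values())

converging = sorted("".join(p) for p in permutations(INITIAL)
                    if repaired_after(p, INITIAL))
print(converging)
```

In this example only the orderings that put C in the middle of the chain converge, and picking them requires knowing which node holds the newer data for each sub-range.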

I'll note that another solution that doesn't require sub-range level analysis would be to
take the union of all sub-ranges that differ between any 2 nodes and always transfer that,
i.e. in my example above to transfer both R1 and R2 between A and B (even though the merkle
tree had told us they don't initially differ on R2), but doing so would potentially
yield a lot more transfer than we currently do.
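
A sketch of the union variant on the same example, again under an assumed model (integer versions, higher is newer) and with hypothetical helper names `union_of_differences` and `run_with_union`. It counts one "sub-range transfer" each time a sub-range is sent over a hop, which is a simplification and not a measure of actual bytes streamed.

```python
# Hypothetical model: a node is a dict {sub-range name: version}, higher = newer.
INITIAL = {
    "A": {"R1": 0, "R2": 0},
    "B": {"R1": 1, "R2": 0},
    "C": {"R1": 1, "R2": 1},
}

def union_of_differences(nodes):
    """Every sub-range on which at least one pair of nodes differs."""
    names = list(nodes)
    return {r
            for i, a in enumerate(names)
            for b in names[i + 1:]
            for r in nodes[a]
            if nodes[a][r] != nodes[b][r]}

def run_with_union(nodes, chains):
    """Send the whole union on every hop; return final state and transfer count."""
    ranges = union_of_differences(nodes)
    state = {name: dict(values) for name, values in nodes.items()}
    sent = 0
    for chain in chains:
        for src, dst in zip(chain, chain[1:]):
            for r in ranges:
                # The receiver reconciles and keeps the newer value.
                state[dst][r] = max(state[dst][r], state[src][r])
                sent += 1
    return state, sent

state, sent = run_with_union(INITIAL, ["ABC", "CBA"])
print(state, sent)
```

With the A->B->C then C->B->A chain, all three nodes end up in sync in this model, at the cost of 8 sub-range transfers where the per-pair minimum on the same chain would send 4.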

> Reduce the amount of data to be transferred during repair
> ---------------------------------------------------------
>                 Key: CASSANDRA-5972
>                 URL:
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Core
>            Reporter: Jacek Lewandowski
>            Priority: Minor
> Currently, when a validator finds a token range different in n replicas, data streams
> are initiated simultaneously between each possible pair of these n nodes, in both directions.
> It yields n*(n-1) data streams in total.
> It can be done in a sequence - R(1) -> R(2), R(2) -> R(3), ... , R(n-1) -> R(n).
> After this process, the data in R(n) are up to date. Then, we continue: R(n) -> R(1), R(1)
> -> R(2), ... , R(n-2) -> R(n-1). The active repair is done after 2*(n-1) data transfers
> performed sequentially in 2*(n-1) steps.
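
For a sense of scale, the two counts from the ticket description can be tabulated for a few replica counts. The function names below are hypothetical, and the sketch only evaluates the two formulas quoted above; it says nothing about how much data each stream carries.

```python
def pairwise_streams(n):
    """Streams when every pair of n replicas streams in both directions."""
    return n * (n - 1)

def chained_transfers(n):
    """Transfers for the two sequential passes proposed in the ticket."""
    return 2 * (n - 1)

for n in range(2, 6):
    print(n, pairwise_streams(n), chained_transfers(n))
```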

