incubator-cassandra-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jaakko <rosvopaalli...@gmail.com>
Subject Re: Bootstrap in rack-aware mode
Date Tue, 01 Dec 2009 14:29:00 GMT
Just attached a draft patch to #562 if somebody is interested to have
a look. It outlines how leave coordination could work, although I
would be surprised if the code actually works as it has not been
tested. I'll improve & test this tomorrow.

Any ideas/comments would be welcome, especially on the subject do
people think this approach is the best one or even needed?

Cheers,
-Jaakko


On Tue, Dec 1, 2009 at 1:01 PM, Jaakko <rosvopaallikko@gmail.com> wrote:
> I think adding nodes to the ring (that is, bootstrapping) is safe even
> if there are multiple such operations happening at the same time.
> Bootstrapping causes existing ranges to be fragmented and if we count
> pending ranges based on old token metadata, the worst that can happen
> (I think) is that pending ranges will be too big and we get data that
> is not our responsibility once the other bootstraps finish. I could
> not find any scenario where pending ranges did not cover node's
> eventual ranges.
>
> However, nodes booting or leaving when another node is leaving within
> the same affected ranges, is another issue entirely (as discussed in
> #562 and an earlier mail thread. See especially "2nd option" in #562
> description). Problem is, in this case actual ranges might eventually
> be _bigger_ than pending ranges calculated, or be entirely wrong as
> one node disappears from the ring. This problem is present regardless
> of replication strategy:
>
> Original cluster (primary range, replica range)
> A: E-A, D-E
> B: A-B, E-A
> D: B-D, A-B
> E: D-E, B-D
>
> C bootstraps in between B and D:
> A: E-A, D-E
> B: A-B, E-A
> C: B-C, A-B
> D: C-D, B-C
> E: D-E, C-D
>
> At the same time B leaves:
> A: E-A, D-E
> D: A-D, E-A
> E: D-E, A-D
>
> Actual final situation:
> A: E-A, D-E
> C: A-C, E-A
> D: C-D, A-C
> E: D-E, C-D
>
> It is easy to see that when C bootstraps, it thinks (based on token
> metadata at that time) that it's primary range will be B-C and replica
> range A-B. Final situation is A-C / E-A, which is much larger than the
> node assumed.
>
> I think this will need another control channel as gossiping is not
> able to solve this. Basically we need to control movement in a way
> that If a node is leaving, there must not be any other movement
> (leaving or bootstrapping) within the affected ranges. This means
> that:
>
> (1) before a node can leave, it must contact all nodes that are going
> to experience range changes and get permission from them. If any of
> the other nodes already has pending range changes either due to
> another bootstrap or leave operation, the node wanting to leave must
> wait.
> (2) before a node can bootstrap, it must get permission from the
> intended bootstrap source. If there is a leave operation in progress,
> bootstrap must wait.
>
> As also discussed in #562, for automatic load balancing purposes it
> would be best to limit the amount of moving nodes (within affected
> ranges again) to one, so basically I think we'll need a movement
> control channel that makes sure only one move operation is in progress
> at certain area of the ring.
>
> This would of course make autobootstrapping a big cluster a bit
> slower, as nodes have to queue when bootstrapping, but I don't think
> this is a big problem. It would make all these dark corner cases go
> away, and would IMHO be the best thing to do. If this proves to be a
> problem (somebody wants to move groups of nodes at the same time or
> something like that), we can then polish the mechanism.
>
> -Jaakko
>
>
> On Tue, Dec 1, 2009 at 4:19 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
>> Do we have a problem from bootstrapping nodes not being aware of each
>> other in rack-aware replication strategy?
>>
>> Background: bootstrap makes the assumption that we can simplify things
>> by treating bootstrap of multiple nodes independently, trading some
>> (potential) extra copying for simplifying the process for recovery if
>> a node fails or is killed during the bootstrap process.
>>
>> A couple examples should illustrate this.
>>
>> Suppose we have nodes A and D in rack unaware mode, replication factor
>> of one (for simplicity).  The ranges are then (D-A] for A and (A-D]
>> for D.
>>
>> Nodes B and C then bootstrap between A and D.  So we copy (A-B] to B
>> and (A-C] to C.  If both bootstraps complete successfully then they
>> will serve (A-B] and (B-C], that is, we transferred (A-B] to C
>> unnecessarily.  But, if either bootstrap fails, the remaining
>> bootstrap can ignore that and serve the entire range that was
>> transferred to it.
>>
>> So for rack-unaware bootstrapping it is clear that
>> bootstrap-in-isolation is fine.  But what about rack-aware?
>>
>> Recall that in rack-aware mode, we write the first replica to the
>> first node on the ring _in the other data center_, and remaining
>> replicas to nodes in the same.
>>
>> Say we have two nodes A and D, in different DCs, with a replication
>> factor of 2:
>>
>> A / D
>>
>> Node    Primary range    Replica for
>> A       (D-A]            (A-D]
>> D       (A-D]            (D-A]
>>
>> If we add nodes B and C in the same DCs as A and D, respectively, we
>> bootstrap as
>>
>> A,B / C,D
>>
>> B predicts the ring will be
>> Node    Primary range    Replica for
>> A       (D-A]            (B-D]
>> B       (A-B]
>> D       (B-D]            (D-A], (A-B]
>>
>> C predicts
>> Node    Primary range    Replica for
>> A       (D-A]            (A-C], (C-D]
>> C       (A-C]            (D-A]
>> D       (C-D]
>>
>> And really we end up with
>> Node    Primary range    Replica for
>> A       (D-A]            (B-C], (C-D]
>> B       (A-B]
>> C       (B-C]            (D-A], (A-B]
>> D       (C-D]
>>
>> So each node does have (a superset of) the right data copied.  (Note
>> that C has (A-B] as a replica in the final version, whereas it
>> predicted it would be part of its primary range, but that doesn't
>> matter as long as it ended up w/ the right data on it.)
>>
>> If instead we add B and C both to D's datacenter we have:
>>
>> A / B,C,D
>>
>> Node    Primary range    Replica for
>> A       (D-A]            (A-B], (B-D]
>> B       (A-B]            (D-A]
>> D       (B-D]
>>
>> Node    Primary range    Replica for
>> A       (D-A]            (A-C], (C-D]
>> C       (A-C]            (D-A]
>> D       (C-D]
>>
>> Node    Primary range    Replica for
>> A       (D-A]            (A-B], (B-C], (C-D]
>> B       (A-B]            (D-A]
>> C       (B-C]
>> D       (C-D]
>>
>> Again each node ends up with the right data.
>>
>> Are there conditions under which we don't?
>>
>> After playing around with this in my mind I think that there are not,
>> but this is tricky so peer review is welcome. :)
>>
>

Mime
View raw message