Mailing-List: contact cassandra-dev-help@incubator.apache.org; run by ezmlm
Precedence: bulk
Reply-To: cassandra-dev@incubator.apache.org
Received-SPF: pass (nike.apache.org: domain of rosvopaallikko@gmail.com
 designates 209.85.160.41 as permitted sender)
DomainKey-Signature: a=rsa-sha1; c=nofws;
        d=gmail.com; s=gamma;
        h=mime-version:in-reply-to:references:date:message-id:subject:from:to
         :content-type:content-transfer-encoding;
        b=dghCX+pZmimAwahVa3MqMUkXuR/G1yW/0vp8AJiaREDE3eHbwYz4SWzXYKA0PmxEh+
         MuUKL2fJdws6/SCx2ajzDsneN7Ae5vl5nDcpv+eYTCDhDljn5SLRNPV08qgoLPsJ3iP1
         gQtHb0Q4K6ipzBESaBoMfcgR9N1KPG2HoykSc=
MIME-Version: 1.0
In-Reply-To: <e06563880911301119x5d89a260m418122efa9986232@mail.gmail.com>
References: <e06563880911301119x5d89a260m418122efa9986232@mail.gmail.com>
Date: Tue, 1 Dec 2009 13:01:41 +0900
Message-ID: <fc97bb180911302001o1a4aaa5i9f979d18ee2eca70@mail.gmail.com>
Subject: Re: Bootstrap in rack-aware mode
From: Jaakko <rosvopaallikko@gmail.com>
To: cassandra-dev@incubator.apache.org
Content-Type: text/plain; charset=ISO-8859-1
Content-Transfer-Encoding: quoted-printable

I think adding nodes to the ring (that is, bootstrapping) is safe even
if there are multiple such operations happening at the same time.
Bootstrapping causes existing ranges to be fragmented and if we count
pending ranges based on old token metadata, the worst that can happen
(I think) is that pending ranges will be too big and we get data that
is not our responsibility once the other bootstraps finish. I could
not find any scenario where pending ranges did not cover node's
eventual ranges.

However, nodes booting or leaving when another node is leaving within
the same affected ranges, is another issue entirely (as discussed in
#562 and an earlier mail thread. See especially "2nd option" in #562
description). Problem is, in this case actual ranges might eventually
be _bigger_ than pending ranges calculated, or be entirely wrong as
one node disappears from the ring. This problem is present regardless
of replication strategy:

Original cluster (primary range, replica range)
A: E-A, D-E
B: A-B, E-A
D: B-D, A-B
E: D-E, B-D

C bootstraps in between B and D:
A: E-A, D-E
B: A-B, E-A
C: B-C, A-B
D: C-D, B-C
E: D-E, C-D

At the same time B leaves:
A: E-A, D-E
D: A-D, E-A
E: D-E, A-D

Actual final situation:
A: E-A, D-E
C: A-C, E-A
D: C-D, A-C
E: D-E, C-D

It is easy to see that when C bootstraps, it thinks (based on token
metadata at that time) that it's primary range will be B-C and replica
range A-B. Final situation is A-C / E-A, which is much larger than the
node assumed.

I think this will need another control channel as gossiping is not
able to solve this. Basically we need to control movement in a way
that If a node is leaving, there must not be any other movement
(leaving or bootstrapping) within the affected ranges. This means
that:

(1) before a node can leave, it must contact all nodes that are going
to experience range changes and get permission from them. If any of
the other nodes already has pending range changes either due to
another bootstrap or leave operation, the node wanting to leave must
wait.
(2) before a node can bootstrap, it must get permission from the
intended bootstrap source. If there is a leave operation in progress,
bootstrap must wait.

As also discussed in #562, for automatic load balancing purposes it
would be best to limit the amount of moving nodes (within affected
ranges again) to one, so basically I think we'll need a movement
control channel that makes sure only one move operation is in progress
at certain area of the ring.

This would of course make autobootstrapping a big cluster a bit
slower, as nodes have to queue when bootstrapping, but I don't think
this is a big problem. It would make all these dark corner cases go
away, and would IMHO be the best thing to do. If this proves to be a
problem (somebody wants to move groups of nodes at the same time or
something like that), we can then polish the mechanism.

-Jaakko


On Tue, Dec 1, 2009 at 4:19 AM, Jonathan Ellis <jbellis@gmail.com> wrote:
> Do we have a problem from bootstrapping nodes not being aware of each
> other in rack-aware replication strategy?
>
> Background: bootstrap makes the assumption that we can simplify things
> by treating bootstrap of multiple nodes independently, trading some
> (potential) extra copying for simplifying the process for recovery if
> a node fails or is killed during the bootstrap process.
>
> A couple examples should illustrate this.
>
> Suppose we have nodes A and D in rack unaware mode, replication factor
> of one (for simplicity). =A0The ranges are then (D-A] for A and (A-D]
> for D.
>
> Nodes B and C then bootstrap between A and D. =A0So we copy (A-B] to B
> and (A-C] to C. =A0If both bootstraps complete successfully then they
> will serve (A-B] and (B-C], that is, we transferred (A-B] to C
> unnecessarily. =A0But, if either bootstrap fails, the remaining
> bootstrap can ignore that and serve the entire range that was
> transferred to it.
>
> So for rack-unaware bootstrapping it is clear that
> bootstrap-in-isolation is fine. =A0But what about rack-aware?
>
> Recall that in rack-aware mode, we write the first replica to the
> first node on the ring _in the other data center_, and remaining
> replicas to nodes in the same.
>
> Say we have two nodes A and D, in different DCs, with a replication
> factor of 2:
>
> A / D
>
> Node =A0 =A0Primary range =A0 =A0Replica for
> A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-D]
> D =A0 =A0 =A0 (A-D] =A0 =A0 =A0 =A0 =A0 =A0(D-A]
>
> If we add nodes B and C in the same DCs as A and D, respectively, we
> bootstrap as
>
> A,B / C,D
>
> B predicts the ring will be
> Node =A0 =A0Primary range =A0 =A0Replica for
> A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(B-D]
> B =A0 =A0 =A0 (A-B]
> D =A0 =A0 =A0 (B-D] =A0 =A0 =A0 =A0 =A0 =A0(D-A], (A-B]
>
> C predicts
> Node =A0 =A0Primary range =A0 =A0Replica for
> A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-C], (C-D]
> C =A0 =A0 =A0 (A-C] =A0 =A0 =A0 =A0 =A0 =A0(D-A]
> D =A0 =A0 =A0 (C-D]
>
> And really we end up with
> Node =A0 =A0Primary range =A0 =A0Replica for
> A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(B-C], (C-D]
> B =A0 =A0 =A0 (A-B]
> C =A0 =A0 =A0 (B-C] =A0 =A0 =A0 =A0 =A0 =A0(D-A], (A-B]
> D =A0 =A0 =A0 (C-D]
>
> So each node does have (a superset of) the right data copied. =A0(Note
> that C has (A-B] as a replica in the final version, whereas it
> predicted it would be part of its primary range, but that doesn't
> matter as long as it ended up w/ the right data on it.)
>
> If instead we add B and C both to D's datacenter we have:
>
> A / B,C,D
>
> Node =A0 =A0Primary range =A0 =A0Replica for
> A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-B], (B-D]
> B =A0 =A0 =A0 (A-B] =A0 =A0 =A0 =A0 =A0 =A0(D-A]
> D =A0 =A0 =A0 (B-D]
>
> Node =A0 =A0Primary range =A0 =A0Replica for
> A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-C], (C-D]
> C =A0 =A0 =A0 (A-C] =A0 =A0 =A0 =A0 =A0 =A0(D-A]
> D =A0 =A0 =A0 (C-D]
>
> Node =A0 =A0Primary range =A0 =A0Replica for
> A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-B], (B-C], (C-D]
> B =A0 =A0 =A0 (A-B] =A0 =A0 =A0 =A0 =A0 =A0(D-A]
> C =A0 =A0 =A0 (B-C]
> D =A0 =A0 =A0 (C-D]
>
> Again each node ends up with the right data.
>
> Are there conditions under which we don't?
>
> After playing around with this in my mind I think that there are not,
> but this is tricky so peer review is welcome. :)
>