Return-Path: Delivered-To: apmail-incubator-cassandra-dev-archive@minotaur.apache.org Received: (qmail 24556 invoked from network); 1 Dec 2009 04:02:18 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 1 Dec 2009 04:02:18 -0000 Received: (qmail 28568 invoked by uid 500); 1 Dec 2009 04:02:18 -0000 Delivered-To: apmail-incubator-cassandra-dev-archive@incubator.apache.org Received: (qmail 28474 invoked by uid 500); 1 Dec 2009 04:02:16 -0000 Mailing-List: contact cassandra-dev-help@incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: cassandra-dev@incubator.apache.org Delivered-To: mailing list cassandra-dev@incubator.apache.org Received: (qmail 28464 invoked by uid 99); 1 Dec 2009 04:02:15 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 01 Dec 2009 04:02:15 +0000 X-ASF-Spam-Status: No, hits=-0.0 required=10.0 tests=SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of rosvopaallikko@gmail.com designates 209.85.160.41 as permitted sender) Received: from [209.85.160.41] (HELO mail-pw0-f41.google.com) (209.85.160.41) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 01 Dec 2009 04:02:07 +0000 Received: by pwj21 with SMTP id 21so126902pwj.20 for ; Mon, 30 Nov 2009 20:01:45 -0800 (PST) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=domainkey-signature:mime-version:received:in-reply-to:references :date:message-id:subject:from:to:content-type :content-transfer-encoding; bh=KGn5cMwTgdkudSnbvuxw8AVrPI9xB44ApRDhHFU6Tj8=; b=DdmnMHc6DalE65jK6nZPJT6GwOoCKaBl1+vZZ4csfQgBPaONkSmojKYMEK4G1MnW+o PtMAgqRbeH60YOsD5SxMrXow0ataBQvJWdad5ui39gq/S8gCdD4RDhMuMm9U0nijNRLB yyVMzerBrHEgo95L25Ov1/AZrpcw1ALaseMtg= DomainKey-Signature: a=rsa-sha1; c=nofws; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type:content-transfer-encoding; b=dghCX+pZmimAwahVa3MqMUkXuR/G1yW/0vp8AJiaREDE3eHbwYz4SWzXYKA0PmxEh+ MuUKL2fJdws6/SCx2ajzDsneN7Ae5vl5nDcpv+eYTCDhDljn5SLRNPV08qgoLPsJ3iP1 gQtHb0Q4K6ipzBESaBoMfcgR9N1KPG2HoykSc= MIME-Version: 1.0 Received: by 10.115.27.14 with SMTP id e14mr1553664waj.116.1259640101939; Mon, 30 Nov 2009 20:01:41 -0800 (PST) In-Reply-To: References: Date: Tue, 1 Dec 2009 13:01:41 +0900 Message-ID: Subject: Re: Bootstrap in rack-aware mode From: Jaakko To: cassandra-dev@incubator.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org I think adding nodes to the ring (that is, bootstrapping) is safe even if there are multiple such operations happening at the same time. Bootstrapping causes existing ranges to be fragmented and if we count pending ranges based on old token metadata, the worst that can happen (I think) is that pending ranges will be too big and we get data that is not our responsibility once the other bootstraps finish. I could not find any scenario where pending ranges did not cover node's eventual ranges. However, nodes booting or leaving when another node is leaving within the same affected ranges, is another issue entirely (as discussed in #562 and an earlier mail thread. See especially "2nd option" in #562 description). Problem is, in this case actual ranges might eventually be _bigger_ than pending ranges calculated, or be entirely wrong as one node disappears from the ring. This problem is present regardless of replication strategy: Original cluster (primary range, replica range) A: E-A, D-E B: A-B, E-A D: B-D, A-B E: D-E, B-D C bootstraps in between B and D: A: E-A, D-E B: A-B, E-A C: B-C, A-B D: C-D, B-C E: D-E, C-D At the same time B leaves: A: E-A, D-E D: A-D, E-A E: D-E, A-D Actual final situation: A: E-A, D-E C: A-C, E-A D: C-D, A-C E: D-E, C-D It is easy to see that when C bootstraps, it thinks (based on token metadata at that time) that it's primary range will be B-C and replica range A-B. Final situation is A-C / E-A, which is much larger than the node assumed. I think this will need another control channel as gossiping is not able to solve this. Basically we need to control movement in a way that If a node is leaving, there must not be any other movement (leaving or bootstrapping) within the affected ranges. This means that: (1) before a node can leave, it must contact all nodes that are going to experience range changes and get permission from them. If any of the other nodes already has pending range changes either due to another bootstrap or leave operation, the node wanting to leave must wait. (2) before a node can bootstrap, it must get permission from the intended bootstrap source. If there is a leave operation in progress, bootstrap must wait. As also discussed in #562, for automatic load balancing purposes it would be best to limit the amount of moving nodes (within affected ranges again) to one, so basically I think we'll need a movement control channel that makes sure only one move operation is in progress at certain area of the ring. This would of course make autobootstrapping a big cluster a bit slower, as nodes have to queue when bootstrapping, but I don't think this is a big problem. It would make all these dark corner cases go away, and would IMHO be the best thing to do. If this proves to be a problem (somebody wants to move groups of nodes at the same time or something like that), we can then polish the mechanism. -Jaakko On Tue, Dec 1, 2009 at 4:19 AM, Jonathan Ellis wrote: > Do we have a problem from bootstrapping nodes not being aware of each > other in rack-aware replication strategy? > > Background: bootstrap makes the assumption that we can simplify things > by treating bootstrap of multiple nodes independently, trading some > (potential) extra copying for simplifying the process for recovery if > a node fails or is killed during the bootstrap process. > > A couple examples should illustrate this. > > Suppose we have nodes A and D in rack unaware mode, replication factor > of one (for simplicity). =A0The ranges are then (D-A] for A and (A-D] > for D. > > Nodes B and C then bootstrap between A and D. =A0So we copy (A-B] to B > and (A-C] to C. =A0If both bootstraps complete successfully then they > will serve (A-B] and (B-C], that is, we transferred (A-B] to C > unnecessarily. =A0But, if either bootstrap fails, the remaining > bootstrap can ignore that and serve the entire range that was > transferred to it. > > So for rack-unaware bootstrapping it is clear that > bootstrap-in-isolation is fine. =A0But what about rack-aware? > > Recall that in rack-aware mode, we write the first replica to the > first node on the ring _in the other data center_, and remaining > replicas to nodes in the same. > > Say we have two nodes A and D, in different DCs, with a replication > factor of 2: > > A / D > > Node =A0 =A0Primary range =A0 =A0Replica for > A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-D] > D =A0 =A0 =A0 (A-D] =A0 =A0 =A0 =A0 =A0 =A0(D-A] > > If we add nodes B and C in the same DCs as A and D, respectively, we > bootstrap as > > A,B / C,D > > B predicts the ring will be > Node =A0 =A0Primary range =A0 =A0Replica for > A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(B-D] > B =A0 =A0 =A0 (A-B] > D =A0 =A0 =A0 (B-D] =A0 =A0 =A0 =A0 =A0 =A0(D-A], (A-B] > > C predicts > Node =A0 =A0Primary range =A0 =A0Replica for > A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-C], (C-D] > C =A0 =A0 =A0 (A-C] =A0 =A0 =A0 =A0 =A0 =A0(D-A] > D =A0 =A0 =A0 (C-D] > > And really we end up with > Node =A0 =A0Primary range =A0 =A0Replica for > A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(B-C], (C-D] > B =A0 =A0 =A0 (A-B] > C =A0 =A0 =A0 (B-C] =A0 =A0 =A0 =A0 =A0 =A0(D-A], (A-B] > D =A0 =A0 =A0 (C-D] > > So each node does have (a superset of) the right data copied. =A0(Note > that C has (A-B] as a replica in the final version, whereas it > predicted it would be part of its primary range, but that doesn't > matter as long as it ended up w/ the right data on it.) > > If instead we add B and C both to D's datacenter we have: > > A / B,C,D > > Node =A0 =A0Primary range =A0 =A0Replica for > A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-B], (B-D] > B =A0 =A0 =A0 (A-B] =A0 =A0 =A0 =A0 =A0 =A0(D-A] > D =A0 =A0 =A0 (B-D] > > Node =A0 =A0Primary range =A0 =A0Replica for > A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-C], (C-D] > C =A0 =A0 =A0 (A-C] =A0 =A0 =A0 =A0 =A0 =A0(D-A] > D =A0 =A0 =A0 (C-D] > > Node =A0 =A0Primary range =A0 =A0Replica for > A =A0 =A0 =A0 (D-A] =A0 =A0 =A0 =A0 =A0 =A0(A-B], (B-C], (C-D] > B =A0 =A0 =A0 (A-B] =A0 =A0 =A0 =A0 =A0 =A0(D-A] > C =A0 =A0 =A0 (B-C] > D =A0 =A0 =A0 (C-D] > > Again each node ends up with the right data. > > Are there conditions under which we don't? > > After playing around with this in my mind I think that there are not, > but this is tricky so peer review is welcome. :) >