Date: Sat, 15 Jun 2013 09:25:24 +0200
Subject: Re: Rolling config change considered harmful?
From: German Blanco
To: user@zookeeper.apache.org

Thanks a lot Alex! Excellent explanation :-)

On Sat, Jun 15, 2013 at 7:01 AM, Alexander Shraer wrote:

> Hi German,
>
> During normal operation ZK guarantees that a quorum (any majority) of
> the ZK ensemble has all operations that may have been committed.
>
> So without dynamic reconfiguration you should ensure that when you're
> changing the ensemble, any possible quorum of the new ensemble
> necessarily intersects with any quorum of the 'old' ensemble.
>
> If you add D and E right away this property may not be guaranteed,
> since a new quorum (3 out of 5) can be for example C, D, E. Whereas
> it's possible that only A and B have the latest state, so it'll be
> lost.
>
> To ensure this when going from 3 to 5 servers you should probably do 2
> transitions. First add server D. Any majority here, i.e., any 3 servers
> out of 4, will necessarily contain 2 servers from the original ensemble
> (A, B, C).
> So at least one server in any new quorum actually has the latest state.
>
> Then add E. A quorum here is any 3 out of 5 servers, so even if some
> quorum includes E and the server from (A, B, C, D) that
> didn't have the latest state, we still have 1 server in the quorum
> that does have the latest state, so we're fine...
>
> As long as you add servers one by one and wait for leader election to
> complete in every stage, or preserve the quorum intersection property
> in some other way, it should be safe. But with dynamic reconfig you
> don't need to do that and no reboots necessary of course.
>
> Alex
>
> On Fri, Jun 14, 2013 at 9:14 PM, German Blanco wrote:
> > Hello,
> >
> > Could you please clarify if this thread is about a rolling start in an
> > ensemble without the dynamic reconfiguration support?
> > And when you say "Create a 5 node ensemble", that means quorum is 5. But
> > then you give server lists with only 3 servers in each node?
> > If the server list has 3 servers, then quorum is actually 3 and what is
> > described may happen in that scenario.
> > In that case C follows B, E follows D and A follows either B or D and
> > there are two working ensembles.
> > It should be possible to create problems, even with more standard
> > configuration changes:
> > If we want to change a quorum of three to a quorum of five {A,B,C} to
> > {A,B,C,D,E}:
> > - First the configuration is changed in all the nodes, but they are not
> > restarted. Only A, B and C are running.
> > - One of them is stopped (e.g. A).
> > - At this point, if A, D and E are started with the new configuration,
> > they may elect a leader before any of them is aware of either B or C,
> > form an ensemble and start serving txns.
> > - However, if A is started, we wait until it connects to the leader of B
> > and C, and then D and E are started and then B and C are restarted,
> > everything should be ok.
> > The fact that this depends on the human ability to
> > start D and E while A, B and C are connected to the ensemble seems a bit
> > risky though.
> > I have found a presentation on the topic:
> > http://www.slideshare.net/Hadoop_Summit/dynamic-reconfiguration-of-zookeeper
> >
> > If anybody knows of a safer way to change a quorum of 3 to a quorum of 5
> > with e.g. zookeeper 3.4.5, please point it out.
> >
> > Regards,
> >
> > Germán.
> >
> >
> > On Fri, Jun 14, 2013 at 11:46 PM, Jordan Zimmerman
> > <jordan@jordanzimmerman.com> wrote:
> >
> >> I got the test cluster into the state described with 2 leaders. I then
> >> allocated 100 Curator clients to write nodes "/n" where n is the index
> >> (i.e. "/0", "/1", …). The idea was that the nodes would be distributed
> >> around the cluster instances. I then allocated a single Curator instance
> >> dedicated to one of the server instances, did a sync, and did an exists()
> >> to verify that each cluster instance had all the nodes. For the 2 leader
> >> cluster, this fails.
> >>
> >> -JZ
> >>
> >> On Jun 14, 2013, at 1:54 PM, "FPJ" wrote:
> >>
> >> > I messed up the last sentence, here is what I was trying to say:
> >> >
> >> > It is ok to have two servers thinking they are leaders as long as only
> >> > one is able to commit txns at a time by having a quorum of supporters.
> >> > Each server is going to follow a single leader, so I don't see a problem
> >> > in your scenario with the information you provided. Now if you tell me
> >> > that when you keep sending new transactions to those leaders, both keep
> >> > committing new transactions (not the same txns), then we have a problem.
> >> > I don't see how this can happen, though.
> >> >
> >> > Also, one of the leaders should eventually time out and go back to
> >> > leader election.
> >> >
> >> >> -----Original Message-----
> >> >> From: FPJ [mailto:fpjunqueira@yahoo.com]
> >> >> Sent: 14 June 2013 21:44
> >> >> To: user@zookeeper.apache.org
> >> >> Subject: RE: Rolling config change considered harmful?
> >> >>
> >> >> It is ok to have two servers thinking they are leaders as long as only
> >> >> one is able to commit txns at a time by having a quorum of supporters.
> >> >> Each server is going to follow a single leader, so I don't see a problem
> >> >> in your scenario with the information you provided. Now if you tell me
> >> >> that when you keep sending new transactions to those leaders and they
> >> >> keep committing them forever, both keep committing new transactions,
> >> >> then we have a problem. I don't see how this can happen, though.
> >> >>
> >> >> Also, one of the leaders should eventually time out and go back to
> >> >> leader election.
> >> >>
> >> >> -Flavio
> >> >>
> >> >>> -----Original Message-----
> >> >>> From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
> >> >>> Sent: 14 June 2013 21:10
> >> >>> To: user@zookeeper.apache.org
> >> >>> Subject: Re: Rolling config change considered harmful?
> >> >>>
> >> >>> More on this.
> >> >>>
> >> >>> I just did some testing with wholly contrived scenarios and I was able
> >> >>> to get a cluster in a state where it had two leaders. NOTE: all of this
> >> >>> was done with Curator's TestingCluster
> >> >>>
> >> >>> * Create a 5 node ensemble
> >> >>> * Save the list of instances, directories etc.
> >> >>> * Wait for quorum
> >> >>> * Shut down the cluster
> >> >>> * Restart the ensemble with the same ports and directories.
> >> >>> However, this time, give different server lists to each instance:
> >> >>> * Instance A -> A D E
> >> >>> * Instance B -> A B C
> >> >>> * Instance C -> A B C
> >> >>> * Instance D -> A D E
> >> >>> * Instance E -> A D E
> >> >>>
> >> >>> There is at least one common server amongst all of them. When I
> >> >>> restart the cluster with this configuration I ended up with two
> >> >>> leaders. This state stays consistent after leader election (i.e. it
> >> >>> doesn't try to re-elect).
> >> >>>
> >> >>> A: following
> >> >>> B: leading
> >> >>> C: following
> >> >>> D: leading
> >> >>> E: following
> >> >>>
> >> >>> This may be the correct behavior. i.e. it may be that ZooKeeper cannot
> >> >>> realistically run in this scenario. What it means to me is that
> >> >>> rolling config changes, if too lax, can create chaos.
> >> >>>
> >> >>> -Jordan
> >> >>>
> >> >>> On Jun 14, 2013, at 12:27 PM, "FPJ" wrote:
> >> >>>
> >> >>>> In the case I described, the txn is not reflected in the zookeeper
> >> >>>> state. Say T is a create txn. Once C is elected, it determines the
> >> >>>> initial history of txns for the new epoch that is starting and this
> >> >>>> initial history is not going to include T.
> >> >>>>
> >> >>>> In the example below, I was ignoring the client that triggered T,
> >> >>>> but since it has been acked by a quorum, the client might as well
> >> >>>> have received the confirmation of the operation and think that the
> >> >>>> znode has been created.
> >> >>>>
> >> >>>> -Flavio
> >> >>>>
> >> >>>>> -----Original Message-----
> >> >>>>> From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
> >> >>>>> Sent: 14 June 2013 20:16
> >> >>>>> To: user@zookeeper.apache.org
> >> >>>>> Subject: Re: Rolling config change considered harmful?
> >> >>>>>
> >> >>>>> Yes - save that I'm not sure what happens with a client when a
> >> >>>>> transaction is lost.
> >> >>>>> What is the error to the client? Or are you referring to
> >> >>>>> internal transactions as part of the leader election?
> >> >>>>>
> >> >>>>> -JZ
> >> >>>>>
> >> >>>>> On Jun 14, 2013, at 12:07 PM, "FPJ" wrote:
> >> >>>>>
> >> >>>>>> Not sure if this helps but here is an example:
> >> >>>>>>
> >> >>>>>> - Txn T is acknowledged by A and B (ensemble is {A, B, C})
> >> >>>>>> - Ensemble changes to {B, C, D}
> >> >>>>>> - C and D form a quorum and elect C because it has the highest zxid.
> >> >>>>>>
> >> >>>>>> C won't have T, so the txn gets lost.
> >> >>>>>>
> >> >>>>>> Does it make sense?
> >> >>>>>>
> >> >>>>>> -Flavio
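
The quorum intersection property discussed in this thread can be checked by brute force for small ensembles. What follows is only a minimal sketch, assuming plain majority quorums (no weights or hierarchical groups); the function names are hypothetical and are not part of ZooKeeper or Curator. It enumerates the minimal majorities of the old and new ensembles and reports any pair that shares no server.

```python
from itertools import combinations


def majority_quorums(ensemble):
    """Return every minimal majority of the ensemble as a frozenset.

    Checking minimal majorities is enough: any larger quorum contains one.
    """
    members = sorted(ensemble)
    size = len(members) // 2 + 1
    return [frozenset(q) for q in combinations(members, size)]


def disjoint_quorum_pairs(old, new):
    """List (old_quorum, new_quorum) pairs that have no server in common."""
    return [
        (o, n)
        for o in majority_quorums(old)
        for n in majority_quorums(new)
        if not (o & n)
    ]


def is_safe_transition(old, new):
    """True if every new majority intersects every old majority."""
    return not disjoint_quorum_pairs(old, new)


if __name__ == "__main__":
    # Ensemble changes mentioned in the thread (hypothetical server names).
    steps = [
        ("ABC", "ABCDE"),   # add D and E at once
        ("ABC", "ABCD"),    # add D first
        ("ABCD", "ABCDE"),  # then add E
        ("ABC", "BCD"),     # Flavio's example
    ]
    for old, new in steps:
        verdict = "safe" if is_safe_transition(old, new) else "UNSAFE"
        print(f"{set(old)} -> {set(new)}: {verdict}")
```

Under these assumptions the check agrees with the thread: adding D and E at once is flagged because {A, B} and {C, D, E} are disjoint, while adding D and then E passes at each step. It says nothing about timing, so waiting for leader election to complete between steps is still required.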