Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: zookeeper-user@hadoop.apache.org
Received-SPF: pass (athena.apache.org: local policy)
Message-ID: <439611.7108.qm@web32008.mail.mud.yahoo.com>
References: <24288.10584.qm@web32004.mail.mud.yahoo.com>
 <49CE632C.3080405@yahoo-inc.com>
Date: Mon, 30 Mar 2009 13:31:20 -0700 (PDT)
From: "raghul@yahoo.com" <raghul@yahoo.com>
Subject: Re: Divergence in ZK transaction logs in some corner cases?
To: zookeeper-user@hadoop.apache.org
In-Reply-To: <49CE632C.3080405@yahoo-inc.com>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable


Ben,

Thanks a lot for explaining this.

I have one more corner case in mind where the transaction logs could
diverge. I might be wrong this time as well, but would like to
understand how it works. Reading the Leader.lead() code, it seems like
the new leader reads the last logged zxid and bumps up the higher 32
bits while resetting the lower 32 bits. So this means that cascading
leader crashes without a PROPOSAL in between would make the new leader
choose the same zxid as the one before. This could lead to a corner
case like below:

In an ensemble of 5 servers (A, B, C, D and E), say the zxid is 1,10
(higher 32 bits, lower 32 bits) with A as the leader. Now the
following events happen:

1. A crashes.
2. B is elected the leader. So the zxid of the ensemble moves to 2,0.
   If I read the code correctly, no one logs the new zxid until a new
   PROPOSAL is made. Now B starts a new PROPOSAL (2,1), B logs the
   PROPOSAL and moves to zxid (2,1).
3. B crashes before anyone else receives the PROPOSAL.
4. C is elected as the leader. Since the new zxid depends on the last
   logged zxid (which is still 1,10 according to C's log), the new
   zxid chosen by C is 2,0 as well.
5. Now C starts a new PROPOSAL (2,1), C logs the PROPOSAL and crashes
   before anyone else has received the PROPOSAL. We have diverged logs
   in B and C with the same zxid (2,1).

Could you tell me if this is correct?

Thanks
Raghu


----- Original Message ----
From: Benjamin Reed <breed@yahoo-inc.com>
To: "zookeeper-user@hadoop.apache.org" <zookeeper-user@hadoop.apache.org>
Sent: Saturday, 28 March, 2009 10:49:32
Subject: Re: Divergence in ZK transaction logs in some corner cases?

if recovery worked the way you outline, we would have a problem
indeed. fortunately, we specifically address this case.

the problem is in your first step. when b is elected leader, he will
not propose 10, he will propose 100000000000001. the zxid is made up
of two parts, the high order bits are an epoch number and the low
order bits are a counter. whenever a new leader is elected, he will
increment the epoch number and reset the counter.

when A restarts you have the opposite problem, you need to make sure
that A forgets 10 because we have skipped it and committing it will
mean that 10 is delivered out of order. we take advantage of the epoch
number in that case as well to make sure that A forgets about 10.

there is some discussion about this in:
http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperInternals.html#sc_atomicBroadcast

we have a presentation as well that i'll put up that may make it more
clear.

ben

raghul@yahoo.com wrote:
> ZK gurus,
>
> I think the ZK transaction logs can diverge from one another in some
> corner cases. I have one such corner case listed below, could you
> please confirm if my understanding is correct?
>
> Imagine a 5 server ensemble (A,B,C,D,E). All the servers are @ zxid
> 9. A is the leader and it starts a new PROPOSAL (@zxid 10). A writes
> the proposal to the log, so A moves to zxid 10. Others haven't
> received the PROPOSAL yet and A crashes. Now the following happens:
>
> 1. B is elected as the new leader. B bumps up its in-mem zxid to 10.
>    Since other nodes are at the same zxid, it sends a SNAP so that
>    the others can rebuild their data tree. In-memory zxid of all
>    other nodes moves to 10.
> 2. A comes back now, it accepts B as the leader as soon as the
>    leader (B) and N/2 other nodes vouch for B as the leader. So A
>    joins the ensemble. Every zookeeper node is at zxid 10.
> 3. A new request is submitted to B. B runs PROPOSAL and COMMIT
>    phases and the cluster moves up to zxid 11. But the transaction
>    log of A is different from that of everyone else now. So the
>    transaction logs have diverged.
>
> Could you confirm if this can happen? Or am I reading the code wrong?
>
> Thanks
> Raghu
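
For reference, the zxid arithmetic discussed in this thread can be sketched in a few lines of Python. This is only an illustration of the scheme as described in the messages above (epoch in the high 32 bits, counter in the low 32 bits, new leader bumps the epoch from its last logged zxid). The helper names are hypothetical and are not taken from the ZooKeeper source, and the sketch does not claim to match what Leader.lead() actually does.

```python
# Illustrative only: helper names are assumptions, not ZooKeeper APIs.
EPOCH_SHIFT = 32
COUNTER_MASK = (1 << EPOCH_SHIFT) - 1


def make_zxid(epoch, counter):
    """Pack an (epoch, counter) pair into a single 64-bit zxid."""
    return (epoch << EPOCH_SHIFT) | (counter & COUNTER_MASK)


def split_zxid(zxid):
    """Unpack a zxid into its (epoch, counter) pair."""
    return zxid >> EPOCH_SHIFT, zxid & COUNTER_MASK


def new_leader_zxid(last_logged_zxid):
    """Bump the epoch and reset the counter, starting from the last
    zxid found in the new leader's own log (the reading of the code
    described in the message above)."""
    epoch, _ = split_zxid(last_logged_zxid)
    return make_zxid(epoch + 1, 0)


# Scenario from the message: every surviving log ends at (1,10).
last_logged = make_zxid(1, 10)

# B becomes leader, logs its first PROPOSAL, then crashes.
b_first = new_leader_zxid(last_logged) + 1

# C becomes leader; its own log still ends at (1,10).
c_first = new_leader_zxid(last_logged) + 1

print(split_zxid(b_first), split_zxid(c_first))  # (2, 1) (2, 1)
```

Under these assumptions both B and C would label their first PROPOSAL (2,1), which is the collision the question asks about. Whether the real implementation allows this depends on how the epoch is chosen during election, which this sketch does not model.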