Date: Mon, 30 Mar 2009 13:31:20 -0700 (PDT)
From: "raghul@yahoo.com"
Subject: Re: Divergence in ZK transaction logs in some corner cases?
To: zookeeper-user@hadoop.apache.org

Ben,

Thanks a lot for explaining this.

I have one more corner case in mind where the transaction logs could diverge. I might be wrong this time as well, but I would like to understand how it works. Reading the Leader.lead() code, it seems that the new leader reads the last logged zxid, bumps up the higher 32 bits, and resets the lower 32 bits. This means that cascading leader crashes without a PROPOSAL in between would make the new leader choose the same zxid as the one before. This could lead to a corner case like the one below:

In an ensemble of 5 servers (A, B, C, D and E), say the zxid is 1,10 (higher 32 bits, lower 32 bits) with A as the leader. Now the following events happen:

1. A crashes.
2. B is elected the leader. So the zxid of the ensemble moves to 2,0. If I read the code correctly, no one logs the new zxid until a new PROPOSAL is made.
   Now B starts a new PROPOSAL (2,1), logs it, and moves to zxid (2,1).
3. B crashes before anyone else receives the PROPOSAL.
4. C is elected as the leader. Since the new zxid depends on the last logged zxid (which is still 1,10 according to C's log), the new zxid chosen by C is 2,0 as well.
5. Now C starts a new PROPOSAL (2,1), logs it, and crashes before anyone else has received the PROPOSAL. We now have diverged logs in B and C with the same zxid (2,1).

Could you tell me if this is correct?

Thanks
Raghu

----- Original Message ----
From: Benjamin Reed
To: "zookeeper-user@hadoop.apache.org"
Sent: Saturday, 28 March, 2009 10:49:32
Subject: Re: Divergence in ZK transaction logs in some corner cases?

if recovery worked the way you outline, we would have a problem indeed. fortunately, we specifically address this case.

the problem is in your first step. when b is elected leader, he will not propose 10, he will propose 100000000000001. the zxid is made up of two parts: the high order bits are an epoch number and the low order bits are a counter. whenever a new leader is elected, he will increment the epoch number and reset the counter.

when A restarts you have the opposite problem: you need to make sure that A forgets 10, because we have skipped it and committing it would mean that 10 is delivered out of order. we take advantage of the epoch number in that case as well to make sure that A forgets about 10.

there is some discussion about this in: http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperInternals.html#sc_atomicBroadcast

we have a presentation as well that i'll put up that may make it more clear.

ben

raghul@yahoo.com wrote:
> ZK gurus,
>
> I think the ZK transaction logs can diverge from one another in some corner cases.
> I have one such corner case listed below; could you please confirm if my understanding is correct?
>
> Imagine a 5 server ensemble (A,B,C,D,E). All the servers are at zxid 9. A is the leader and it starts a new PROPOSAL (at zxid 10). A writes the proposal to the log, so A moves to zxid 10. The others haven't received the PROPOSAL yet when A crashes. Now the following happens:
>
> 1. B is elected as the new leader. B bumps up its in-memory zxid to 10. Since the other nodes are at the same zxid, it sends a SNAP so that the others can rebuild their data tree. The in-memory zxid of all other nodes moves to 10.
> 2. A comes back now; it accepts B as the leader as soon as the leader (B) and N/2 other nodes vouch for B as the leader. So A joins the ensemble. Every zookeeper node is at zxid 10.
> 3. A new request is submitted to B. B runs the PROPOSAL and COMMIT phases and the cluster moves up to zxid 11. But the transaction log of A is now different from that of everyone else. So the transaction logs have diverged.
>
> Could you confirm if this can happen? Or am I reading the code wrong?
>
> Thanks
> Raghu
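
For readers following the arithmetic in this thread, here is a minimal sketch of the zxid layout as the messages describe it: an epoch in the high 32 bits and a counter in the low 32 bits. It is not taken from the ZooKeeper source. The function names (`make_zxid`, `split_zxid`, `new_leader_zxid`, `next_proposal_zxid`) are hypothetical, and `new_leader_zxid` assumes the behaviour Raghu reads out of Leader.lead(), namely that the new epoch is derived only from the last logged zxid. Whether the real implementation behaves that way is exactly the question being asked.

```python
# Sketch of the zxid arithmetic described in the thread. NOT ZooKeeper's code;
# all names here are hypothetical and exist only for illustration.

EPOCH_SHIFT = 32
COUNTER_MASK = (1 << EPOCH_SHIFT) - 1


def make_zxid(epoch, counter):
    """Pack an epoch (high 32 bits) and a counter (low 32 bits) into one zxid."""
    return (epoch << EPOCH_SHIFT) | counter


def split_zxid(zxid):
    """Unpack a zxid into (epoch, counter)."""
    return zxid >> EPOCH_SHIFT, zxid & COUNTER_MASK


def new_leader_zxid(last_logged_zxid):
    """Assumed rule from the thread: bump the epoch, reset the counter."""
    epoch, _ = split_zxid(last_logged_zxid)
    return make_zxid(epoch + 1, 0)


def next_proposal_zxid(current_zxid):
    """Each new PROPOSAL takes the next counter value within the epoch."""
    return current_zxid + 1


# Raghu's second scenario: every log ends at (1,10) when A crashes.
last_logged = make_zxid(1, 10)

b_start = new_leader_zxid(last_logged)      # B elected: (2,0)
b_proposal = next_proposal_zxid(b_start)    # (2,1), logged only by B

c_start = new_leader_zxid(last_logged)      # C's log still ends at (1,10)
c_proposal = next_proposal_zxid(c_start)    # (2,1) again under this rule

print(split_zxid(b_proposal), split_zxid(c_proposal))
```

Under the assumed rule, both B and C arrive at (2,1) because each computes its epoch from a log that still ends at (1,10). The sketch only reproduces the arithmetic of the question; it says nothing about what other state a real server may consult when choosing a new epoch.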