From: Benjamin Reed
Date: Sat, 28 Mar 2009 10:49:32 -0700
To: "zookeeper-user@hadoop.apache.org"
Subject: Re: Divergence in ZK transaction logs in some corner cases?
Message-ID: <49CE632C.3080405@yahoo-inc.com>

if recovery worked the way you outline, we would have a problem indeed. fortunately, we specifically address this case.

the problem is in your first step. when B is elected leader, he will not propose 10, he will propose 100000000000001. the zxid is made up of two parts: the high order bits are an epoch number and the low order bits are a counter. whenever a new leader is elected, he will increment the epoch number and reset the counter.

when A restarts you have the opposite problem: you need to make sure that A forgets 10, because we have skipped it and committing it would mean that 10 is delivered out of order. we take advantage of the epoch number in that case as well to make sure that A forgets about 10.

there is some discussion about this in:
http://hadoop.apache.org/zookeeper/docs/r3.1.1/zookeeperInternals.html#sc_atomicBroadcast

we have a presentation as well that i'll put up that may make it more clear.

ben

raghul@yahoo.com wrote:
> ZK gurus,
>
> I think the ZK transaction logs can diverge from one another in some corner cases. I have one such corner case listed below; could you please confirm if my understanding is correct?
>
> Imagine a 5 server ensemble (A,B,C,D,E). All the servers are @ zxid 9. A is the leader and it starts a new PROPOSAL (@zxid 10). A writes the proposal to the log, so A moves to zxid 10. The others haven't received the PROPOSAL yet and A crashes. Now the following happens:
>
> 1. B is elected as the new leader. B bumps up its in-mem zxid to 10. Since the other nodes are at the same zxid, it sends a SNAP so that the others can rebuild their data tree. The in-memory zxid of all other nodes moves to 10.
>
> 2. A comes back now. It accepts B as the leader as soon as the leader (B) and N/2 other nodes vouch for B as the leader. So A joins the ensemble. Every zookeeper node is at zxid 10.
>
> 3. A new request is submitted to B. B runs the PROPOSAL and COMMIT phases and the cluster moves up to zxid 11. But the transaction log of A is different from that of everyone else now. So the transaction logs have diverged.
>
> Could you confirm if this can happen? Or am I reading the code wrong?
>
> Thanks
> Raghu
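
The epoch/counter layout described in the reply can be sketched roughly as follows. This is an illustration only, not ZooKeeper's code: the 32/32 bit split, the function names, and the entries_to_forget rule are assumptions made for the example (the reply only says "high order bits" and "low order bits"), and the real synchronization between leader and follower is more involved.

```python
# Hypothetical sketch of an epoch/counter zxid.
# Assumption: 64-bit zxid, 32-bit epoch in the high bits, 32-bit counter
# in the low bits. The message itself does not give the widths.
COUNTER_BITS = 32
COUNTER_MASK = (1 << COUNTER_BITS) - 1


def make_zxid(epoch, counter):
    """Pack an epoch and a counter into one integer zxid."""
    return (epoch << COUNTER_BITS) | (counter & COUNTER_MASK)


def epoch_of(zxid):
    """High-order part: which leader's reign produced this zxid."""
    return zxid >> COUNTER_BITS


def counter_of(zxid):
    """Low-order part: position within that leader's reign."""
    return zxid & COUNTER_MASK


def first_zxid_of_new_leader(last_seen_zxid):
    """A new leader increments the epoch and resets the counter.

    Assumption: the first proposal of the new epoch uses counter 1.
    """
    return make_zxid(epoch_of(last_seen_zxid) + 1, 1)


def entries_to_forget(follower_log, leader_log, leader_epoch):
    """Simplified rule (an assumption, not the real protocol): a follower
    drops logged zxids from an older epoch that the leader never saw."""
    known = set(leader_log)
    return [z for z in follower_log
            if z not in known and epoch_of(z) < leader_epoch]


# Scenario from the thread: everyone is at zxid 9 in epoch 0,
# A logs 10 and crashes before anyone else receives it.
a_log = [make_zxid(0, 8), make_zxid(0, 9), make_zxid(0, 10)]
b_first = first_zxid_of_new_leader(make_zxid(0, 9))
b_log = [make_zxid(0, 8), make_zxid(0, 9), b_first]

print(hex(b_first))                                     # 0x100000001
print(entries_to_forget(a_log, b_log, epoch_of(b_first)))  # [10]
```

Because B's first proposal carries a higher epoch, it never reuses zxid 10, and the stale entry in A's log can be recognized by its older epoch when A rejoins.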