Date: Fri, 29 Jan 2010 10:02:13 +0800
Subject: Re: Q about ZK internal: how commit is being remembered
From: Qian Ye
To: zookeeper-user@hadoop.apache.org

Thanks Henry and Ben. Actually, I have read the paper Henry mentioned in
this mail, but I'm still not clear on some of the details. Maybe more
study of the source code will help me understand.

Ben said that "if less than a quorum of servers have accepted a
transaction, we can commit or discard". Could this cause any unexpected
problems? Can you give some hints about this issue?

On Fri, Jan 29, 2010 at 1:09 AM, Benjamin Reed wrote:
> henry is correct. just to state another way, Zab guarantees that if a
> quorum of servers have accepted a transaction, the transaction will
> commit. this means that if less than a quorum of servers have accepted
> a transaction, we can commit or discard. the only constraint we have in
> choosing is ordering. we have to decide which partially accepted
> transactions are going to be committed and which discarded before we
> propose any new messages so that ordering is preserved.
>
> ben
>
> Henry Robinson wrote:
>> Hi -
>>
>> Note that a machine that has the highest received zxid will
>> necessarily have seen the most recent transaction that was logged by
>> a quorum of followers (the FIFO property of TCP again ensures that
>> all previous messages will have been seen).
>> This is the property that ZAB needs to preserve. The idea is to
>> avoid missing a commit that went to a node that has since failed.
>>
>> I was therefore slightly imprecise in my previous mail - it's possible
>> for proposals that were only partially proposed to be committed if the
>> leader that is elected next has seen them. Only when another proposal
>> is committed instead must the original proposal be discarded.
>>
>> I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on
>> the subject, for those with portal.acm.org access:
>> http://portal.acm.org/citation.cfm?id=1529978
>>
>> Henry
>>
>> On 27 January 2010 21:52, Qian Ye wrote:
>>> Hi Henry,
>>>
>>> According to your explanation, "*ZAB makes the guarantee that a
>>> proposal which has been logged by a quorum of followers will
>>> eventually be committed*". However, the ZooKeeper source code
>>> (FastLeaderElection.java) shows that in an election the candidates
>>> only include their zxid in their votes, and the one with the max zxid
>>> wins. It seems that no check is made to ensure that the latest
>>> proposal has been logged by a quorum of servers.
>>>
>>> In that case, ZooKeeper could deliver a proposal that the client has
>>> been told failed. Imagine this scenario: a ZooKeeper cluster with 5
>>> servers, where the leader receives only 1 ACK for proposal A. After a
>>> timeout, the client is told that the proposal failed. Then all
>>> servers restart due to a power failure. The server that has proposal
>>> A in its log would become the leader, even though the client was told
>>> that proposal A failed.
>>>
>>> Am I misunderstanding this?
>>>
>>> On Wed, Jan 27, 2010 at 10:37 AM, Henry Robinson wrote:
>>>> Qing -
>>>>
>>>> That part of the documentation is slightly confusing.
>>>> The elected leader must have the highest zxid that has been written
>>>> to disk by a quorum of followers. ZAB makes the guarantee that a
>>>> proposal which has been logged by a quorum of followers will
>>>> eventually be committed. Conversely, any proposals that *don't* get
>>>> logged by a quorum before the leader sending them dies will not be
>>>> committed. One of the ZAB papers covers both of these situations:
>>>> making sure proposals are committed or skipped at the right moments.
>>>>
>>>> So you get the neat property that leader election can be live in
>>>> exactly the case where the ZK cluster is live. If a quorum of peers
>>>> isn't available to elect the leader, the resulting cluster won't be
>>>> live anyhow, so it's OK for leader election to fail.
>>>>
>>>> FLP impossibility isn't actually strictly relevant for ZAB, because
>>>> FLP requires that message reordering is possible (see all the stuff
>>>> in that paper about non-deterministically drawing messages from a
>>>> potentially deliverable set). TCP FIFO channels don't reorder, so
>>>> they provide the extra signalling that ZAB requires.
>>>>
>>>> cheers,
>>>> Henry
>>>>
>>>> 2010/1/26 Qing Yan
>>>>> Hi,
>>>>>
>>>>> I have a question about how ZooKeeper *remembers* a commit
>>>>> operation.
>>>>>
>>>>> According to
>>>>> http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary
>>>>>
>>>>> The leader will issue a COMMIT to all followers as soon as a
>>>>> quorum of followers have ACKed a message. Since messages are ACKed
>>>>> in order, COMMITs will be sent by the leader as received by the
>>>>> followers in order.
>>>>>
>>>>> COMMITs are processed in order. Followers deliver a proposals
>>>>> message when that proposal is committed.
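The leader-side rule in the documentation quoted above (commit a proposal once a quorum has ACKed it, and send COMMITs in zxid order) can be sketched in Java as follows. This is a minimal illustration under stated assumptions, not ZooKeeper's actual leader code: the class and method names (`CommitTracker`, `propose`, `ack`) are hypothetical, zxids are plain `long` counters, a quorum is assumed to be a strict majority of the ensemble, and the leader's own ACK is assumed to count like any other server's.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeMap;

// Hypothetical sketch of the quoted rule: a proposal commits once a quorum
// has ACKed it, and COMMITs go out strictly in zxid order.
// Not ZooKeeper's Leader class.
class CommitTracker {
    private final int ensembleSize;
    // Outstanding proposals keyed by zxid; TreeMap keeps them in zxid order.
    private final TreeMap<Long, Set<Integer>> acks = new TreeMap<>();
    private final List<Long> committed = new ArrayList<>();

    CommitTracker(int ensembleSize) {
        this.ensembleSize = ensembleSize;
    }

    private int quorum() {
        return ensembleSize / 2 + 1; // assumed: strict majority
    }

    void propose(long zxid) {
        acks.put(zxid, new HashSet<>());
    }

    // Records an ACK from serverId; returns the zxids committed as a result.
    List<Long> ack(long zxid, int serverId) {
        Set<Integer> ackers = acks.get(zxid);
        if (ackers == null) {
            return List.of(); // already committed or never proposed
        }
        ackers.add(serverId);
        List<Long> newlyCommitted = new ArrayList<>();
        // Commit only from the head of the queue, so a later zxid never
        // commits before an earlier one, even if it reached a quorum first.
        while (!acks.isEmpty() && acks.firstEntry().getValue().size() >= quorum()) {
            long head = acks.pollFirstEntry().getKey();
            committed.add(head);
            newlyCommitted.add(head);
        }
        return newlyCommitted;
    }

    List<Long> committed() {
        return committed;
    }

    public static void main(String[] args) {
        CommitTracker leader = new CommitTracker(5); // quorum is 3
        leader.propose(1L);
        leader.propose(2L);
        // Each server ACKs in zxid order, as the FIFO channel guarantees.
        leader.ack(1L, 0); leader.ack(2L, 0); // leader's own ACKs
        leader.ack(1L, 1); leader.ack(2L, 1);
        System.out.println(leader.committed()); // prints []
        System.out.println(leader.ack(1L, 2));  // prints [1]
        System.out.println(leader.ack(2L, 2));  // prints [2]
    }
}
```

Note that nothing in the sketch waits for followers to confirm a COMMIT: the leader only counts ACKs of the proposal itself, which is the asynchrony Qing asks about.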
>>>>>
>>>>> My question is: will the leader wait for the COMMIT to be
>>>>> processed by a quorum of followers before considering the COMMIT
>>>>> successful? From the documentation it seems that the leader
>>>>> handles COMMIT asynchronously and doesn't expect confirmation from
>>>>> followers. In the extreme case, what happens if the leader issues
>>>>> a COMMIT to all followers and crashes immediately, before the
>>>>> COMMIT message can go out on the network? How does the system
>>>>> remember that the COMMIT ever happened?
>>>>>
>>>>> Actually, this is related to the leader election process:
>>>>>
>>>>> ZooKeeper messaging doesn't care about the exact method of
>>>>> electing a leader as long as the following holds:
>>>>>
>>>>> - The leader has seen the highest zxid of all the followers.
>>>>> - A quorum of servers have committed to following the leader.
>>>>>
>>>>> Of these two requirements only the first, the highest zxid among
>>>>> the followers, needs to hold for correct operation.
>>>>>
>>>>> Is there a liveness issue in trying to ensure that "The leader has
>>>>> seen the highest zxid of all the followers"? What if some of the
>>>>> followers (which happen to hold the highest zxid) cannot be
>>>>> contacted (cf. the FLP impossibility result)? It would be more
>>>>> straightforward if COMMIT required confirmation from a quorum of
>>>>> the followers. But I guess things get optimized according to Zab's
>>>>> FIFO nature... I just want to hear some clarification about it.
>>>>>
>>>>> Thanks a lot!
>>>
>>> --
>>> With Regards!
>>>
>>> Ye, Qian
>>> Made in Zhejiang University

--
With Regards!

Ye, Qian
Made in Zhejiang University
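To see why electing the server with the highest zxid is enough (Henry's point) and why a proposal logged by only a minority can end up either committed or discarded (Ben's point), here is a minimal Java sketch that enumerates every quorum of a 5-server ensemble. It is a hypothetical model, not ZooKeeper code: each server is reduced to its last logged zxid, logs are assumed to be prefixes of the same proposal sequence (the FIFO property discussed above), votes are ordered by (zxid, server id) as a simplification of FastLeaderElection's comparison, and the new leader's log is assumed to decide which partially accepted proposals survive. The names `elect` and `survivalCount` are illustrative only.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical model (not ZooKeeper code). Logs are prefixes of one
// proposal sequence, so a server whose last zxid is >= z also holds z.
class ElectionSketch {

    // Returns the voter with the highest (lastZxid, server id): a simplified
    // stand-in for FastLeaderElection's vote ordering.
    static int elect(long[] lastZxid, List<Integer> voters) {
        int best = voters.get(0);
        for (int sid : voters) {
            if (lastZxid[sid] > lastZxid[best]
                    || (lastZxid[sid] == lastZxid[best] && sid > best)) {
                best = sid;
            }
        }
        return best;
    }

    // For every majority subset of servers that could hold the election,
    // checks whether the elected leader's log contains proposal z.
    // Returns {quorums in which z survives, total quorums}.
    static int[] survivalCount(long[] lastZxid, long z) {
        int n = lastZxid.length;
        int survived = 0;
        int quorums = 0;
        for (int mask = 1; mask < (1 << n); mask++) {
            if (Integer.bitCount(mask) <= n / 2) {
                continue; // not a majority, so no leader can be elected
            }
            List<Integer> voters = new ArrayList<>();
            for (int sid = 0; sid < n; sid++) {
                if ((mask & (1 << sid)) != 0) {
                    voters.add(sid);
                }
            }
            quorums++;
            if (lastZxid[elect(lastZxid, voters)] >= z) {
                survived++;
            }
        }
        return new int[] {survived, quorums};
    }

    public static void main(String[] args) {
        // Proposal 5 was logged by servers 0, 1 and 2, a majority of 5.
        long[] quorumLogged = {5, 5, 5, 4, 4};
        System.out.println(Arrays.toString(survivalCount(quorumLogged, 5))); // prints [16, 16]
        // Proposal A (zxid 4) was logged only by servers 0 and 1.
        long[] minorityLogged = {4, 4, 3, 3, 3};
        System.out.println(Arrays.toString(survivalCount(minorityLogged, 4))); // prints [15, 16]
    }
}
```

The quorum-logged proposal survives in all 16 possible quorums because every majority overlaps the servers that logged it. The minority-logged proposal A, as in Qian's scenario, survives in 15 of 16 and is discarded only when servers 2, 3 and 4 alone elect the leader. Either outcome preserves ordering, which is the freedom Ben describes; real Zab additionally syncs followers to the new leader's history before any new proposal is made, and the sketch does not model that step.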