From: Qian Ye
To: zookeeper-user@hadoop.apache.org
Subject: Re: Q about ZK internal: how commit is being remembered
Date: Fri, 29 Jan 2010 10:45:02 +0800

Thanks Mahadev, I see what you mean.

On Fri, Jan 29, 2010 at 10:06 AM, Mahadev Konar wrote:

> Qian,
>
> ZooKeeper guarantees that if a client sees a transaction response, then
> that transaction will persist, but the ones that a client does not see
> might be either discarded or committed. So if a quorum does not log the
> transaction, a ZooKeeper server which does not have the logged
> transaction may become the leader (because the machines with the logged
> transaction are down). In that case the transaction is discarded. If a
> machine which has the logged transaction becomes the leader, that
> transaction will be committed.
>
> Hope that clears your doubt.
>
> mahadev
>
>
> On 1/28/10 6:02 PM, "Qian Ye" wrote:
>
> > Thanks henry and ben, actually I have read the paper henry mentioned in
> > this mail, but I'm still not so clear on some of the details. Anyway,
> > maybe more study of the source code can help me understand. Since Ben
> > said that "if less than a quorum of servers have accepted a transaction,
> > we can commit or discard", would this feature cause any unexpected
> > problems? Can you give some hints about this issue?
> >
> > On Fri, Jan 29, 2010 at 1:09 AM, Benjamin Reed wrote:
> >
> >> henry is correct.
> >> just to state another way, Zab guarantees that if a quorum of servers
> >> have accepted a transaction, the transaction will commit. this means
> >> that if less than a quorum of servers have accepted a transaction, we
> >> can commit or discard. the only constraint we have in choosing is
> >> ordering. we have to decide which partially accepted transactions are
> >> going to be committed and which discarded before we propose any new
> >> messages so that ordering is preserved.
> >>
> >> ben
> >>
> >>
> >> Henry Robinson wrote:
> >>
> >>> Hi -
> >>>
> >>> Note that a machine that has the highest received zxid will necessarily
> >>> have seen the most recent transaction that was logged by a quorum of
> >>> followers (the FIFO property of TCP again ensures that all previous
> >>> messages will have been seen). This is the property that ZAB needs to
> >>> preserve. The idea is to avoid missing a commit that went to a node
> >>> that has since failed.
> >>>
> >>> I was therefore slightly imprecise in my previous mail - it's possible
> >>> for only partially-proposed proposals to be committed if the leader
> >>> that is elected next has seen them. Only when another proposal is
> >>> committed instead must the original proposal be discarded.
> >>>
> >>> I highly recommend Ben Reed's and Flavio Junqueira's LADIS paper on the
> >>> subject, for those with portal.acm.org access:
> >>> http://portal.acm.org/citation.cfm?id=1529978
> >>>
> >>> Henry
> >>>
> >>> On 27 January 2010 21:52, Qian Ye wrote:
> >>>
> >>>> Hi Henry:
> >>>>
> >>>> According to your explanation, "*ZAB makes the guarantee that a
> >>>> proposal which has been logged by a quorum of followers will
> >>>> eventually be committed*". However, the ZooKeeper source code, in
> >>>> FastLeaderElection.java, shows that in the election the candidates
> >>>> only provide their zxid in the votes, and the one with the max zxid
> >>>> wins the election. It seems that no check is made to make sure that
> >>>> the latest proposal has been logged by a quorum of servers.
> >>>>
> >>>> In this situation, ZooKeeper would deliver a proposal that the client
> >>>> believes has failed. Imagine this scenario: a ZooKeeper cluster with
> >>>> 5 servers, where the leader receives only 1 ack for proposal A, and
> >>>> after a timeout the client is told that the proposal failed. At this
> >>>> point, all servers restart due to a power failure. The server that has
> >>>> proposal A in its log would become the leader, even though the client
> >>>> was told that proposal A failed.
> >>>>
> >>>> Do I misunderstand this?
> >>>>
> >>>>
> >>>> On Wed, Jan 27, 2010 at 10:37 AM, Henry Robinson wrote:
> >>>>
> >>>>> Qing -
> >>>>>
> >>>>> That part of the documentation is slightly confusing. The elected
> >>>>> leader must have the highest zxid that has been written to disk by a
> >>>>> quorum of followers. ZAB makes the guarantee that a proposal which
> >>>>> has been logged by a quorum of followers will eventually be committed.
> >>>>> Conversely, any proposals that *don't* get logged by a quorum before
> >>>>> the leader sending them dies will not be committed. One of the ZAB
> >>>>> papers covers both these situations - making sure proposals are
> >>>>> committed or skipped at the right moments.
> >>>>>
> >>>>> So you get the neat property that leader election can be live in
> >>>>> exactly the case where the ZK cluster is live. If a quorum of peers
> >>>>> aren't available to elect the leader, the resulting cluster won't be
> >>>>> live anyhow, so it's ok for leader election to fail.
> >>>>>
> >>>>> FLP impossibility isn't actually strictly relevant for ZAB, because
> >>>>> FLP requires that message reordering is possible (see all the stuff
> >>>>> in that paper about non-deterministically drawing messages from a
> >>>>> potentially deliverable set). TCP FIFO channels don't reorder, so
> >>>>> provide the extra signalling that ZAB requires.
> >>>>>
> >>>>> cheers,
> >>>>> Henry
> >>>>>
> >>>>> 2010/1/26 Qing Yan
> >>>>>
> >>>>>> Hi,
> >>>>>>
> >>>>>> I have a question about how ZooKeeper *remembers* a commit operation.
> >>>>>>
> >>>>>> According to
> >>>>>> http://hadoop.apache.org/zookeeper/docs/r3.2.2/zookeeperInternals.html#sc_summary
> >>>>>>
> >>>>>> The leader will issue a COMMIT to all followers as soon as a quorum
> >>>>>> of followers have ACKed a message. Since messages are ACKed in order,
> >>>>>> COMMITs will be sent by the leader as received by the followers in
> >>>>>> order.
> >>>>>>
> >>>>>> COMMITs are processed in order. Followers deliver a proposals message
> >>>>>> when that proposal is committed.
> >>>>>>
> >>>>>> My question is: will the leader wait for the COMMIT to be processed
> >>>>>> by a quorum of followers before considering the COMMIT a success?
> >>>>>> From the documentation it seems that the leader handles COMMIT
> >>>>>> asynchronously and doesn't expect confirmation from followers. In the
> >>>>>> extreme case, what happens if the leader issues a COMMIT to all
> >>>>>> followers and crashes immediately, before the COMMIT message can go
> >>>>>> out on the network? How does the system remember that the COMMIT ever
> >>>>>> happened?
> >>>>>>
> >>>>>> Actually this is related to the leader election process:
> >>>>>>
> >>>>>> ZooKeeper messaging doesn't care about the exact method of electing a
> >>>>>> leader as long as the following holds:
> >>>>>>
> >>>>>> - The leader has seen the highest zxid of all the followers.
> >>>>>> - A quorum of servers have committed to following the leader.
> >>>>>>
> >>>>>> Of these two requirements only the first, the highest zxid among the
> >>>>>> followers, needs to hold for correct operation.
> >>>>>>
> >>>>>> Is there a liveness issue in trying to find "The leader has seen the
> >>>>>> highest zxid of all the followers"? What if some of the followers
> >>>>>> (which happen to hold the highest zxid) cannot be contacted (FLP
> >>>>>> impossibility result?) It would be more straightforward if COMMIT
> >>>>>> required confirmation from a quorum of the followers. But I guess
> >>>>>> things get optimized according to Zab's FIFO nature... just want to
> >>>>>> hear some clarification about it.
> >>>>>>
> >>>>>> Thanks a lot!
> >>>>>>
> >>>>
> >>>> --
> >>>> With Regards!
> >>>>
> >>>> Ye, Qian
> >>>> Made in Zhejiang University

--
With Regards!

Ye, Qian
Made in Zhejiang University
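
For readers who want to check the argument in this thread, here is a minimal Python sketch of the 5-server scenario discussed above. It is a toy model under stated assumptions, not ZooKeeper code: zxids are plain integers, the election simply picks the highest logged zxid among whichever servers form the voting quorum (ties broken by server id), and the names `elect_leader` and `outcomes` are hypothetical.

```python
from itertools import combinations


def elect_leader(last_zxid, voters):
    """Toy election: the voter with the highest logged zxid wins.

    Ties are broken by the larger server id (an assumption of this sketch).
    """
    return max(voters, key=lambda sid: (last_zxid[sid], sid))


def outcomes(last_zxid, zxid_of_a):
    """Return every possible fate of proposal A across all voting quorums.

    last_zxid maps server id -> highest zxid in that server's log.
    A is committed if the elected leader has logged it, discarded otherwise.
    """
    quorum = len(last_zxid) // 2 + 1
    fates = set()
    for voters in combinations(sorted(last_zxid), quorum):
        leader = elect_leader(last_zxid, voters)
        if last_zxid[leader] >= zxid_of_a:
            fates.add("committed")
        else:
            fates.add("discarded")
    return fates


# Hypothetical logs: proposal A has zxid 8, zxid 7 is logged everywhere.
partially_logged = {1: 8, 2: 8, 3: 7, 4: 7, 5: 7}  # leader + 1 ack, no quorum
quorum_logged = {1: 8, 2: 8, 3: 8, 4: 7, 5: 7}     # 3 of 5 servers logged A

print(sorted(outcomes(partially_logged, 8)))
print(sorted(outcomes(quorum_logged, 8)))
```

With only the leader plus one ack, the fate of A depends on which servers take part in the election, which is the "commit or discard" case Ben and Mahadev describe. Once three of the five servers have logged A, every quorum contains at least one of them, so the elected leader in this model always has A and it is always committed.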