Subject: Re: Fully Elastic Zookeeper Ensembles
From: Alexander Shraer
Date: Mon, 22 Apr 2013 15:08:07 -0700
To: user@zookeeper.apache.org

Hi Dave,

Dynamic reconfiguration is indeed not currently possible from/to an ensemble containing a single server. Currently, a single server means "standalone" mode, but we believe that's not really necessary, so the following JIRAs propose to eliminate/disable this mode. Once that happens, dynamic reconfiguration should work with a single member as well:

https://issues.apache.org/jira/browse/ZOOKEEPER-1692
https://issues.apache.org/jira/browse/ZOOKEEPER-1691
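For illustration, here is roughly what an incremental reconfiguration looks like with the client-side API from ZOOKEEPER-107 (the ZooKeeperAdmin interface that eventually shipped in the 3.5 line; still unreleased as of this thread). The host names, ports, and server ids below are made up:

    // Illustrative sketch only: assumes the 3.5-era ZooKeeperAdmin API
    // (ZOOKEEPER-107); hosts, ports, and server ids are invented.
    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    public class ReconfigSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the current ensemble; the no-op lambda is the watcher.
            ZooKeeperAdmin zk = new ZooKeeperAdmin(
                    "host1:2181,host2:2181,host3:2181", 30000, event -> { });

            Stat stat = new Stat();
            // Incremental reconfig: add server 4, remove server 2.
            // fromConfig = -1 means "don't condition on a config version".
            byte[] newConfig = zk.reconfigure(
                    "server.4=host4:2888:3888:participant;2181", // joining
                    "2",                                         // leaving
                    null,                                        // bulk mode unused
                    -1, stat);
            System.out.println("new config: " + new String(newConfig));
            zk.close();
        }
    }

The same operation is also exposed in zkCli.sh as the reconfig command, e.g. "reconfig -add server.4=host4:2888:3888;2181 -remove 2".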
Regarding loss of quorum - as you say, if you lose a quorum you can lose state too. If you want to recover from a loss of quorum, you must be sure that the quorum isn't just disconnected somewhere (split brain). That requires knowing that the quorum is actually down, which is not theoretically possible in an asynchronous system, since communication problems can be indistinguishable from failures. So you're right, avoiding split brain probably requires some kind of admin declaring that it's OK to go on, or synchrony/failure-detection assumptions.

Regarding state reconciliation - during FastLeaderElection, one of the servers sees that it has the most up-to-date history prefix among a quorum that talks to it. It then does a state sync with a quorum so that they have this prefix, and finally commits the prefix to a quorum. Any server connecting to the leader after the commit will have its history truncated to match the leader's. In the normal case, we know that nothing this late server has could have been committed, since the leader talked with a quorum. In the case you're describing, where we lost a quorum, this is not necessarily true, so we may lose data. (A simplified sketch of the "most up-to-date" comparison appears after the quoted message below.)

If you want more details about recovery, you can read the Zab paper, or see the short description on page 3 here:
http://www.cs.technion.ac.il/~shralex/zkreconfig.pdf

Alex

On Mon, Apr 22, 2013 at 2:44 PM, Dave Katz wrote:

> I've been thinking about the implications of running Zookeeper in a fully
> dynamic distributed system, in which the number of nodes can be as few as
> one, or can be quite large. This has led to a few questions.
>
> The dynamic server reconfiguration work appears to require a working
> quorum of servers under the old config in order to distribute the new
> config. This implies that the mechanism cannot be used if a quorum is lost
> (a common-mode failure across many servers). This leads to the obvious
> question: how does one recover from a (semi-)permanent loss of quorum?
> This would seem to require the HOG (Hand Of God) approach, with an
> external agent restarting the ZK servers with a new (shorter) server list.
> Presumably, the loss of quorum means a potential loss of state, since
> updates may not have made it to any of the surviving servers.
>
> If servers come to the ensemble with disparate contents, how does ZK
> converge on the new state? From what I've been able to read, it appears
> that all servers will end up converging to the state of the newly elected
> leader (and so any divergent contents on other nodes are discarded). Is
> this the case?
>
> If the system is to be fully dynamic, we have to deal with the two-node
> problem. How best to do this? In a two-node ensemble, if one of the nodes
> fails, the other node is guaranteed to be consistent, true? So if there is
> an external mechanism to prevent split brain, it should be possible to
> restart the surviving node in standalone mode, and once the second node
> returns, restarting both nodes should still guarantee consistency, yes?
>
> Thanks in advance,
>
> --Dave
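To make the "most up-to-date history prefix" comparison above concrete, here is a simplified sketch of the vote ordering FastLeaderElection uses (totalOrderPredicate in FastLeaderElection.java, paraphrased): the election epoch dominates, then the last logged zxid, then the server id breaks ties.

    // Simplified sketch of FastLeaderElection's vote ordering: returns
    // true if the "new" vote describes a more up-to-date history than
    // the current one. Epoch dominates, then zxid, then server id.
    static boolean supersedes(long newEpoch, long newZxid, long newSid,
                              long curEpoch, long curZxid, long curSid) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newSid > curSid;
    }

Note that this only orders the votes; the truncation itself happens afterwards, when the new leader syncs each follower (with a DIFF, TRUNC, or SNAP, depending on how the follower's log relates to the leader's).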