Subject: Re: Fully Elastic Zookeeper Ensembles
From: Alexander Shraer
Date: Mon, 22 Apr 2013 15:08:07 -0700
To: user@zookeeper.apache.org

Hi Dave,

Dynamic reconfiguration is indeed not currently possible from/to an ensemble containing a single server. Currently, a single server means "standalone" mode, but we believe that's not really necessary, so the following JIRAs propose to eliminate/disable this mode. Once that happens, dynamic reconfiguration should work with a single member as well:

https://issues.apache.org/jira/browse/ZOOKEEPER-1692
https://issues.apache.org/jira/browse/ZOOKEEPER-1691
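For illustration, here is roughly what an incremental reconfiguration looks like with the client-side API from ZOOKEEPER-107 (the ZooKeeperAdmin interface that eventually shipped in the 3.5 line; still unreleased as of this thread). The host names, ports, and server ids below are made up:

    // Illustrative sketch only: assumes the 3.5-era ZooKeeperAdmin API
    // (ZOOKEEPER-107); hosts, ports, and server ids are invented.
    import org.apache.zookeeper.admin.ZooKeeperAdmin;
    import org.apache.zookeeper.data.Stat;

    public class ReconfigSketch {
        public static void main(String[] args) throws Exception {
            // Connect to the current ensemble; the no-op lambda is the watcher.
            ZooKeeperAdmin zk = new ZooKeeperAdmin(
                    "host1:2181,host2:2181,host3:2181", 30000, event -> { });

            Stat stat = new Stat();
            // Incremental reconfig: add server 4, remove server 2.
            // fromConfig = -1 means "don't condition on a config version".
            byte[] newConfig = zk.reconfigure(
                    "server.4=host4:2888:3888:participant;2181", // joining
                    "2",                                         // leaving
                    null,                                        // bulk mode unused
                    -1, stat);
            System.out.println("new config: " + new String(newConfig));
            zk.close();
        }
    }

The same operation is also exposed in zkCli.sh as the reconfig command, e.g. "reconfig -add server.4=host4:2888:3888;2181 -remove 2".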
Regarding loss of quorum - as you say, if you lose a quorum you can lose state too. If you want to recover from a loss of quorum, you must be sure that the quorum isn't just disconnected somewhere (split brain). That requires knowing that the quorum is actually down, which is not theoretically possible in an asynchronous system, since communication problems can be indistinguishable from failures. So you're right, avoiding split brain probably requires some kind of admin declaring that it's OK to go on, or synchrony/failure-detection assumptions.

Regarding state reconciliation - during FastLeaderElection, one of the servers sees that it has the most up-to-date history prefix among a quorum that talks to it. It then does a state sync with a quorum so that they have this prefix, and finally commits the prefix to a quorum. Any server connecting to the leader after the commit will have its history truncated to match the leader's. In the normal case, we know that nothing this late server has could have been committed, since the leader talked with a quorum. In the case you're describing, where we lost a quorum, this is not necessarily true, so we may lose data. (A simplified sketch of the "most up-to-date" comparison appears after the quoted message below.)

If you want more details about recovery, you can read the Zab paper, or see the short description on page 3 here:
http://www.cs.technion.ac.il/~shralex/zkreconfig.pdf

Alex

On Mon, Apr 22, 2013 at 2:44 PM, Dave Katz wrote:

> I've been thinking about the implications of running Zookeeper in a fully
> dynamic distributed system, in which the number of nodes can be as few as
> one, or can be quite large. This has led to a few questions.
>
> The dynamic server reconfiguration work appears to require a working
> quorum of servers under the old config in order to distribute the new
> config. This implies that the mechanism cannot be used if a quorum is lost
> (a common-mode failure across many servers). This leads to the obvious
> question: how does one recover from a (semi-)permanent loss of quorum?
> This would seem to require the HOG (Hand Of God) approach, with an
> external agent restarting the ZK servers with a new (shorter) server list.
> Presumably, the loss of quorum means a potential loss of state, since
> updates may not have made it to any of the surviving servers.
>
> If servers come to the ensemble with disparate contents, how does ZK
> converge on the new state? From what I've been able to read, it appears
> that all servers will end up converging to the state of the newly elected
> leader (and so any divergent contents on other nodes are discarded). Is
> this the case?
>
> If the system is to be fully dynamic, we have to deal with the two-node
> problem. How best to do this? In a two-node ensemble, if one of the nodes
> fails, the other node is guaranteed to be consistent, true? So if there is
> an external mechanism to prevent split brain, it should be possible to
> restart the surviving node in standalone mode, and once the second node
> returns, restarting both nodes should still guarantee consistency, yes?
>
> Thanks in advance,
>
> --Dave
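To make the "most up-to-date history prefix" comparison above concrete, here is a simplified sketch of the vote ordering FastLeaderElection uses (totalOrderPredicate in FastLeaderElection.java, paraphrased): the election epoch dominates, then the last logged zxid, then the server id breaks ties.

    // Simplified sketch of FastLeaderElection's vote ordering: returns
    // true if the "new" vote describes a more up-to-date history than
    // the current one. Epoch dominates, then zxid, then server id.
    static boolean supersedes(long newEpoch, long newZxid, long newSid,
                              long curEpoch, long curZxid, long curSid) {
        if (newEpoch != curEpoch) {
            return newEpoch > curEpoch;
        }
        if (newZxid != curZxid) {
            return newZxid > curZxid;
        }
        return newSid > curSid;
    }

Note that this only orders the votes; the truncation itself happens afterwards, when the new leader syncs each follower (with a DIFF, TRUNC, or SNAP, depending on how the follower's log relates to the leader's).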