Mailing-List: contact zookeeper-user-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: zookeeper-user@hadoop.apache.org
Received-SPF: pass (nike.apache.org: local policy)
MIME-Version: 1.0
In-Reply-To: <cf67d0ac1001121514o2c1452a5g74c80790404409ea@mail.gmail.com>
References: <1263333167.308920598@192.168.2.230> <4B4CFA57.2030008@apache.org>
	 <cf67d0ac1001121514o2c1452a5g74c80790404409ea@mail.gmail.com>
Date: Tue, 12 Jan 2010 15:17:10 -0800
Message-ID: <b54edc3a1001121517w57257965xe0fd27894b870b3b@mail.gmail.com>
Subject: Re: Killing a zookeeper server
From: Henry Robinson <henry@cloudera.com>
To: zookeeper-user@hadoop.apache.org
Content-Type: multipart/alternative; boundary=0016e6d7e6817a6509047cffda66

--0016e6d7e6817a6509047cffda66
Content-Type: text/plain; charset=ISO-8859-1

Hi Adam -

As long as a quorum of servers is running, ZK will be live. With majority
quorums, 2 out of 3 servers is enough to keep going. In general, as long as
more than half of your servers are up, ZK will keep serving requests.

The main concern with a 3-server ensemble running on only 2 machines is that
a single further failure loses quorum and makes the whole cluster unavailable.
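
To make the arithmetic concrete, here is a rough sketch in Python. It assumes
plain majority quorums (no weights or groups), and the function names are made
up for illustration; they are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the configured ensemble."""
    return ensemble_size // 2 + 1


def is_live(ensemble_size: int, running: int) -> bool:
    """True if enough servers are up to form a majority quorum."""
    return running >= quorum_size(ensemble_size)


def tolerable_failures(ensemble_size: int) -> int:
    """How many servers can be down while a quorum still exists."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Ensemble sizes mentioned in this thread: 3, 5 and 12.
    for n in (3, 5, 12):
        print(f"ensemble={n} quorum={quorum_size(n)} "
              f"tolerable_failures={tolerable_failures(n)}")
```

Under these assumptions, a 3-server ensemble with 2 running is live with no
failures to spare, and a 12-server ensemble with 5 down is in the same spot.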

Henry

2010/1/12 Adam Rosien <adam@rosien.net>

> I have a related question: what's the behavior of a cluster of 3 when
> one is down? I've tried it and a leader is elected, but are there any
> other caveats for this situation?
>
> .. Adam
>
> On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt <phunt@apache.org> wrote:
> > 12 servers? That's a lot; if you don't mind my asking, why so many?
> > Typically we recommend 5 - that way you can have one down for maintenance
> > and still have a failure that doesn't bring down the cluster.
> >
> > The "electing a leader" is probably the restarted machine attempting to
> > re-join the ensemble (it should join as a follower if you have a leader
> > already elected, given that its zxid is behind the existing leader's).
> > Hard to tell though without the logs.
> >
> > You might also be seeing initLimit exceeded. Is the data you are storing
> > in ZK large? Or perhaps network connectivity is slow?
> >
> > http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
> > Again, the logs would give some insight on this.
> >
> >
> > Patrick
> >
> > Nick Bailey wrote:
> >>
> >> We are running zookeeper 3.1.0
> >>
> >> Recently we noticed the cpu usage on our machines becoming
> >> increasingly high and we believe the cause is
> >>
> >> https://issues.apache.org/jira/browse/ZOOKEEPER-427
> >>
> >> However our solution when we noticed the problem was to kill the
> >> zookeeper process and restart it.
> >>
> >> After doing that though it looks like the newly restarted zookeeper
> >> server is continually attempting to elect a leader even though one
> >> already exists.
> >>
> >> The process responds with 'imok' when asked, but the stat command
> >> returns 'ZooKeeperServer not running'.
> >>
> >> I believe that killing the current leader should trigger all servers
> >> to do an election and solve the problem, but I'm not sure. Should
> >> that be the course of action in this situation?
> >>
> >> Also we have 12 servers, but 5 are currently not running according to
> >> stat.  So I guess this isn't a problem unless we lose another one.
> >> We have plans to upgrade zookeeper to solve the cpu issue but haven't
> >> been able to do that yet.
> >>
> >> Any help appreciated, Nick Bailey
> >>
> >
>

--0016e6d7e6817a6509047cffda66--