hadoop-zookeeper-user mailing list archives

From Henry Robinson <he...@cloudera.com>
Subject Re: Killing a zookeeper server
Date Tue, 12 Jan 2010 23:17:10 GMT
Hi Adam -

As long as a quorum of servers is running, ZK will be live. With majority
quorums, 2/3 is enough to keep going. In general, if fewer than half your
nodes have failed, ZK will keep on keeping on.

The main concern with running on only 2 of 3 machines is that a single
further failure leaves you without a quorum, which brings down the whole
cluster.
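
To make the arithmetic concrete, here is a minimal sketch of the majority-quorum rule described above. It assumes plain majority quorums (no weights or groups), and the function names are made up for illustration; they are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerable_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, servers_up: int) -> bool:
    """True if the running servers still form a majority."""
    return servers_up >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Ensemble sizes mentioned in this thread: 3, 5 and 12.
    for n in (3, 5, 12):
        print(n, "servers: quorum", quorum_size(n),
              "- tolerates", tolerable_failures(n), "failure(s)")
```

Under this rule a 3-server ensemble with one server down still has a quorum of 2, and a 12-server ensemble with 5 down is sitting exactly at its quorum of 7, so one more failure would stop it.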

Henry

2010/1/12 Adam Rosien <adam@rosien.net>

> I have a related question: what's the behavior of a cluster of 3 when
> one is down? I've tried it and a leader is elected, but are there any
> other caveats for this situation?
>
> .. Adam
>
> On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt <phunt@apache.org> wrote:
> > 12 servers? That's a lot; if you don't mind my asking, why so many?
> > Typically we recommend 5 - that way you can have one down for
> > maintenance and still have a failure that doesn't bring down the cluster.
> >
> > The "electing a leader" is probably the restarted machine attempting to
> > re-join the ensemble (it should join as a follower if you have a leader
> > already elected, given that its zxid is behind the existing leader). Hard
> > to tell though without the logs.
> >
> > You might also be seeing the initLimit exceeded; is the data you are
> > storing in ZK large? Or perhaps network connectivity is slow?
> > http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
> > Again, the logs would give some insight on this.
> >
> >
> > Patrick
> >
> > Nick Bailey wrote:
> >>
> >> We are running zookeeper 3.1.0
> >>
> >> Recently we noticed the cpu usage on our machines becoming
> >> increasingly high and we believe the cause is
> >>
> >> https://issues.apache.org/jira/browse/ZOOKEEPER-427
> >>
> >> However our solution when we noticed the problem was to kill the
> >> zookeeper process and restart it.
> >>
> >> After doing that though it looks like the newly restarted zookeeper
> >> server is continually attempting to elect a leader even though one
> >> already exists.
> >>
> >> The process responds with 'imok' when asked, but the stat command
> >> returns 'ZooKeeperServer not running'.
> >>
> >> I believe that killing the current leader should trigger all servers
> >> to do an election and solve the problem, but I'm not sure. Should
> >> that be the course of action in this situation?
> >>
> >> Also we have 12 servers, but 5 are currently not running according to
> >> stat.  So I guess this isn't a problem unless we lose another one.
> >> We have plans to upgrade zookeeper to solve the cpu issue but haven't
> >> been able to do that yet.
> >>
> >> Any help appreciated, Nick Bailey
> >>
> >
>
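
For anyone wanting to script the check Nick describes (a server that answers 'imok' to ruok but reports 'ZooKeeperServer not running' to stat), here is a rough sketch. It assumes the four-letter commands are reachable on the client port (2181 is the usual default). The helper names and the "down" / "not serving" / "serving" labels are my own invention, and the error string is simply the one quoted in this thread.

```python
import socket


def four_letter_word(host: str, port: int, command: str,
                     timeout: float = 5.0) -> str:
    """Send a four-letter command such as 'ruok' or 'stat', return the reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def classify(ruok_reply: str, stat_reply: str) -> str:
    """Hypothetical labels for the states discussed in this thread."""
    if ruok_reply.strip() != "imok":
        return "down"
    if "ZooKeeperServer not running" in stat_reply:
        # Process is alive but has not joined a serving ensemble.
        return "not serving"
    return "serving"


def probe(host: str, port: int = 2181) -> str:
    """Run both commands against one server and classify the result."""
    try:
        ruok = four_letter_word(host, port, "ruok")
        stat = four_letter_word(host, port, "stat")
    except OSError:
        return "down"
    return classify(ruok, stat)
```

Calling probe() against each server in the ensemble would give a quick count of how many are actually serving, which is the number that matters for quorum.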
