From: Henry Robinson
To: zookeeper-user@hadoop.apache.org
Reply-To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server
Date: Tue, 12 Jan 2010 15:17:10 -0800
References: <1263333167.308920598@192.168.2.230> <4B4CFA57.2030008@apache.org>
MIME-Version: 1.0
Content-Type: multipart/alternative; boundary=0016e6d7e6817a6509047cffda66

--0016e6d7e6817a6509047cffda66
Content-Type: text/plain; charset=ISO-8859-1

Hi Adam -

As long as a quorum of servers is running, ZK will be live.
With majority quorums, 2 of 3 servers is enough to keep going. In general,
as long as fewer than half of your nodes have failed, ZK will keep on keeping on.

The main concern with running on 2 of 3 machines is that a single further
failure will bring down the whole cluster.

Henry

2010/1/12 Adam Rosien

> I have a related question: what's the behavior of a cluster of 3 when
> one is down? I've tried it and a leader is elected, but are there any
> other caveats for this situation?
>
> .. Adam
>
> On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt wrote:
> > 12 servers? That's a lot, if you don't mind my asking, why so many?
> > Typically we recommend 5 - that way you can have one down for
> > maintenance and still have a failure that doesn't bring down the cluster.
> >
> > The "electing a leader" is probably the restarted machine attempting to
> > re-join the ensemble (it should join as a follower if you have a leader
> > already elected, given that its xid is behind the existing leader.) Hard
> > to tell though without the logs.
> >
> > You might also be seeing the initLimit exceeded. Is the data you are
> > storing in ZK large? Or perhaps network connectivity is slow?
> >
> > http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
> >
> > Again, the logs would give some insight on this.
> >
> > Patrick
> >
> > Nick Bailey wrote:
> >>
> >> We are running zookeeper 3.1.0
> >>
> >> Recently we noticed the cpu usage on our machines becoming
> >> increasingly high and we believe the cause is
> >>
> >> https://issues.apache.org/jira/browse/ZOOKEEPER-427
> >>
> >> However our solution when we noticed the problem was to kill the
> >> zookeeper process and restart it.
> >>
> >> After doing that though it looks like the newly restarted zookeeper
> >> server is continually attempting to elect a leader even though one
> >> already exists.
> >>
> >> The process responds with 'imok' when asked, but the stat command
> >> returns 'ZooKeeperServer not running'.
> >>
> >> I believe that killing the current leader should trigger all servers
> >> to do an election and solve the problem, but I'm not sure. Should
> >> that be the course of action in this situation?
> >>
> >> Also we have 12 servers, but 5 are currently not running according to
> >> stat. So I guess this isn't a problem unless we lose another one.
> >> We have plans to upgrade zookeeper to solve the cpu issue but haven't
> >> been able to do that yet.
> >>
> >> Any help appreciated, Nick Bailey
> >>
> >
>

--0016e6d7e6817a6509047cffda66--
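
For anyone reading this thread later, the quorum arithmetic discussed above can be sketched in a few lines of Python. This is only an illustration and assumes plain majority quorums (no weights or groups); the function names quorum_size, has_quorum and failures_tolerated are made up for this example and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that is a strict majority of the ensemble."""
    return ensemble_size // 2 + 1


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if the running servers can still form a majority quorum."""
    return running >= quorum_size(ensemble_size)


def failures_tolerated(ensemble_size: int, running: int) -> int:
    """Further failures the ensemble can absorb before losing quorum."""
    return max(running - quorum_size(ensemble_size), 0)


if __name__ == "__main__":
    # (ensemble size, servers currently running) for the cases in the thread:
    # 3 with one down, the recommended 5 with one down, and 12 with 5 down.
    for total, up in [(3, 2), (5, 4), (12, 7)]:
        print(
            f"{up}/{total} running: quorum={has_quorum(total, up)}, "
            f"further failures tolerated={failures_tolerated(total, up)}"
        )
```

Under these assumptions, 2 of 3 and 7 of 12 both still have a quorum but cannot absorb another failure, while 4 of 5 can still lose one more server, which matches the recommendation quoted above.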