Date: Tue, 12 Jan 2010 14:40:23 -0800
From: Patrick Hunt
To: zookeeper-user@hadoop.apache.org, nicholas.bailey@rackspace.com
Subject: Re: Killing a zookeeper server

12 servers?
That's a lot; if you don't mind my asking, why so many? Typically we recommend 5 - that way you can have one server down for maintenance and still survive an unexpected failure without bringing down the cluster.

The "electing a leader" is probably the restarted machine attempting to re-join the ensemble (it should join as a follower if you have a leader already elected, given that its zxid is behind the existing leader's). Hard to tell, though, without the logs.

You might also be seeing initLimit exceeded. Is the data you are storing in ZK large? Or perhaps network connectivity is slow?
http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions

Again, the logs would give some insight on this.

Patrick

Nick Bailey wrote:
> We are running zookeeper 3.1.0
>
> Recently we noticed the CPU usage on our machines becoming
> increasingly high and we believe the cause is
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>
> However our solution when we noticed the problem was to kill the
> zookeeper process and restart it.
>
> After doing that though it looks like the newly restarted zookeeper
> server is continually attempting to elect a leader even though one
> already exists.
>
> The process responds with 'imok' when asked, but the stat command
> returns 'ZooKeeperServer not running'.
>
> I believe that killing the current leader should trigger all servers
> to do an election and solve the problem, but I'm not sure. Should
> that be the course of action in this situation?
>
> Also we have 12 servers, but 5 are currently not running according to
> stat. So I guess this isn't a problem unless we lose another one.
> We have plans to upgrade zookeeper to solve the CPU issue but haven't
> been able to do that yet.
>
> Any help appreciated, Nick Bailey
>
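
The ensemble-size arithmetic in this thread (5 servers recommended; 12 configured with 5 down) can be sketched as follows. This is only an illustration, assuming the usual rule that an ensemble stays available while a strict majority of its configured servers is up. The function names are hypothetical helpers, not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    if ensemble_size < 1:
        raise ValueError("ensemble_size must be at least 1")
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Number of servers that can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, running: int) -> bool:
    """True if the running servers still form a strict majority."""
    return running >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Ensemble sizes mentioned in the thread, plus 11 for comparison.
    for n in (5, 11, 12):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} down")
    # 12 configured, 5 down -> 7 running; one more loss leaves 6.
    print("12 servers, 7 running:", has_quorum(12, 7))
    print("12 servers, 6 running:", has_quorum(12, 6))
```

Under that assumption, a 5-server ensemble tolerates 2 servers down (one for maintenance plus one unexpected failure). A 12-server ensemble with 5 down is at exactly its quorum of 7, so losing one more would cost it the majority. An even-sized ensemble of 12 tolerates no more failures than one of 11.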