From: Adam Rosien
To: zookeeper-user@hadoop.apache.org
Subject: Re: Killing a zookeeper server
Date: Tue, 12 Jan 2010 15:14:01 -0800

I have a related question: what's the behavior of a cluster of 3 when
one is down? I've tried it and a leader is elected, but are there any
other caveats for this situation?

.. Adam

On Tue, Jan 12, 2010 at 2:40 PM, Patrick Hunt wrote:
> 12 servers? That's a lot. If you don't mind my asking, why so many? Typically
> we recommend 5 - that way you can have one down for maintenance and still
> have a failure that doesn't bring down the cluster.
>
> The "electing a leader" is probably the restarted machine attempting to
> re-join the ensemble (it should join as a follower if you have a leader
> already elected, given that its zxid is behind the existing leader's). Hard to
> tell though without the logs.
>
> You might also be seeing the initLimit exceeded. Is the data you are storing
> in ZK large? Or perhaps network connectivity is slow?
> http://hadoop.apache.org/zookeeper/docs/current/zookeeperAdmin.html#sc_clusterOptions
> Again, the logs would give some insight on this.
>
> Patrick
>
> Nick Bailey wrote:
>>
>> We are running zookeeper 3.1.0
>>
>> Recently we noticed the cpu usage on our machines becoming
>> increasingly high and we believe the cause is
>>
>> https://issues.apache.org/jira/browse/ZOOKEEPER-427
>>
>> However our solution when we noticed the problem was to kill the
>> zookeeper process and restart it.
>>
>> After doing that though it looks like the newly restarted zookeeper
>> server is continually attempting to elect a leader even though one
>> already exists.
>>
>> The process responds with 'imok' when asked, but the stat command
>> returns 'ZooKeeperServer not running'.
>>
>> I believe that killing the current leader should trigger all servers
>> to do an election and solve the problem, but I'm not sure. Should
>> that be the course of action in this situation?
>>
>> Also we have 12 servers, but 5 are currently not running according to
>> stat. So I guess this isn't a problem unless we lose another one.
>> We have plans to upgrade zookeeper to solve the cpu issue but haven't
>> been able to do that yet.
>>
>> Any help appreciated, Nick Bailey
>>
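
The ensemble sizes mentioned in this thread (3, 5, and 12 servers) all come down to the same majority-quorum arithmetic: the ensemble stays available only while a strict majority of its configured servers is running. The snippet below is a rough sketch of that arithmetic only. It does not talk to ZooKeeper, and the function names (quorum_size, tolerated_failures, has_quorum) are hypothetical helpers made up for illustration.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of the configured ensemble."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, servers_down: int) -> bool:
    """True if the servers still running form a strict majority."""
    return ensemble_size - servers_down >= quorum_size(ensemble_size)


if __name__ == "__main__":
    # Ensemble sizes taken from the thread: 3 (Adam), 5 (Patrick), 12 (Nick).
    for n in (3, 5, 12):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerates {tolerated_failures(n)} down")
```

Under this arithmetic, a 3-server ensemble with one server down still has a quorum of 2 but cannot survive a second failure, and a 12-server ensemble with 5 down is sitting exactly at its quorum of 7, which matches the "unless we lose another one" observation above. Note also that 12 servers tolerate no more failures than 11 would, which is one reason odd ensemble sizes are usually preferred.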