Date: Mon, 25 Jan 2010 11:56:07 -0800
Message-ID: <31a243e71001251156l4a1644bbkd120893de920ac66@mail.gmail.com>
In-Reply-To: <4B5DE7A6.9040204@apache.org>
Subject: Re: Killing a zookeeper server
From: Jean-Daniel Cryans
To: zookeeper-user@hadoop.apache.org

Everything is here: http://people.apache.org/~jdcryans/zk_election_bug.tar.gz

The server we are trying to start is sv4borg222 (myid is 2), and we
started it around 10:03:21.

Thx!

J-D

On Mon, Jan 25, 2010 at 10:49 AM, Patrick Hunt wrote:
> 1) Capture the logs from all 5 servers.
> 2) Give the config for the "down" server, and also indicate what its
> server id is.
> 3) If possible, it would be interesting to see the netstat information
> from 2 of the servers - the one that's down and one or more of the others.
>
> Patrick
>
> Jean-Daniel Cryans wrote:
>>
>> I believe we've just hit the same problem with zk-3.2.1
>>
>> For some reason a machine crashed and it was part of our quorum of 5
>> servers.
>> When we try to restart it, it does this (I replaced
>> hostname and IP):
>>
>> 2010-01-25 10:25:06,469 WARN org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open channel to 1 at election address somehost1/someip1:3888
>> java.net.ConnectException: Connection refused
>>        at sun.nio.ch.Net.connect(Native Method)
>>        at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>>        at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>>        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
>>        at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
>>        at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
>>        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
>>
>> It has been like that for almost 20 minutes now, trying every other
>> server in the quorum on different channels. ruok says imok but all
>> other commands say that the ZK server isn't running. I don't believe that
>> 3.2.2 will help unless ZK-547 does more than it seems to.
>>
>> Anything else I should look at?
>>
>> Thx!
>>
>> J-D
>>
>> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey wrote:
>>>
>>> So the solution for us was to just nuke zookeeper and restart everywhere.
>>> We will also be upgrading soon.
>>>
>>> To answer your question, yes, I believe all the servers were running
>>> normally except for the fact that they were experiencing high CPU usage.
>>> As we began to see some CPU alerts, I started restarting some of the
>>> servers.
>>>
>>> It was then that we noticed that they were not actually running
>>> according to 'stat'.
>>>
>>> I still have the log from one server at debug level and the rest at
>>> warn level. If you would like to see any of these and analyze them, just
>>> let me know.
>>>
>>> Thanks for the help,
>>> Nick Bailey
>>>
>
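
For reference, the symptom reported in this thread (ruok answers imok while stat says the server is not running) can be checked from a script. What follows is a minimal Python sketch, not anything shipped with ZooKeeper. It assumes the four-letter commands are sent as plain text to the client port (2181 is only the default; the real value comes from your config) and that the server closes the connection after replying. It also assumes that a peer that is actually serving includes a "Mode:" line in its stat reply. The function names are made up for illustration.

```python
import socket


def four_letter_word(host, port, cmd, timeout=5.0):
    """Send a four-letter command such as "ruok" or "stat" and return the reply.

    Assumes the server writes its reply and then closes the connection,
    so we simply read until EOF.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def is_serving(ruok_reply, stat_reply):
    """True only if the process is alive AND stat reports a quorum role.

    Assumption: a serving peer prints a "Mode:" line in its stat output;
    a peer stuck in leader election does not.
    """
    return ruok_reply.strip() == "imok" and "Mode:" in stat_reply


def check(host, port=2181):
    """Hypothetical helper: query one server and report whether it serves."""
    ruok = four_letter_word(host, port, "ruok")
    stat = four_letter_word(host, port, "stat")
    return is_serving(ruok, stat)
```

Calling check("sv4borg222") against the server discussed above would be expected to return False while it is stuck in election, even though ruok alone looks healthy. That is the reason the sketch checks both replies instead of relying on ruok.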