Message-ID: <4B5DFFE6.8080802@apache.org>
Date: Mon, 25 Jan 2010 12:32:38 -0800
From: Patrick Hunt
To: zookeeper-user@hadoop.apache.org, Jean-Daniel Cryans
Subject: Re: Killing a zookeeper server
References: <1263333167.308920598@192.168.2.230> <4B4CFA57.2030008@apache.org> <1263337419.674422615@192.168.2.230> <1263339847.100820837@192.168.2.230> <4B4D1FFA.5050108@apache.org> <80BB3F00-890B-46E8-89CB-A5706A2A0522@mailtrust.com>
 <31a243e71001251032v65d09474xf7aafb9a1213cece@mail.gmail.com> <4B5DE7A6.9040204@apache.org> <31a243e71001251156l4a1644bbkd120893de920ac66@mail.gmail.com>
In-Reply-To: <31a243e71001251156l4a1644bbkd120893de920ac66@mail.gmail.com>

According to the log for 222, it can't open a connection to the election
port (3888) on any of the other servers. This seems very unusual. Can you
verify that there's connectivity on that port between 222 and all the
other servers?

Also, can you re-run netstat with the -a option? That way we can see the
listen sockets (netstat omits them by default). It would be great if you
could send the netstat output for all 5 servers.

Thanks,

Patrick

Jean-Daniel Cryans wrote:
> Everything is here http://people.apache.org/~jdcryans/zk_election_bug.tar.gz
>
> The server we are trying to start is sv4borg222 (myid is 2) and we
> started it around 10:03:21
>
> Thx!
>
> J-D
>
> On Mon, Jan 25, 2010 at 10:49 AM, Patrick Hunt wrote:
>> 1) Capture the logs from all 5 servers
>> 2) give the config for the "down" server, also indicate what its
>> server id is.
>> 3) if possible it would be interesting to see the netstat information
>> from 2 of the servers - the one that's down and one or more of the
>> others.
>>
>> Patrick
>>
>> Jean-Daniel Cryans wrote:
>>> I believe we've just hit the same problem with zk-3.2.1
>>>
>>> For some reason a machine crashed and it was part of our quorum of 5
>>> servers.
>>> When we try to restart it, it does this (I replaced
>>> hostname and IP):
>>>
>>> 2010-01-25 10:25:06,469 WARN
>>> org.apache.zookeeper.server.quorum.QuorumCnxManager: Cannot open
>>> channel to 1 at election address somehost1/someip1:3888
>>> java.net.ConnectException: Connection refused
>>>     at sun.nio.ch.Net.connect(Native Method)
>>>     at sun.nio.ch.SocketChannelImpl.connect(SocketChannelImpl.java:507)
>>>     at java.nio.channels.SocketChannel.open(SocketChannel.java:146)
>>>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectOne(QuorumCnxManager.java:323)
>>>     at org.apache.zookeeper.server.quorum.QuorumCnxManager.connectAll(QuorumCnxManager.java:356)
>>>     at org.apache.zookeeper.server.quorum.FastLeaderElection.lookForLeader(FastLeaderElection.java:603)
>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:488)
>>>
>>> It has been like that for almost 20 minutes now, trying every other
>>> server in the quorum on different channels. ruok says imok but all
>>> other commands say that ZK server isn't running. I don't believe that
>>> 3.2.2 will help unless ZK-547 does more than it seems to.
>>>
>>> Anything else I should look at?
>>>
>>> Thx!
>>>
>>> J-D
>>>
>>> On Wed, Jan 13, 2010 at 11:19 AM, Nick Bailey wrote:
>>>> So the solution for us was to just nuke zookeeper and restart
>>>> everywhere. We will also be upgrading soon.
>>>>
>>>> To answer your question, yes I believe all the servers were running
>>>> normally except for the fact that they were experiencing high CPU
>>>> usage. As we began to see some CPU alerts I started restarting some
>>>> of the servers.
>>>>
>>>> It was then that we noticed that they were not actually running
>>>> according to 'stat'.
>>>>
>>>> I still have the log from one server with a debug level and the rest
>>>> with a warn level. If you would like to see any of these and analyze
>>>> them just let me know.
>>>>
>>>> Thanks for the help,
>>>> Nick Bailey
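
For reference, the connectivity check requested at the top of this
message (can 222 reach election port 3888 on each of the other servers?)
could be scripted roughly as follows. This is only a sketch, assuming
Python 3 is available on the hosts. The function names are made up for
illustration, 3888 is the election port named above, and the client port
used for commands such as ruok and stat is not stated anywhere in this
thread, so it has to be passed in explicitly.

```python
import socket


def can_connect(host, port, timeout=3.0):
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers "Connection refused", timeouts and name-resolution failures.
        return False


def check_election_ports(hosts, port=3888):
    """Map each host to whether its election port accepts a connection."""
    return {host: can_connect(host, port) for host in hosts}


def four_letter_word(host, port, command, timeout=3.0):
    """Send a four-letter command (e.g. 'ruok' or 'stat') and return the reply.

    Assumes the server closes the connection after replying, so the read
    loop below ends at end-of-stream.
    """
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(command.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:
                break
            chunks.append(data)
    return b"".join(chunks).decode("ascii", errors="replace")
```

Run from sv4borg222, check_election_ports() with the other four quorum
members (their host names are not given in this thread) should show
whether "Connection refused" is specific to port 3888. A False for every
host would match the log above and point at the listeners or the network
rather than at the election logic itself.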