Date: Wed, 12 Feb 2014 11:14:52 -0800
Subject: Re: zookeeper quorum falling apart with continuous leader election
From: Deepak Jagtap
To: user@zookeeper.apache.org

Thanks Flavio for the quick response!
Yes, the logs for the third node, the one with the bad disk, are missing.

Hi Kishore,

The zk client completely lost its connection to zookeeper, so most probably
there are no writes on the system.
But on node 2 (the leader) I saw some log messages indicating successful
session establishment from a client, so the possibility of writes on the
server cannot be ruled out.

Thanks & Regards,
Deepak

On Wed, Feb 12, 2014 at 9:56 AM, kishore g wrote:

> Just for my understanding, what do these messages indicate? Also, I see
> that n.zxid keeps incrementing; does it mean the system is accepting
> writes?
>
> node 2
> 2014-02-10 19:49:06,860 [myid:235] - INFO [WorkerReceiver[myid=235]:FastLeaderElection@594] - Notification: 234 (n.leader), 0x4afe00000001 (n.zxid), 0x4b00 (n.round), LOOKING (n.state), 234 (n.sid), 0x4aff (n.peerEPoch), LOOKING (my state)1 (n.config version)
>
> node 1
> 2014-02-10 19:42:02,936 [myid:234] - INFO [WorkerReceiver[myid=234]:FastLeaderElection@594] - Notification: 234 (n.leader), 0x4afa00000001 (n.zxid), 0x4afc (n.round), LOOKING (n.state), 234 (n.sid), 0x4afb (n.peerEPoch), LOOKING (my state)1 (n.config version)
>
> On Wed, Feb 12, 2014 at 6:47 AM, Flavio Junqueira wrote:
>
> > It sounds like LE is completing periodically, but the servers are not
> > able to complete the synchronization step. We are also getting this
> > connection refused exception when the follower is trying to connect.
> > This is what I spotted for the follower:
> >
> > 2014-02-10 18:54:04,414 [myid:234] - INFO [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Follower@65] - FOLLOWING - LEADER ELECTION TOOK - 1
> > 2014-02-10 18:54:04,415 [myid:234] - WARN [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Learner@239] - Unexpected exception, tries=0, connecting to 10.0.57.235/10.0.57.235:2888
> > java.net.ConnectException: Connection refused
> >         at java.net.PlainSocketImpl.socketConnect(Native Method)
> >         at java.net.AbstractPlainSocketImpl.doConnect(Unknown Source)
> >         at java.net.AbstractPlainSocketImpl.connectToAddress(Unknown Source)
> >         at java.net.AbstractPlainSocketImpl.connect(Unknown Source)
> >         at java.net.SocksSocketImpl.connect(Unknown Source)
> >         at java.net.Socket.connect(Unknown Source)
> >         at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:231)
> >         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:73)
> >         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:936)
> >
> > and this:
> >
> > 2014-02-10 18:55:05,508 [myid:234] - INFO [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Learner@442] - Learner received UPTODATE message
> > 2014-02-10 18:55:05,508 [myid:234] - WARN [QuorumPeer[myid=234]/0:0:0:0:0:0:0:0:2181:Follower@92] - Exception when following the leader
> > java.net.SocketException: Broken pipe
> >         at java.net.SocketOutputStream.socketWrite0(Native Method)
> >         at java.net.SocketOutputStream.socketWrite(Unknown Source)
> >         at java.net.SocketOutputStream.write(Unknown Source)
> >         at java.io.BufferedOutputStream.flushBuffer(Unknown Source)
> >         at java.io.BufferedOutputStream.flush(Unknown Source)
> >         at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:145)
> >         at org.apache.zookeeper.server.quorum.Learner.syncWithLeader(Learner.java:477)
> >         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
> >         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:936)
> >
> > On the leader side, we have this:
> >
> > 2014-02-10 19:48:03,705 [myid:235] - INFO [LearnerHandler-/10.0.57.234:58829:LearnerHandler@328] - Synchronizing with Follower sid: 234 maxCommittedLog=0x4afe00000001 minCommittedLog=0x4afe00000001 peerLastZxid=0x4afd00000001
> > 2014-02-10 19:48:03,705 [myid:235] - WARN [LearnerHandler-/10.0.57.234:58829:LearnerHandler@389] - Unhandled proposal scenario
> > 2014-02-10 19:48:03,705 [myid:235] - INFO [LearnerHandler-/10.0.57.234:58829:LearnerHandler@404] - Sending SNAP
> > 2014-02-10 19:48:03,705 [myid:235] - INFO [LearnerHandler-/10.0.57.234:58829:LearnerHandler@435] - Sending snapshot last zxid of peer is 0x4afd00000001 zxid of leader is 0x4aff00000000sent zxid of db as 0x4afe00000001
> > 2014-02-10 19:48:03,724 [myid:235] - WARN [LearnerHandler-/10.0.57.234:58829:Leader@698] - Commiting zxid 0x4aff00000000 from /10.0.57.235:2888 not first!
> >
> > There are a couple of odd warnings there. Just to confirm, the node
> > missing in the logs is the one with the bad disk, right?
> >
> > -Flavio
> >
> > On 12 Feb 2014, at 02:26, Deepak Jagtap wrote:
> >
> > > Hi,
> > >
> > > I have a 3-node zookeeper 3.5.0.1458648 quorum on my setup.
> > > We came across a situation where one of the zk servers in the
> > > cluster went down due to a bad disk.
> > > We observed that leader election keeps running in a loop (starts,
> > > completes, and starts again). The loop repeats every couple of
> > > minutes.
> > > Even restarting the zookeeper server on both nodes doesn't help in
> > > recovering from this loop.
> > > The network connection looks fine though, as I could telnet to the
> > > leader election port and ssh from one node to the other.
> > > The zookeeper client on each node is using "127.0.0.1:2181" as the
> > > quorum string for connecting to the server, therefore if the local
> > > zookeeper server is down the client app is dead.
> > >
> > > I have uploaded zookeeper.log for both nodes at the following link:
> > > https://dl.dropboxusercontent.com/u/36429721/zkSupportLog.tar.gz
> > >
> > > Any idea what might be wrong with the quorum? Please note that
> > > restarting the zookeeper server on both nodes doesn't help to
> > > recover from this situation.
> > >
> > > Thanks & Regards,
> > > Deepak
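
A note on the n.zxid question raised in the thread: a ZooKeeper zxid is a 64-bit value whose high 32 bits are the epoch and whose low 32 bits are a counter within that epoch. The sketch below splits the zxids quoted in the logs into those two halves. The function name decode_zxid and the labels are made up for illustration and are not part of ZooKeeper.

```python
from typing import Tuple


def decode_zxid(zxid: int) -> Tuple[int, int]:
    """Split a 64-bit zxid into (epoch, counter).

    The epoch is the high 32 bits; the counter is the low 32 bits.
    """
    epoch = zxid >> 32
    counter = zxid & 0xFFFFFFFF
    return epoch, counter


# zxids copied from the log lines quoted in the thread; labels are illustrative.
samples = [
    ("node 1 notification", 0x4AFA00000001),
    ("follower peerLastZxid", 0x4AFD00000001),
    ("node 2 notification", 0x4AFE00000001),
    ("leader zxid", 0x4AFF00000000),
]

for label, zxid in samples:
    epoch, counter = decode_zxid(zxid)
    print(f"{label}: epoch=0x{epoch:x} counter={counter}")
```

In these samples the change is in the epoch half (0x4afa up to 0x4aff) while the counter stays at 0 or 1. That would be consistent with new epochs being started by repeated elections rather than with a steady stream of client writes, although it does not by itself rule out an occasional write.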