Subject: Re: Failover when one node fails to write on the disk?
From: Jeff Potter
Date: Thu, 8 Jan 2015 08:17:22 -0500
To: user@zookeeper.apache.org

Ibrahim, this doesn’t sound right to me: with three nodes, one can stop a single node and a majority still survives. I’ve confirmed this in our 3-node ZK setups in the past. Five nodes is the recommended size, as it allows one to take a single node offline for maintenance and then survive another node failing, but that’s not what Benjamin’s email suggests happened.

Benjamin: is it possible there was a partition in your network, where node 2 and node 3 weren’t able to communicate with each other?

-Jeff

> On Jan 8, 2015, at 7:06 AM, Ibrahim El-sanosi (PGR) wrote:
>
> Hi Benjamin,
>
> The reason why Node 2 and Node 3 stopped running is that ZooKeeper must have a quorum of servers to make progress. ZooKeeper needs at least 3 servers in order to run. In your scenario, you started with three servers, which is fine, but since one of the servers failed, ZooKeeper stopped running because it lacked a quorum (majority).
>
> Ibrahim
>
> -----Original Message-----
> From: Benjamin Jaton [mailto:benjamin.jaton@gmail.com]
> Sent: Wednesday, January 07, 2015 10:34 PM
> To: user@zookeeper.apache.org
> Subject: Failover when one node fails to write on the disk?
>
> Using ZooKeeper 3.4.5, I came across a situation where all 3 ZooKeeper nodes suddenly stopped.
>
> What I see is that NODE1 fails to write to the disk, so it makes sense to me that NODE1 stops.
>
> But it is unclear why NODE2 and NODE3 would stop running as well. I have a hard time making sense of the log messages.
>
> Any insight would be greatly appreciated!
>
> see log extracts below:
>
> NODE1:
>
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
>
> NODE2:
>
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
>
> NODE3 (leader):
>
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring unexpected exception
> java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>     at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>     at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,007 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,115 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
>
> Thanks!
> Benjamin
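
P.S. For anyone following the quorum discussion in this thread, the arithmetic behind "a majority still survives" can be sketched as below. This is only an illustration, assuming the default majority quorum (not hierarchical or weighted quorums); the function names quorum_size and tolerated_failures are made up for this example and are not part of ZooKeeper.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority.

    Assumes a simple majority quorum; hypothetical helper, not a ZooKeeper API.
    """
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still reachable."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    # Compare the 3-node and 5-node ensembles discussed in the thread.
    for n in (3, 5):
        print(f"{n} servers: quorum={quorum_size(n)}, "
              f"tolerated failures={tolerated_failures(n)}")
```

Under that assumption, a 3-server ensemble keeps a quorum with one server down, and a 5-server ensemble keeps one with two down, which is why losing NODE1 alone would not be expected to stop the other two.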