Date: Thu, 8 Jan 2015 13:36:07 +0100
Subject: Re: Failover when one node fails to write on the disk?
From: Sékine Coulibaly
To: "user@zookeeper.apache.org"

Ibrahim,

So, the minimum number of zk nodes is 5, not three as is commonly thought, right? With 5 nodes, one can tolerate one or two node failures.

I did not expect a 3-node cluster to stop with one node failing either, since there is still a majority... Hmm, I will check this!

On Thursday, 8 January 2015, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk> wrote:

> Hi Benjamin,
>
> The reason why Node2 and Node3 stop running is that ZooKeeper must have a
> quorum of servers to make progress. ZooKeeper needs at least 3 servers in
> order to run. In your scenario, you started with three servers, which is
> fine, but since one of the servers failed, ZooKeeper stopped running because
> it lacks a quorum (majority).
>
> Ibrahim
>
> -----Original Message-----
> From: Benjamin Jaton [mailto:benjamin.jaton@gmail.com]
> Sent: Wednesday, January 07, 2015 10:34 PM
> To: user@zookeeper.apache.org
> Subject: Failover when one node fails to write on the disk?
>
> Using ZooKeeper 3.4.5, I came across a situation where all 3 ZooKeeper
> servers suddenly stopped.
>
> What I see is that NODE1 fails to write to the disk, so it makes sense to
> me that NODE1 stops.
>
> But it is unclear why NODE2 and NODE3 would stop running as well; I have a
> hard time making sense of the log messages.
>
> Any insight would be greatly appreciated!
>
> See the log extracts below:
>
> NODE1:
>
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
> NODE2:
>
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader java.io.EOFException
>     at java.io.DataInputStream.readInt(DataInputStream.java:392)
>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
> NODE3 (leader):
>
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring unexpected exception java.lang.InterruptedException
>     at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>     at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>     at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>     at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>     at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,007 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,115 [myid:2] - WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
> Thanks!
> Benjamin
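
For reference, the majority arithmetic this thread turns on can be sketched as below. This is only an illustration, not ZooKeeper code: the helper names `quorum_size` and `tolerated_failures` are hypothetical, and the sketch assumes plain majority quorums (no weighted or hierarchical quorum configuration). With n voting servers, a strict majority is floor(n/2) + 1, and the ensemble can lose the remaining servers and still form a quorum.

```python
def quorum_size(n: int) -> int:
    """Smallest strict majority of n voting servers (assumes majority quorums)."""
    if n < 1:
        raise ValueError("ensemble must have at least one server")
    return n // 2 + 1


def tolerated_failures(n: int) -> int:
    """Number of servers that can fail while a majority is still available."""
    return n - quorum_size(n)


# Ensemble sizes mentioned in the thread: 3 and 5 servers.
for n in (3, 5):
    print(f"{n} servers: quorum={quorum_size(n)}, tolerated failures={tolerated_failures(n)}")
```

Under this arithmetic, a 3-server ensemble still has a majority (2 of 3) after a single server fails, and a 5-server ensemble after two fail.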