zookeeper-user mailing list archives

From Jeff Potter <jpotter-zookee...@codepuppy.com>
Subject Re: Failover when one node fails to write on the disk?
Date Thu, 08 Jan 2015 13:17:22 GMT

Ibrahim, this doesn’t sound right to me: with three nodes, one can stop a single node
and a majority still survives. I’ve confirmed this in our 3-node ZK setups in the past.
Five nodes is the recommended size, as it allows one to take a single node offline for maintenance
and still survive another node failing, but that’s not what Benjamin’s email suggests happened.
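
To make the majority arithmetic concrete, here is a minimal sketch in Python. The function names are made up for illustration and are not part of ZooKeeper; the only assumption is that a quorum is a strict majority of the configured ensemble.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest strict majority of an ensemble of the given size."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


# Compare the two ensemble sizes discussed above.
for n in (3, 5):
    print(f"{n} nodes: quorum = {quorum_size(n)}, "
          f"can lose {tolerated_failures(n)}")
```

With 3 nodes the quorum is 2, so losing one node should still leave a working ensemble; with 5 nodes the quorum is 3, so two nodes can be down at once.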

Benjamin: is it possible there was a partition in your network, where node 2 and node 3 weren’t
able to communicate with each other?
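
One rough way to look for a partition is to run a TCP reachability check from each node against its peers. The sketch below is only an illustration: the host names are placeholders, and ports 2888 and 3888 are assumed from the sample configuration in the ZooKeeper docs, so substitute whatever the server.N lines in your zoo.cfg specify.

```python
import socket


def can_connect(host: str, port: int, timeout: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts, and name-resolution failures.
        return False


# Hypothetical peers and ports; replace with the values from your zoo.cfg.
PEERS = ["node2.example.com", "node3.example.com"]
PORTS = [2888, 3888]


def report(peers=PEERS, ports=PORTS) -> None:
    """Print the reachability of every peer/port pair from this machine."""
    for host in peers:
        for port in ports:
            state = "reachable" if can_connect(host, port) else "UNREACHABLE"
            print(f"{host}:{port} {state}")
```

Calling report() on each node gives a quick picture of who can reach whom. A clean result now would not rule out a transient partition at the time of the incident, and a port that nothing is listening on shows up as unreachable even when the network is fine.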

-Jeff


> On Jan 8, 2015, at 7:06 AM, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk> wrote:
> 
> Hi Benjamin,
> 
> The reason why Node 2 and Node 3 stop running is that ZooKeeper must have a quorum of
> servers to make progress. ZooKeeper needs at least 3 servers in order to run. In your scenario,
> you started with three servers, which is fine, but since one of the servers failed, ZooKeeper
> stopped running because it lacked a quorum (majority).
> 
> Ibrahim
> 
> -----Original Message-----
> From: Benjamin Jaton [mailto:benjamin.jaton@gmail.com] 
> Sent: Wednesday, January 07, 2015 10:34 PM
> To: user@zookeeper.apache.org
> Subject: Failover when one node fails to write on the disk?
> 
> Using ZooKeeper 3.4.5, I came across a situation where all 3 ZooKeeper nodes suddenly stopped.
> 
> What I see is that NODE1 fails to write to the disk, so it makes sense to me that NODE1
> stops.
> 
> But it is unclear why NODE2 and NODE3 would stop running as well. I have a hard time
> making sense of the log messages.
> 
> Any insight would be greatly appreciated!
> 
> see log extracts below:
> 
> NODE1:
> 
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] - fsync-ing the
write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency.
See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN
> [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the
leader java.io.EOFException
>        at java.io.DataInputStream.readInt(DataInputStream.java:392)
>        at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>        at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>        at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>        at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>        at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>        at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 
> 
> NODE2:
> 
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN
> [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the
leader java.io.EOFException
>        at java.io.DataInputStream.readInt(DataInputStream.java:392)
>        at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>        at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>        at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>        at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>        at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>        at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 
> 
> NODE3 (leader):
> 
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to
peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN
> [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE
> /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to
peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE
> /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN
> [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring unexpected exception
java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>        at
> java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>        at
> java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>        at
> org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>        at
> org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,007 [myid:2] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,115 [myid:2] - WARN  [NIOServerCxn.Factory:
> 0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due
to java.io.IOException: ZooKeeperServer not running
> 
> 
> Thanks!
> Benjamin
