zookeeper-user mailing list archives

From "Ibrahim El-sanosi (PGR)" <i.s.el-san...@newcastle.ac.uk>
Subject RE: Failover when one node fails to write on the disk?
Date Thu, 08 Jan 2015 12:06:43 GMT
Hi Benjamin,

The reason why Node2 and Node3 stop serving is that ZooKeeper must have a quorum of servers
to make progress, where a quorum is a majority of the ensemble. With three servers, at least
two must be up and connected to each other, so the ensemble can survive the loss of one
server. In your scenario, though, the leader log appears to show that it closed the
connections to both followers, which left it without a quorum (majority), so ZooKeeper
stopped serving requests.
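
To make the majority rule concrete, here is a small sketch in plain Python. It is not ZooKeeper code, and the function names are made up for illustration; it only shows the arithmetic, assuming the default majority quorum.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """How many servers can be lost while a majority still remains."""
    return ensemble_size - quorum_size(ensemble_size)


def has_quorum(ensemble_size: int, connected: int) -> bool:
    """connected counts the leader itself plus the followers still in sync with it."""
    return connected >= quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (1, 3, 5):
        print(n, "servers: quorum", quorum_size(n), "tolerates", tolerated_failures(n))
    # A leader of a 3-server ensemble that has dropped both followers:
    print("leader alone has quorum:", has_quorum(3, 1))
```

So a leader that still has one follower keeps its quorum in a three-server ensemble, while a leader that has dropped both followers does not.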

Ibrahim

-----Original Message-----
From: Benjamin Jaton [mailto:benjamin.jaton@gmail.com] 
Sent: Wednesday, January 07, 2015 10:34 PM
To: user@zookeeper.apache.org
Subject: Failover when one node fails to write on the disk?

Using ZooKeeper 3.4.5, I came across a situation where all 3 ZooKeeper servers suddenly stopped.

What I see is that NODE1 fails to write to the disk, so it makes sense to me that NODE1 stops.

But it is unclear why NODE2 and NODE3 would stop running as well. I have a hard time making
sense of the log messages.

Any insight would be greatly appreciated!

See log extracts below:

NODE1:

-- no log for several days before this --
2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
2015-01-04 16:18:22,380 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
        at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
2015-01-04 16:18:23,384 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2015-01-04 16:18:23,492 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2015-01-04 16:18:24,060 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running


NODE2:

-- no log for several days before this --
2015-01-04 16:18:21,899 [myid:3] - WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
java.io.EOFException
        at java.io.DataInputStream.readInt(DataInputStream.java:392)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
        at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
2015-01-04 16:18:22,760 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2015-01-04 16:18:22,801 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2015-01-04 16:18:22,886 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running


NODE3 (leader):

-- no log for several days before this --
2015-01-04 16:18:21,897 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
2015-01-04 16:18:21,898 [myid:2] - WARN  [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE /204.53.107.249:43402 ********
2015-01-04 16:18:21,905 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
2015-01-04 16:18:21,907 [myid:2] - WARN  [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE /204.53.107.247:45953 ********
2015-01-04 16:18:21,918 [myid:2] - WARN  [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring unexpected exception
java.lang.InterruptedException
        at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
        at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
        at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
        at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
        at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
2015-01-04 16:18:23,003 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2015-01-04 16:18:23,007 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
2015-01-04 16:18:23,115 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running


Thanks!
Benjamin