zookeeper-user mailing list archives

From: "Ibrahim El-sanosi (PGR)" <i.s.el-san...@newcastle.ac.uk>
Subject: RE: Failover when one node fails to write on the disk?
Date: Thu, 08 Jan 2015 17:54:20 GMT
Yes, correct.

Ibrahim

-----Original Message-----
From: Sékine Coulibaly [mailto:scoulibaly@gmail.com] 
Sent: Thursday, January 08, 2015 12:36 PM
To: user@zookeeper.apache.org
Subject: Re: Failover when one node fails to write on the disk?

Ibrahim,
So the minimum number of ZooKeeper nodes is five, not three as is commonly
thought, right?
With five nodes, the ensemble can survive one or two node failures.
I did not expect a three-node cluster to stop when one node fails either,
since a majority still remains...
Hmm, I will check this!
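
As a side note on the majority arithmetic mentioned above: assuming ZooKeeper's default majority quorum, an ensemble of n voting servers can make progress while more than half of them are up. The sketch below works this out for 3 and 5 servers; the function names are illustrative and are not part of any ZooKeeper API.

```python
def quorum_size(ensemble_size: int) -> int:
    """Smallest number of servers that forms a strict majority."""
    return ensemble_size // 2 + 1


def tolerated_failures(ensemble_size: int) -> int:
    """Servers that can be down while a majority is still up."""
    return ensemble_size - quorum_size(ensemble_size)


if __name__ == "__main__":
    for n in (3, 5):
        print(f"servers={n} quorum={quorum_size(n)} "
              f"tolerated_failures={tolerated_failures(n)}")
```

Under that assumption, a 3-server ensemble keeps a quorum (2 of 3) after a single failure, and a 5-server ensemble tolerates two.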

On Thursday, 8 January 2015, Ibrahim El-sanosi (PGR) <i.s.el-sanosi@newcastle.ac.uk>
wrote:

> Hi Benjamin,
>
> The reason Node 2 and Node 3 stopped running is that ZooKeeper must
> have a quorum of servers to make progress. ZooKeeper needs at least 3
> servers in order to run. In your scenario, you started with three
> servers, which is fine, but since one of the servers failed,
> ZooKeeper stopped running because it lacked a quorum (majority).
>
> Ibrahim
>
> -----Original Message-----
> From: Benjamin Jaton [mailto:benjamin.jaton@gmail.com]
> Sent: Wednesday, January 07, 2015 10:34 PM
> To: user@zookeeper.apache.org
> Subject: Failover when one node fails to write on the disk?
>
> Using ZooKeeper 3.4.5, I came across a situation where all 3
> ZooKeeper nodes suddenly stopped.
>
> What I see is that NODE1 fails to write to the disk, so it makes sense
> to me that NODE1 stops.
>
> But it is unclear why NODE2 and NODE3 would stop running as well. I
> have a hard time making sense of the log messages.
>
> Any insight would be greatly appreciated!
>
> See the log extracts below:
>
> NODE1:
>
> -- no log for several days before this --
> 2015-01-04 16:18:22,259 [myid:1] - WARN  [SyncThread:1:FileTxnLog@321] - fsync-ing the write ahead log in SyncThread:1 took 11024ms which will adversely effect operation latency. See the ZooKeeper troubleshooting guide
> 2015-01-04 16:18:22,380 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:23,384 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,492 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:24,060 [myid:1] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
>
> NODE2:
>
> -- no log for several days before this --
> 2015-01-04 16:18:21,899 [myid:3] - WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
> java.io.EOFException
>         at java.io.DataInputStream.readInt(DataInputStream.java:392)
>         at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>         at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>         at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:103)
>         at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>         at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>         at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:786)
> 2015-01-04 16:18:22,760 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,801 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:22,886 [myid:3] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
>
> NODE3 (leader):
>
> -- no log for several days before this --
> 2015-01-04 16:18:21,897 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,898 [myid:2] - WARN  [LearnerHandler-/204.53.107.249:43402:LearnerHandler@646] - ******* GOODBYE /204.53.107.249:43402 ********
> 2015-01-04 16:18:21,905 [myid:2] - WARN  [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:LearnerHandler@687] - Closing connection to peer due to transaction timeout.
> 2015-01-04 16:18:21,907 [myid:2] - WARN  [LearnerHandler-/204.53.107.247:45953:LearnerHandler@646] - ******* GOODBYE /204.53.107.247:45953 ********
> 2015-01-04 16:18:21,918 [myid:2] - WARN  [LearnerHandler-/204.53.107.247:45953:LearnerHandler@658] - Ignoring unexpected exception
> java.lang.InterruptedException
>         at java.util.concurrent.locks.AbstractQueuedSynchronizer.acquireInterruptibly(AbstractQueuedSynchronizer.java:1219)
>         at java.util.concurrent.locks.ReentrantLock.lockInterruptibly(ReentrantLock.java:340)
>         at java.util.concurrent.LinkedBlockingQueue.put(LinkedBlockingQueue.java:338)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.shutdown(LearnerHandler.java:656)
>         at org.apache.zookeeper.server.quorum.LearnerHandler.run(LearnerHandler.java:649)
> 2015-01-04 16:18:23,003 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,007 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
> 2015-01-04 16:18:23,115 [myid:2] - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@362] - Exception causing close of session 0x0 due to java.io.IOException: ZooKeeperServer not running
>
>
> Thanks!
> Benjamin
>