zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: Follower drops out of quorum, can't reconnect
Date Wed, 10 May 2017 19:38:14 GMT
Have you tried increasing syncLimit?

====================
Jordan Zimmerman

> On May 10, 2017, at 9:29 PM, Ben Sherman <bensherman@gmail.com> wrote:
> 
> I have been chasing this problem for months and I can't figure out what is
> happening.
> 
> We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as well),
> and sometimes, a node gets a read timeout, drops all the connections and
> tries to re-establish itself to the quorum.  It can usually do this in a
> few seconds, but last night it took almost 15 minutes to reconnect.
> 
> These are 5 servers in AWS, and we've tried tuning the timeouts, but the
> are exceeding any reasonable timeout and still failing.
> 
> Any hints here?
> 
> 2017-05-10 11:21:34,386 [myid:5] - WARN
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> java.net.SocketTimeoutException: Read timed out
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>        at
> org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>        at
> org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>        at
> org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>        at
> org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>        at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
> 2017-05-10 11:21:34,387 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> java.lang.Exception: shutdown Follower
>        at
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>        at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
> 2017-05-10 11:21:34,387 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> 2017-05-10 11:21:34,388 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> socket connection for client /192.168.23.109:40666 which had sessionid
> 0x55bd553c076002d
> [snip .. zk closes all its connections]
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
> socket connection for client /192.168.23.88:65025 which had sessionid
> 0x55bd553c0760765
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
> Shutting down
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting
> down
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] -
> Shutting down
> 2017-05-10 11:21:34,391 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting
> down
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
> shutdown of request processor complete
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] -
> Shutting down
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop!
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited!
> 2017-05-10 11:21:34,392 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
> 2017-05-10 11:21:34,393 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  5, proposed zxid=0x4100013987
> 2017-05-10 11:21:34,393 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:34,393 [myid:5] - INFO
> [FollowerRequestProcessor:5:FollowerRequestProcessor@97] -
> FollowerRequestProcessor exited loop!
> 2017-05-10 11:21:34,593 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 400
> 2017-05-10 11:21:34,594 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:34,994 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 800
> 2017-05-10 11:21:34,994 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:35,794 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 1600
> 2017-05-10 11:21:35,795 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:37,395 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 3200
> 2017-05-10 11:21:37,395 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:40,595 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 6400
> 2017-05-10 11:21:40,596 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:46,996 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 12800
> 2017-05-10 11:21:46,996 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:21:59,796 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 25600
> 2017-05-10 11:21:59,797 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> ---
> MANY minutes later, we see things finally timeout, connections are
> re-established, and everything comes back:
> ---
> 2017-05-10 11:37:02,469 [myid:5] - WARN
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 3, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:02,469 [myid:5] - WARN
> [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 2, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:02,469 [myid:5] - WARN
> [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 4, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:02,471 [myid:5] - WARN
> [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:02,470 [myid:5] - WARN
> [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:02,471 [myid:5] - WARN
> [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:02,472 [myid:5] - WARN
> [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:02,472 [myid:5] - WARN
> [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:03,493 [myid:5] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
> 1, my id = 5, error =
> java.net.SocketException: Connection timed out (Read failed)
>        at java.net.SocketInputStream.socketRead0(Native Method)
>        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>        at java.net.SocketInputStream.read(SocketInputStream.java:171)
>        at java.net.SocketInputStream.read(SocketInputStream.java:141)
>        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>        at java.io.DataInputStream.readInt(DataInputStream.java:387)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
> 2017-05-10 11:37:03,493 [myid:5] - WARN
> [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
> 2017-05-10 11:37:03,494 [myid:5] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
> for message on queue
> java.lang.InterruptedException
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
>        at
> java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
>        at
> java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
>        at
> org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
> 2017-05-10 11:37:03,494 [myid:5] - WARN
> [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
> 2017-05-10 11:37:16,607 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,611 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created
> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> datadir /opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2
> 2017-05-10 11:37:16,613 [myid:5] - INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING - LEADER
> ELECTION TOOK - 942220
> 2017-05-10 11:37:16,615 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> 2017-05-10 11:37:16,615 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
> 2017-05-10 11:37:16,615 [myid:5] - INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
> FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)

Mime
View raw message