zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Sherman <bensher...@gmail.com>
Subject Follower drops out of quorum, can't reconnect
Date Wed, 10 May 2017 19:29:43 GMT
I have been chasing this problem for months and I can't figure out what is
happening.

We have a 5 node cluster running 3.4.10 we saw this in .8 and .9 as well),
and sometimes, a node gets a read timeout, drops all the connections and
tries to re-establish itself to the quorum.  It can usually do this in a
few seconds, but last night it took almost 15 minutes to reconnect.

These are 5 servers in AWS, and we've tried tuning the timeouts, but the
are exceeding any reasonable timeout and still failing.

Any hints here?

2017-05-10 11:21:34,386 [myid:5] - WARN
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at
org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at
org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at
org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at
org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
2017-05-10 11:21:34,387 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
java.lang.Exception: shutdown Follower
        at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
2017-05-10 11:21:34,387 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
2017-05-10 11:21:34,388 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
socket connection for client /192.168.23.109:40666 which had sessionid
0x55bd553c076002d
[snip .. zk closes all its connections]
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:NIOServerCnxn@1044] - Closed
socket connection for client /192.168.23.88:65025 which had sessionid
0x55bd553c0760765
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
Shutting down
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@505] - shutting
down
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107] -
Shutting down
2017-05-10 11:21:34,391 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184] - Shutting
down
2017-05-10 11:21:34,392 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@402] -
shutdown of request processor complete
2017-05-10 11:21:34,392 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@208] -
Shutting down
2017-05-10 11:21:34,392 [myid:5] - INFO
 [CommitProcessor:5:CommitProcessor@153] - CommitProcessor exited loop!
2017-05-10 11:21:34,392 [myid:5] - INFO
 [SyncThread:5:SyncRequestProcessor@186] - SyncRequestProcessor exited!
2017-05-10 11:21:34,392 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@865] - LOOKING
2017-05-10 11:21:34,393 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  5, proposed zxid=0x4100013987
2017-05-10 11:21:34,393 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:34,393 [myid:5] - INFO
 [FollowerRequestProcessor:5:FollowerRequestProcessor@97] -
FollowerRequestProcessor exited loop!
2017-05-10 11:21:34,593 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 400
2017-05-10 11:21:34,594 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:34,994 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 800
2017-05-10 11:21:34,994 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:35,794 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 1600
2017-05-10 11:21:35,795 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:37,395 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 3200
2017-05-10 11:21:37,395 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:40,595 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 6400
2017-05-10 11:21:40,596 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:46,996 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 12800
2017-05-10 11:21:46,996 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:21:59,796 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 25600
2017-05-10 11:21:59,797 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
---
MANY minutes later, we see things finally timeout, connections are
re-established, and everything comes back:
---
2017-05-10 11:37:02,469 [myid:5] - WARN
 [RecvWorker:3:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
3, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:02,470 [myid:5] - WARN
 [RecvWorker:3:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:02,469 [myid:5] - WARN
 [RecvWorker:2:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
2, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:02,470 [myid:5] - WARN
 [RecvWorker:2:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:02,469 [myid:5] - WARN
 [RecvWorker:4:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
4, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:02,470 [myid:5] - WARN
 [RecvWorker:4:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:02,470 [myid:5] - WARN
 [SendWorker:3:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:02,471 [myid:5] - WARN
 [SendWorker:3:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:02,470 [myid:5] - WARN
 [SendWorker:4:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:02,471 [myid:5] - WARN
 [SendWorker:4:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:02,472 [myid:5] - WARN
 [SendWorker:2:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:02,472 [myid:5] - WARN
 [SendWorker:2:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:03,493 [myid:5] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@1028] - Connection broken for id
1, my id = 5, error =
java.net.SocketException: Connection timed out (Read failed)
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:171)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$RecvWorker.run(QuorumCnxManager.java:1013)
2017-05-10 11:37:03,493 [myid:5] - WARN
 [RecvWorker:1:QuorumCnxManager$RecvWorker@1031] - Interrupting SendWorker
2017-05-10 11:37:03,494 [myid:5] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@944] - Interrupted while waiting
for message on queue
java.lang.InterruptedException
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.reportInterruptAfterWait(AbstractQueuedSynchronizer.java:2014)
        at
java.util.concurrent.locks.AbstractQueuedSynchronizer$ConditionObject.awaitNanos(AbstractQueuedSynchronizer.java:2088)
        at
java.util.concurrent.ArrayBlockingQueue.poll(ArrayBlockingQueue.java:418)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.pollSendQueue(QuorumCnxManager.java:1097)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager.access$700(QuorumCnxManager.java:74)
        at
org.apache.zookeeper.server.quorum.QuorumCnxManager$SendWorker.run(QuorumCnxManager.java:932)
2017-05-10 11:37:03,494 [myid:5] - WARN
 [SendWorker:1:QuorumCnxManager$SendWorker@954] - Send worker leaving thread
2017-05-10 11:37:16,607 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 60000
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 1 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,611 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 2 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,613 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,613 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
LEADING (n.state), 3 (n.sid), 0x41 (n.peerEpoch) LOOKING (my state)
2017-05-10 11:37:16,613 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:QuorumPeer@935] - FOLLOWING
2017-05-10 11:37:16,613 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@173] - Created
server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
datadir /opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2
2017-05-10 11:37:16,613 [myid:5] - INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@64] - FOLLOWING - LEADER
ELECTION TOOK - 942220
2017-05-10 11:37:16,615 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x4100013987 (n.zxid), 0x20 (n.round),
LOOKING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
2017-05-10 11:37:16,615 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 5 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)
2017-05-10 11:37:16,615 [myid:5] - INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 3 (n.leader), 0x40000320fb (n.zxid), 0x1f (n.round),
FOLLOWING (n.state), 4 (n.sid), 0x41 (n.peerEpoch) FOLLOWING (my state)

Mime
  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message