zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ben Sherman <bsher...@taser.com>
Subject Leader failed, no quorum was able to be reached.
Date Fri, 16 Sep 2016 21:16:50 GMT
Hi all,

I run a 3 node ZK ensemble and I had a hardware failure of the leader, followed by an outage
of the entire ensemble because no quorum could be reached. I have looked at the logs and can’t
figure out what happened, and would like some help trying to figure out how to fix it or if
there is a bug here.

I’m running 3.4.8 on centos 7.

3 nodes, 1, 2, and 3.  2 was the leader and went down.  The ensemble remained unavailable
until node 2 was brought back online where continues to operate normally.

1 looked as if it participated in an election, figured out that 3 would be the leader, then
tried to connect but got a connection refused on port 2888, which is what I understand how
the ensemble members use to communicate with each other about quorums.

I also don’t see 3 attempting to connect to 1, which I find alarming.

Any ideas?


Logs from Node 1:

2016-09-15 14:00:46,812 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89]
- Exception when following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:46,812 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166]
- shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
2016-09-15 14:00:46,817 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140]
- Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469]
- shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107]
- Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184]
- Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415]
- shutdown of request processor complete
2016-09-15 14:00:46,817 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209]
- Shutting down
2016-09-15 14:00:46,817 [myid:1] - INFO  [SyncThread:1:SyncRequestProcessor@187] - SyncRequestProcessor
exited!
2016-09-15 14:00:46,817 [myid:1] - INFO  [FollowerRequestProcessor:1:FollowerRequestProcessor@97]
- FollowerRequestProcessor exited loop!
2016-09-15 14:00:46,817 [myid:1] - INFO  [CommitProcessor:1:CommitProcessor@153] - CommitProcessor
exited loop!
2016-09-15 14:00:46,818 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774]
- LOOKING
2016-09-15 14:00:46,883 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FileSnap@83]
- Reading snapshot /opt/data-zookeeper/version-2/snapshot.2b0004b82a
2016-09-15 14:00:49,194 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,395 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,377 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818]
- New election. My id =  1, proposed zxid=0x2b00061453
2016-09-15 14:00:57,378 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@600] -
Notification: 1 (message format version), 1 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,378 [myid:1] - INFO  [WorkerReceiver[myid=1]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 1 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:57,580 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844]
- FOLLOWING
2016-09-15 14:00:57,580 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@170]
- Created server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000 datadir
/opt/data-zookeeper/version-2 snapdir /opt/data-zookeeper/version-2
2016-09-15 14:00:57,580 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@63]
- FOLLOWING - LEADER ELECTION TOOK - 10762
2016-09-15 14:00:57,621 [myid:1] - INFO  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149]
- Resolved hostname: zoo3 to address: zoo3/10.10.10.102
2016-09-15 14:00:57,622 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236]
- Unexpected exception, tries=0, connecting to zoo3/10.10.10.102:2888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)
        at java.net.Socket.connect(Socket.java:589)
        at org.apache.zookeeper.server.quorum.Learner.connectToLeader(Learner.java:228)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:71)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:58,623 [myid:1] - WARN  [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Learner@236]
- Unexpected exception, tries=1, connecting to zoo3/10.10.10.102:2888
java.net.ConnectException: Connection refused
        at java.net.PlainSocketImpl.socketConnect(Native Method)
        at java.net.AbstractPlainSocketImpl.doConnect(AbstractPlainSocketImpl.java:350)
        at java.net.AbstractPlainSocketImpl.connectToAddress(AbstractPlainSocketImpl.java:206)
        at java.net.AbstractPlainSocketImpl.connect(AbstractPlainSocketImpl.java:188)
        at java.net.SocksSocketImpl.connect(SocksSocketImpl.java:392)

Logs from node 3:
2016-09-15 14:00:46,792 [myid:3] - WARN  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@89]
- Exception when following the leader
java.net.SocketTimeoutException: Read timed out
        at java.net.SocketInputStream.socketRead0(Native Method)
        at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
        at java.net.SocketInputStream.read(SocketInputStream.java:170)
        at java.net.SocketInputStream.read(SocketInputStream.java:141)
        at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
        at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
        at java.io.DataInputStream.readInt(DataInputStream.java:387)
        at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
        at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
        at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
        at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
        at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:846)
2016-09-15 14:00:46,793 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:Follower@166]
- shutdown called
java.lang.Exception: shutdown Follower
        at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
        at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
2016-09-15 14:00:46,798 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140]
- Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@469]
- shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FollowerRequestProcessor@107]
- Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:CommitProcessor@184]
- Shutting down
2016-09-15 14:00:46,798 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FinalRequestProcessor@415]
- shutdown of request processor complete
2016-09-15 14:00:46,798 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:SyncRequestProcessor@209]
- Shutting down
2016-09-15 14:00:46,799 [myid:3] - INFO  [SyncThread:3:SyncRequestProcessor@187] - SyncRequestProcessor
exited!
2016-09-15 14:00:46,799 [myid:3] - INFO  [FollowerRequestProcessor:3:FollowerRequestProcessor@97]
- FollowerRequestProcessor exited loop!
2016-09-15 14:00:46,799 [myid:3] - INFO  [CommitProcessor:3:CommitProcessor@153] - CommitProcessor
exited loop!
2016-09-15 14:00:46,799 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774]
- LOOKING
2016-09-15 14:00:46,838 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FileSnap@83]
- Reading snapshot /opt/data-zookeeper/version-2/snapshot.2b0004fab9
2016-09-15 14:00:49,193 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818]
- New election. My id =  3, proposed zxid=0x2b00061453
2016-09-15 14:00:49,194 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,394 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 400
2016-09-15 14:00:49,394 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:49,794 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 800
2016-09-15 14:00:49,795 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:50,595 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 1600
2016-09-15 14:00:50,595 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:52,195 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 3200
2016-09-15 14:00:52,196 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:00:55,396 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 6400
2016-09-15 14:00:55,396 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:01,796 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 12800
2016-09-15 14:01:01,796 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:14,597 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 25600
2016-09-15 14:01:14,597 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
2016-09-15 14:01:40,197 [myid:3] - INFO  [QuorumPeer[myid=3]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852]
- Notification time out: 51200
2016-09-15 14:01:40,197 [myid:3] - INFO  [WorkerReceiver[myid=3]:FastLeaderElection@600] -
Notification: 1 (message format version), 3 (n.leader), 0x2b00061453 (n.zxid), 0xd (n.round),
LOOKING (n.state), 3 (n.sid), 0x2b (n.peerEpoch) LOOKING (my state)
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client
has closed socket
        at java.lang.Thread.run(Thread.java:745)
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client
has closed socket
        at java.lang.Thread.run(Thread.java:745)
EndOfStreamException: Unable to read additional data from client sessionid 0x0, likely client
has closed socket
        at java.lang.Thread.run(Thread.java:745)

Mime
View raw message