zookeeper-user mailing list archives

From: Ken Mamitsuka <...@airtime.com>
Subject: Zookeeper leader is terminated in AWS and cluster never recovers
Date: Mon, 25 Jan 2016 19:30:03 GMT
This may be related to the post "All nodes shutting down simultaneously",
but I'm not sure. I was able to reproduce this quite a few times: I
terminated the leader of a 5-node ZooKeeper cluster from the AWS console,
and all of the remaining hosts saw the leader disappear:

Jan 20 18:40:05 localhost zookeeper: WARN
[QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out

Then there is what looks like a shutdown of the ZooKeeperServer and its
processors:

Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting
down

I’m not sure whether we should see startup logs here or not. Next, the
leader election appears to succeed (I’m not sure whether the
“ZooKeeperServer not running” log is ominous or not):

Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
/zookeeper/data/version-2/snapshot.8005dc366
Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  2, proposed zxid=0x8005e12e8
Jan 20 18:40:05 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
...
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created
server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER
ELECTION TOOK - 347
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
Resolved hostname: zookeeper-5-internal.prod.airtime.com to address:
zookeeper-5-internal.prod.airtime.com/10.1.24.208
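
For what it’s worth, the outcome of this election makes sense to me:
FastLeaderElection orders votes by (peer epoch, zxid, server id), so with
every surviving node reporting epoch 0x8 and the same zxid 0x8005e12e8,
the highest sid (5) wins. A rough sketch of that ordering (my paraphrase
of the documented comparison, not the actual ZooKeeper source):

// Sketch of FastLeaderElection's vote ordering: a proposed vote beats the
// current one if it has a higher peer epoch, or the same epoch and a
// higher zxid, or both equal and a higher server id.
public class VoteOrder {
    static boolean beats(long newEpoch, long newZxid, long newSid,
                         long curEpoch, long curZxid, long curSid) {
        if (newEpoch != curEpoch) return newEpoch > curEpoch;
        if (newZxid != curZxid) return newZxid > curZxid;
        return newSid > curSid;
    }

    public static void main(String[] args) {
        long epoch = 0x8L, zxid = 0x8005e12e8L; // values from the logs above
        long winner = 2;                        // first voter seen in the logs
        for (long sid : new long[] {4, 5}) {    // the other voters we can see
            if (beats(epoch, zxid, sid, epoch, zxid, winner)) winner = sid;
        }
        System.out.println("winning sid = " + winner); // prints 5
    }
}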

However, the follower’s attempts to connect to the new leader on port 2888
just fail forever:

Jan 20 18:40:06 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=0, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
Jan 20 18:40:07 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=1, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
Jan 20 18:40:08 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=2, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
Jan 20 18:40:09 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
exception, tries=3, connecting to
zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
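
The tries=N pattern looks like a small, fixed number of connection
attempts to the leader’s port 2888, roughly one second apart, before the
follower gives up. A minimal sketch of that kind of bounded retry loop
(the retry count, connect timeout, and one-second pause are my guesses
from the timestamps, not the exact ZooKeeper values):

import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;

// Bounded connect-and-retry, in the spirit of the "Unexpected exception,
// tries=N" warnings above: try a few times, pause between attempts, and
// give up by rethrowing the last failure.
public class ConnectWithRetries {
    static Socket connect(String host, int port)
            throws IOException, InterruptedException {
        IOException last = null;
        for (int tries = 0; tries < 5; tries++) {
            Socket sock = new Socket();
            try {
                sock.connect(new InetSocketAddress(host, port), 4000);
                return sock;
            } catch (IOException e) {
                try { sock.close(); } catch (IOException ignored) {}
                last = e;
                System.out.println("Unexpected exception, tries=" + tries
                        + ", connecting to " + host + ":" + port);
                Thread.sleep(1000); // matches the one-second log spacing
            }
        }
        throw last; // all attempts failed; the follower shuts down
    }

    public static void main(String[] args) throws Exception {
        connect("zookeeper-5-internal.prod.airtime.com", 2888);
    }
}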

Then it hits a ConnectException, shuts down the follower, and drops back
into leader election:

Jan 20 18:40:10 localhost zookeeper: WARN
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
following the leader
Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
Jan 20 18:40:10 localhost     at
org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
Jan 20 18:40:10 localhost     at
org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
Shutting down
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  2, proposed zxid=0x8005e12e8
Jan 20 18:40:10 localhost zookeeper: INFO
 [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)

And then fast leader election times out over and over, essentially
forever, with the notification timeout doubling until it hits the
60-second cap:

Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 400
Jan 20 18:40:10 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 800
Jan 20 18:40:11 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 1600
Jan 20 18:40:13 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 3200
Jan 20 18:40:16 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 6400
Jan 20 18:40:22 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 12800
Jan 20 18:40:35 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 25600
Jan 20 18:41:01 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 51200
Jan 20 18:41:52 localhost zookeeper: INFO
 [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 60000
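
The timeout values themselves are just a capped exponential backoff: each
time no notification arrives within the window, the wait doubles up to a
60-second ceiling. A sketch that reproduces the logged sequence (the
200 ms starting point and the cap are inferred from the numbers above):

// Capped doubling behind the "Notification time out" lines.
public class NotificationBackoff {
    public static void main(String[] args) {
        int timeout = 200;      // inferred initial wait, in ms
        final int max = 60000;  // ceiling seen in the logs
        for (int i = 0; i < 10; i++) {
            timeout = Math.min(timeout * 2, max);
            System.out.println("Notification time out: " + timeout);
        }
        // prints 400, 800, 1600, ... 51200, 60000, 60000 -- as logged
    }
}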

Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com),
it never seems to recognize that it won the original election; it just
keeps timing out in the LOOKING state. I’m not sure whether that’s because
it can’t start or something else.

Jan 20 18:40:05 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
election. My id =  5, proposed zxid=0x8005e12e8
Jan 20 18:40:05 localhost zookeeper: INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 400
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:06 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 800
Jan 20 18:40:06 localhost zookeeper: INFO
 [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
Jan 20 18:40:07 localhost zookeeper: INFO
 [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
Notification time out: 1600
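
One thing I plan to check the next time this happens is what role each
node thinks it holds, via the “srvr” four-letter command on the client
port (a serving node replies with a “Mode: leader” or “Mode: follower”
line; a node stuck electing reports that it isn’t serving requests). A
quick probe, using the one hostname from the logs (substitute the full
ensemble):

import java.io.InputStream;
import java.io.OutputStream;
import java.net.Socket;

// Send ZooKeeper's "srvr" four-letter command to each host's client port
// and print the reply, which includes the node's current Mode.
public class SrvrProbe {
    public static void main(String[] args) throws Exception {
        String[] hosts = {
            "zookeeper-5-internal.prod.airtime.com", // add the other nodes here
        };
        for (String host : hosts) {
            try (Socket sock = new Socket(host, 2181)) {
                OutputStream out = sock.getOutputStream();
                out.write("srvr".getBytes());
                out.flush();
                InputStream in = sock.getInputStream();
                byte[] buf = new byte[8192];
                int off = 0, n;
                while (off < buf.length
                        && (n = in.read(buf, off, buf.length - off)) > 0) {
                    off += n;
                }
                System.out.println(host + " ->\n" + new String(buf, 0, off));
            } catch (Exception e) {
                System.out.println(host + " -> " + e);
            }
        }
    }
}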

A few additional notes:

   1. Restarting ZooKeeper on all hosts seemed to resolve the issue.
   2. This was easily reproducible for me in 3 different ZooKeeper clusters
   in AWS on Jan 20/21. Other things, like killing follower hosts, worked
   fine.
   3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder
   whether something network-related could be triggering this failure while
   everything else works fine.
