zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jim Wong <...@airtime.com>
Subject Re: Zookeeper leader is terminated in AWS and cluster never recovers
Date Wed, 27 Jan 2016 06:19:38 GMT
Folks,

Does anyone have any thoughts on this one? We can semi-reliably cause our cluster to get into
a seemingly permanent bad stat--at least several minutes, which is longer than we can tolerate
in production--just by terminating the leader. We’ve seen recommendations to change things
like the amount of time allotted for the database to be reconstructed from the last snapshot,
but none of them have helped.

If nothing else, are there any particular log messages we should be looking for or troubleshooting
strategies we can pursue?



> On Jan 25, 2016, at 11:30 AM, Ken Mamitsuka <ken@airtime.com> wrote:
> 
> This may be related to the post "All nodes shutting down simultaneously",
> but I'm not sure. I was able to reproduce this quite a few times. I
> terminated the leader of a 5 node zookeeper cluster (terminated from the
> AWS console). All the remaining hosts saw the leader disappear:
> 
> Jan 20 18:40:05 localhost zookeeper: WARN
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> Jan 20 18:40:05 localhost java.net.SocketTimeoutException: Read timed out
> 
> Then it looks like a shutdown of the ZooKeeperServer and  its processors:
> 
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@467] - shutting
> down
> 
> I’m not sure if we should see startup logs or not. Then we see the leader
> election appear to be successful (I’m not sure if the “ZooKeeperServer not
> running” log is ominous or not):
> 
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FileSnap@83] - Reading snapshot
> /zookeeper/data/version-2/snapshot.8005dc366
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  2, proposed zxid=0x8005e12e8
> Jan 20 18:40:05 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> ...
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@844] - FOLLOWING
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:ZooKeeperServer@168] - Created
> server with tickTime 2000 minSessionTimeout 4000 maxSessionTimeout 40000
> datadir /zookeeper/data/version-2 snapdir /zookeeper/data/version-2
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@63] - FOLLOWING - LEADER
> ELECTION TOOK - 347
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 4 (n.sid), 0x8 (n.peerEpoch) FOLLOWING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer$QuorumServer@149] -
> Resolved hostname: zookeeper-5-internal.prod.airtime.com to address:
> zookeeper-5-internal.prod.airtime.com/10.1.24.208
> 
> However, attempts to connect to the new leader just fail forever:
> 
> Jan 20 18:40:06 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=0, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> Jan 20 18:40:07 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=1, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> Jan 20 18:40:08 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=2, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> Jan 20 18:40:09 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Learner@236] - Unexpected
> exception, tries=3, connecting to
> zookeeper-5-internal.prod.airtime.com/10.1.24.208:2888
> 
> Then it throws an exception and shuts down:
> 
> Jan 20 18:40:10 localhost zookeeper: WARN
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when
> following the leader
> Jan 20 18:40:10 localhost java.net.ConnectException: Connection refused
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
> Jan 20 18:40:10 localhost java.lang.Exception: shutdown Follower
> Jan 20 18:40:10 localhost     at
> org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
> Jan 20 18:40:10 localhost     at
> org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:850)
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FollowerZooKeeperServer@140] -
> Shutting down
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:QuorumPeer@774] - LOOKING
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  2, proposed zxid=0x8005e12e8
> Jan 20 18:40:10 localhost zookeeper: INFO
> [WorkerReceiver[myid=2]:FastLeaderElection@600] - Notification: 1 (message
> format version), 2 (n.leader), 0x8005e12e8 (n.zxid), 0xe2 (n.round),
> LOOKING (n.state), 2 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> 
> And then the fast leader election fails essentially forever:
> 
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 400
> Jan 20 18:40:10 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 800
> Jan 20 18:40:11 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 1600
> Jan 20 18:40:13 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 3200
> Jan 20 18:40:16 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 6400
> Jan 20 18:40:22 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 12800
> Jan 20 18:40:35 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 25600
> Jan 20 18:41:01 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 51200
> Jan 20 18:41:52 localhost zookeeper: INFO
> [QuorumPeer[myid=2]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 60000
> 
> Looking at the node elected as the new leader (zookeeper-5.prod.airtime.com),
> it doesn’t seem like it ever recognizes that it was the leader of the
> original election. I’m not sure if it’s because it can’t start or something
> else.
> 
> Jan 20 18:40:05 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@818] - New
> election. My id =  5, proposed zxid=0x8005e12e8
> Jan 20 18:40:05 localhost zookeeper: INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 400
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:06 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 800
> Jan 20 18:40:06 localhost zookeeper: INFO
> [WorkerReceiver[myid=5]:FastLeaderElection@600] - Notification: 1 (message
> format version), 5 (n.leader), 0x8005e12e8 (n.zxid), 0xe1 (n.round),
> LOOKING (n.state), 5 (n.sid), 0x8 (n.peerEpoch) LOOKING (my state)
> Jan 20 18:40:07 localhost zookeeper: INFO
> [QuorumPeer[myid=5]/0:0:0:0:0:0:0:0:2181:FastLeaderElection@852] -
> Notification time out: 1600
> 
> A couple of additional notes:
> 
>   1. Restarting zookeeper on all hosts seemed to resolve the issue.
>   2. This was easily reproducible for me in 3 different zookeeper clusters
>   in AWS on Jan 20/21. Other things like killing follower hosts worked fine.
>   3. I wasn't able to reproduce this yesterday (Jan 24th), so I wonder if
>   anything network related could make this situation fail, while everything
>   else worked fine.


Mime
View raw message