zookeeper-user mailing list archives

From Anand Parthasarathy <anpar...@avinetworks.com>
Subject Zookeeper leader election takes a long time.
Date Sat, 08 Oct 2016 01:11:36 GMT
Hi,

We are currently running ZooKeeper 3.4.6 in a 3-node ensemble. Occasionally,
when a node is powered off (in this instance, it was the leader), the
remaining two nodes fail to form a quorum for a very long time. From the
logs, the sequence appears to be as follows:
- Node 2 is the zookeeper leader
- Node 2 is powered off
- Node 1 and Node 3 detect this and start an election
- Node 3 times out after initLimit * tickTime with "Timeout while waiting
for quorum" for Round N
- Node 1 times out after initLimit * tickTime with "Exception while trying
to follow leader" for Round N+1 at the same time.
- The process repeats, with N incrementing by one each time.
- This goes on for a long time.
- In one instance, with tickTime=5000 and initLimit=20, it took around
3.5 hours to converge.
- In a given round, Node 1 tries to connect to Node 2, gets "connection
refused", and then waits for the notification timeout, which doubles every
iteration until it reaches the initLimit. The connection is refused because
Node 2 comes back up after the reboot, but the ZooKeeper process is not
started (due to a different failure).
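
For a sense of scale, here is the back-of-the-envelope arithmetic for the numbers above, written as a small Python snippet. It is only a rough sketch: it assumes every failed round costs exactly initLimit * tickTime and ignores the notification backoff and any other delays. The variable names are illustrative and are not ZooKeeper identifiers.

```python
# Rough arithmetic from the values reported above (not ZooKeeper code).
TICK_TIME_MS = 5000   # tickTime from our configuration
INIT_LIMIT = 20       # initLimit from our configuration

# Each failed round waits initLimit * tickTime before giving up.
round_timeout_s = INIT_LIMIT * TICK_TIME_MS / 1000

# Observed time to converge: roughly 3.5 hours.
observed_s = 3.5 * 3600

# Assumption: every round burned the full timeout and nothing else.
approx_rounds = observed_s / round_timeout_s

print(f"per-round timeout: {round_timeout_s:.0f} s")
print(f"approximate failed rounds: {approx_rounds:.0f}")
```

Under that assumption, each round costs 100 seconds, so 3.5 hours corresponds to roughly 126 failed rounds.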

This looks similar to ZOOKEEPER-2164, but in that issue the failure is a
connection timeout because Node 2 is not reachable.
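
To make the distinction concrete, here is a minimal Python sketch that tells the two cases apart from another node. It is not part of ZooKeeper; the `probe` helper, its return labels, and the default timeout are assumptions made up for illustration.

```python
import socket

def probe(host, port, timeout_s=2.0):
    """Classify one TCP connection attempt (hypothetical helper).

    Returns "open", "refused", "timeout", or "error".
    """
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return "open"       # something is listening on the port
    except ConnectionRefusedError:
        return "refused"        # host is up, but nothing is listening
    except socket.timeout:
        return "timeout"        # no answer at all within timeout_s
    except OSError:
        return "error"          # anything else, e.g. name resolution failure
```

For example, calling `probe(node2_host, election_port)` with Node 2's address and the election port from its `server.N` line in zoo.cfg would be expected to return "refused" in our case (host rebooted, ZooKeeper not started) and "timeout" in the unreachable-host case.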

Could you please share whether you have seen this issue and, if so, what
workaround can be employed in 3.4.6?

Thanks,
Anand.
