zookeeper-user mailing list archives

From Flavio Junqueira <...@apache.org>
Subject Re: Zookeeper leader election takes a long time.
Date Sat, 08 Oct 2016 14:55:23 GMT
Hi Anand,

I can't tell from your description whether nodes 1 and 3 were able to connect to each other,
or were even trying to. The two of them should be able to elect a leader between themselves
and make progress. You might want to upload the logs and let us know.

-Flavio
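
One quick way to answer the connectivity question before digging through the logs is to probe each surviving peer's election port from the other one. The following is only a minimal sketch: the hostnames `node1` and `node3` are hypothetical, and port 3888 is an assumption based on the commonly used `server.N=host:2888:3888` layout, so substitute the values from your own zoo.cfg.

```python
import socket


def can_connect(host: str, port: int, timeout_s: float = 2.0) -> bool:
    """Return True if a TCP connection to host:port succeeds within timeout_s."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        # Covers connection refused, timeouts, and name-resolution failures.
        return False


def report(peers: dict) -> dict:
    """Map "host:port" to True/False for each peer in the given dict."""
    return {f"{host}:{port}": can_connect(host, port) for host, port in peers.items()}


# Hypothetical peers and election port (assumption, not taken from the report).
PEERS = {"node1": 3888, "node3": 3888}
```

Calling `report(PEERS)` on Node 1 and again on Node 3 shows whether each side can open a connection to the other. A refused connection (as reported for Node 2) and an unreachable host both show up as `False` here, so the logs are still needed to tell the two apart.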
 
> On 08 Oct 2016, at 02:11, Anand Parthasarathy <anpartha@avinetworks.com> wrote:
> 
> Hi,
> 
> We are currently running ZooKeeper 3.4.6 in a 3-node setup in our
> system. Occasionally, when a node is powered off (in this instance, it
> was the leader node), the remaining two nodes do not form a quorum for
> a very long time. Looking at the logs, the sequence appears to be as
> follows:
> - Node 2 is the zookeeper leader
> - Node 2 is powered off
> - Node 1 and Node 3 detect the failure and start an election
> - Node 3 times out after initLimit * tickTime with "Timeout while waiting
> for quorum" for Round N
> - Node 1 times out after initLimit * tickTime with "Exception while trying
> to follow leader" for Round N+1 at the same time.
> - The process repeats, with N incrementing by one each round.
> - This goes on for a long time.
> - In one instance, we used tickTime=5000 and initLimit=20 and it took
> around 3.5 hours to converge.
> - In a given round, Node 1 tries to connect to Node 2, gets connection
> refused, and waits for the notification timeout, which increases by a
> factor of 2 every iteration until it hits the initLimit. The connection
> is refused because Node 2 comes back up after the reboot, but the
> ZooKeeper process is not started (due to a different failure).
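
To put the numbers above in perspective, here is a rough back-of-the-envelope sketch. It assumes that every failed round lasts the full initLimit * tickTime and that nothing else contributes to the delay, which is a simplification and not something the logs confirm. The function names are made up for illustration.

```python
TICK_TIME_MS = 5000   # tickTime from the report
INIT_LIMIT = 20       # initLimit from the report
OBSERVED_HOURS = 3.5  # approximate time to converge, from the report


def round_timeout_s(tick_time_ms: int, init_limit: int) -> float:
    """Seconds before a round gives up: initLimit * tickTime."""
    return init_limit * tick_time_ms / 1000.0


def rounds_in(duration_hours: float, per_round_s: float) -> float:
    """How many full-length rounds fit in the given duration (assumes no other delays)."""
    return duration_hours * 3600 / per_round_s


per_round = round_timeout_s(TICK_TIME_MS, INIT_LIMIT)
print(f"timeout per round: {per_round} s")                        # 100.0 s
print(f"rounds in {OBSERVED_HOURS} h: {rounds_in(OBSERVED_HOURS, per_round)}")  # 126.0
```

Under those assumptions, each round costs 100 seconds, so 3.5 hours corresponds to roughly 126 consecutive failed rounds.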
> 
> This looks similar to ZOOKEEPER-2164, but that case involves a
> connection timeout, where Node 2 is not reachable.
> 
> Could you please share whether you have seen this issue and, if so,
> what workaround can be employed in 3.4.6?
> 
> Thanks,
> Anand.

