zookeeper-user mailing list archives

From Anand Parthasarathy <anpar...@avinetworks.com>
Subject Re: Zookeeper leader election takes a long time.
Date Sat, 08 Oct 2016 17:04:15 GMT
Yes, we had noted in the past that if node 2 comes back up, the ensemble
converges. In this case, ZooKeeper could not be started on node 2, and it
took 3 hours to converge.

Thanks,
Anand.

On Sat, Oct 8, 2016 at 9:50 AM, Ben Sherman <bensherman@gmail.com> wrote:

> I had this exact problem but couldn't wait for them to reestablish the
> connection; when node 2 came back up, everything recovered. It was with
> 3.4.8, and I posted logs in a previous message.
>
> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <fpj@apache.org> wrote:
>
> > Hi Anand,
> >
> > I don't understand whether 1 and 3 were able or even trying to connect to
> > each other. They should be able to elect a leader between them and make
> > progress. You might want to upload logs and let us know.
> >
> > -Flavio
> >
> > > On 08 Oct 2016, at 02:11, Anand Parthasarathy
> > > <anpartha@avinetworks.com> wrote:
> > >
> > > Hi,
> > >
> > > > We are currently using ZooKeeper 3.4.6 in a 3-node setup in our
> > > > system. We see that occasionally, when a node is powered off (in this
> > > > instance, it was actually the leader node), the remaining two nodes do
> > > > not form a quorum for a really long time. Looking at the logs, the
> > > > sequence appears to be as follows:
> > > - Node 2 is the zookeeper leader
> > > - Node 2 is powered off
> > > - Node 1 and Node 3 recognize and start the election
> > > > - Node 3 times out after initLimit * tickTime with "Timeout while
> > > > waiting for quorum" for Round N
> > > > - Node 1 times out after initLimit * tickTime with "Exception while
> > > > trying to follow leader" for Round N+1 at the same time.
> > > > - And the process continues, with N incrementing sequentially.
> > > > - This happens for a long time.
> > > > - In one instance, we used tickTime=5000 and initLimit=20 and it took
> > > > around 3.5 hours to converge.
> > > > - In a given round, Node 1 tries connecting to Node 2, gets connection
> > > > refused, and waits for the notification timeout, which increases by 2
> > > > every iteration until it hits the initLimit. The connection is refused
> > > > because node 2 comes up after the reboot, but the zookeeper process is
> > > > not started (due to a different failure).
> > >
> > > It looks similar to ZOOKEEPER-2164 but there it is a connection timeout
> > > where Node 2 is not reachable.
> > >
> > > > Could you please share whether you have seen this issue and, if so,
> > > > what workaround can be employed in 3.4.6?
> > >
> > > Thanks,
> > > Anand.
> >
> >
>
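
The timing figures quoted in the original report can be worked through
with a little arithmetic. The sketch below is only a rough estimate: it
assumes each failed round lasts exactly initLimit * tickTime and that
rounds run back to back, which the thread does not confirm. The function
names are hypothetical and not part of ZooKeeper.

```python
def round_timeout_seconds(tick_time_ms: int, init_limit: int) -> float:
    """Length of one timed-out round, taken as initLimit * tickTime."""
    return tick_time_ms * init_limit / 1000.0


def rounds_in(duration_hours: float, tick_time_ms: int, init_limit: int) -> int:
    """Whole rounds that fit in the given duration.

    Assumption: rounds are back to back with no extra delay between them.
    """
    per_round = round_timeout_seconds(tick_time_ms, init_limit)
    return int(duration_hours * 3600 // per_round)


if __name__ == "__main__":
    # Values from the report: tickTime=5000, initLimit=20, about 3.5 hours.
    print(round_timeout_seconds(5000, 20))  # 100.0 seconds per round
    print(rounds_in(3.5, 5000, 20))         # 126 rounds
```

Under these assumptions, a 3.5 hour convergence corresponds to something
on the order of a hundred failed rounds, so lowering tickTime or initLimit
would shorten each round but would not by itself explain why the rounds
keep failing.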

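The report also describes a notification timeout that "increases by 2
every iteration until it hits the initLimit". The sketch below reads that
as a doubling backoff with a cap, which is an interpretation rather than
something the thread states outright. The 200 ms starting value is a
placeholder chosen for illustration, and the cap is taken as
initLimit * tickTime following the report's wording; neither is confirmed
here as ZooKeeper's actual behavior.

```python
from typing import List


def notification_timeouts(start_ms: int, cap_ms: int) -> List[int]:
    """Successive timeouts that double each iteration until reaching the cap.

    Assumption: "increases by 2" in the report means the timeout doubles.
    """
    if start_ms <= 0 or cap_ms < start_ms:
        raise ValueError("need 0 < start_ms <= cap_ms")
    timeouts = []
    current = start_ms
    while current < cap_ms:
        timeouts.append(current)
        current *= 2
    timeouts.append(cap_ms)  # the final wait is clamped to the cap
    return timeouts


if __name__ == "__main__":
    # Hypothetical start of 200 ms; cap of 20 * 5000 ms from the report.
    schedule = notification_timeouts(200, 20 * 5000)
    print(schedule)
    print(sum(schedule) / 1000.0, "seconds spent waiting in total")
```

With a schedule like this, most of the waiting time in a round comes from
the last few iterations, which fits the report's observation that each
round runs all the way to the timeout while node 2 keeps refusing
connections.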