zookeeper-user mailing list archives

From Anand Parthasarathy <anpar...@avinetworks.com>
Subject Re: Zookeeper leader election takes a long time.
Date Wed, 12 Oct 2016 00:33:54 GMT
Hi Ben,

We have only tried with 3.4.6. We can potentially try with 3.4.8 or 3.4.9.

Thanks,
Anand.

On Tue, Oct 11, 2016 at 5:29 PM, Ben Sherman <bensherman@gmail.com> wrote:

> Anand, in your lab are you able to replicate this with 3.4.8 or 3.4.9?
>
> On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <anpartha@avinetworks.com> wrote:
>
> > Folks,
> >
> > Any insight into this or any workarounds that you can think of to mitigate
> > against this issue? We have isolated it to a test setup, where we are able
> > to reproduce this somewhat consistently if we keep a node powered off.
> >
> > Thanks,
> > Anand.
> >
> > On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy <anpartha@avinetworks.com> wrote:
> >
> > > Hi Flavio,
> > >
> > > I have attached the logs from node 1 and node 3. Node 2 was powered off
> > > around 10-03 12:36. Leader election kept going until 10-03 15:57:16 when it
> > > finally converged.
> > >
> > > Thanks,
> > > Anand.
> > >
> > > On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <fpj@apache.org> wrote:
> > >
> > >> Hi Anand,
> > >>
> > >> I don't understand whether 1 and 3 were able or even trying to connect to
> > >> each other. They should be able to elect a leader between them and make
> > >> progress. You might want to upload logs and let us know.
> > >>
> > >> -Flavio
> > >>
> > >> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <anpartha@avinetworks.com> wrote:
> > >> >
> > >> > Hi,
> > >> >
> > >> > We are currently using zookeeper 3.4.6 version and use a 3 node solution in
> > >> > our system. We see that occasionally, when a node is powered off (in this
> > >> > instance, it was actually a leader node), the remaining two nodes do not
> > >> > form a quorum for a really long time. Looking at the logs, it appears the
> > >> > sequence is as follows:
> > >> > - Node 2 is the zookeeper leader
> > >> > - Node 2 is powered off
> > >> > - Node 1 and Node 3 recognize and start the election
> > >> > - Node 3 times out after initLimit * tickTime with "Timeout while waiting
> > >> >   for quorum" for Round N
> > >> > - Node 1 times out after initLimit * tickTime with "Exception while trying
> > >> >   to follow leader" for Round N+1 at the same time.
> > >> > - And the process continues where N is sequentially incrementing.
> > >> > - This happens for a long time.
> > >> > - In one instance, we used tickTime=5000 and initLimit=20 and it took
> > >> >   around 3.5 hours to converge.
> > >> > - In a given round, Node 1 tries connecting to Node 2, gets connection
> > >> >   refused, and waits for the notification timeout, which increases by 2
> > >> >   every iteration until it hits the initLimit. The connection is refused
> > >> >   because Node 2 comes up after reboot, but the zookeeper process is not
> > >> >   started (due to a different failure).
> > >> >
> > >> > It looks similar to ZOOKEEPER-2164 but there it is a connection timeout
> > >> > where Node 2 is not reachable.
> > >> >
> > >> > Could you please share whether you have seen this issue and, if so, what
> > >> > workaround can be employed in 3.4.6?
> > >> >
> > >> > Thanks,
> > >> > Anand.
> > >>
> > >>
> > >
> >
>
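
For reference, the timeout arithmetic from the original report can be sketched as follows. Only tickTime=5000, initLimit=20 and the 3.5 hour figure come from the thread. The function names are hypothetical, and the round count rests on an assumption the logs do not confirm: that every failed election round lasts exactly one initLimit * tickTime window.

```python
# Values quoted in the original report (zoo.cfg settings).
TICK_TIME_MS = 5000   # tickTime, in milliseconds
INIT_LIMIT = 20       # initLimit, in ticks


def round_timeout_seconds(tick_time_ms: int, init_limit: int) -> float:
    """Length of one initLimit * tickTime timeout window, in seconds."""
    return tick_time_ms * init_limit / 1000.0


def estimated_failed_rounds(duration_hours: float,
                            tick_time_ms: int,
                            init_limit: int) -> float:
    """Estimate how many timeout windows fit in the given duration.

    ASSUMPTION (not from the logs): each failed round lasts exactly
    one initLimit * tickTime window, with no extra delay in between.
    """
    window_s = round_timeout_seconds(tick_time_ms, init_limit)
    return duration_hours * 3600.0 / window_s


timeout_s = round_timeout_seconds(TICK_TIME_MS, INIT_LIMIT)
failed_rounds = estimated_failed_rounds(3.5, TICK_TIME_MS, INIT_LIMIT)
print(f"timeout per round: {timeout_s} s")
print(f"estimated failed rounds in 3.5 h: {failed_rounds}")
```

Under that assumption, each round times out after 100 seconds, so 3.5 hours corresponds to roughly 126 consecutive failed rounds.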
