zookeeper-user mailing list archives

From Anand Parthasarathy <anpar...@avinetworks.com>
Subject Re: Zookeeper leader election takes a long time.
Date Thu, 13 Oct 2016 17:09:44 GMT
Hi Michael,

We have reproduced this issue on a private AWS setup that has public IP
access. I will send you the details of the instance IP and the credentials
separately. If it needs to be shared with more people, I am happy to share
with them as well.

Thanks
Anand.

On Tue, Oct 11, 2016 at 3:46 PM, Michael Han <hanm@cloudera.com> wrote:

> Hi Anand,
>
> >> We have isolated it to a test setup, where we are able
> >> to reproduce this somewhat consistently if we keep a node powered off.
>
> Do you mind sharing your setup and the steps to reproduce, if the setup
> only involves ZooKeeper without other dependencies?
>
>
> On Tue, Oct 11, 2016 at 2:56 PM, Anand Parthasarathy <
> anpartha@avinetworks.com> wrote:
>
> > Folks,
> >
> > Sending a quick note again to find out if there is any insight the
> > community can offer in terms of a solution or workaround? We use zookeeper
> > for service discovery in our product and this issue has surfaced in a large
> > customer site a couple of times and we need to figure out a solution soon.
> >
> > Thanks,
> > Anand.
> >
> > On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <
> > anpartha@avinetworks.com> wrote:
> >
> > > Folks,
> > >
> > > Any insight into this or any workarounds that you can think of to
> > > mitigate against this issue? We have isolated it to a test setup, where we
> > > are able to reproduce this somewhat consistently if we keep a node powered
> > > off.
> > >
> > > Thanks,
> > > Anand.
> > >
> > > On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy <
> > > anpartha@avinetworks.com> wrote:
> > >
> > >> Hi Flavio,
> > >>
> > >> I have attached the logs from node 1 and node 3. Node 2 was powered off
> > >> around 10-03 12:36. Leader election kept going until 10-03 15:57:16 when
> > >> it finally converged.
> > >>
> > >> Thanks,
> > >> Anand.
> > >>
> > >> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <fpj@apache.org> wrote:
> > >>
> > >>> Hi Anand,
> > >>>
> > >>> I don't understand whether 1 and 3 were able or even trying to connect
> > >>> to each other. They should be able to elect a leader between them and
> > >>> make progress. You might want to upload logs and let us know.
> > >>>
> > >>> -Flavio
> > >>>
> > >>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy <
> > >>> > anpartha@avinetworks.com> wrote:
> > >>> >
> > >>> > Hi,
> > >>> >
> > >>> > We are currently using ZooKeeper version 3.4.6 in a 3-node setup in
> > >>> > our system. We see that occasionally, when a node is powered off (in
> > >>> > this instance, it was actually the leader node), the remaining two
> > >>> > nodes do not form a quorum for a really long time. Looking at the
> > >>> > logs, it appears the sequence is as follows:
> > >>> > - Node 2 is the zookeeper leader
> > >>> > - Node 2 is powered off
> > >>> > - Node 1 and Node 3 recognize and start the election
> > >>> > - Node 3 times out after initLimit * tickTime with "Timeout while
> > >>> >   waiting for quorum" for Round N
> > >>> > - Node 1 times out after initLimit * tickTime with "Exception while
> > >>> >   trying to follow leader" for Round N+1 at the same time.
> > >>> > - And the process continues where N is sequentially incrementing.
> > >>> > - This happens for a long time.
> > >>> > - In one instance, we used tickTime=5000 and initLimit=20 and it took
> > >>> >   around 3.5 hours to converge.
> > >>> > - In a given round, Node 1 tries connecting to Node 2, gets connection
> > >>> >   refused, and waits for the notification timeout, which increases by 2
> > >>> >   every iteration until it hits the initLimit. The connection is
> > >>> >   refused because Node 2 comes up after reboot, but the zookeeper
> > >>> >   process is not started (due to a different failure).
> > >>> >
> > >>> > It looks similar to ZOOKEEPER-2164, but there it is a connection
> > >>> > timeout where Node 2 is not reachable.
> > >>> >
> > >>> > Could you please share whether you have seen this issue and, if so,
> > >>> > what workaround can be employed in 3.4.6?
> > >>> >
> > >>> > Thanks,
> > >>> > Anand.
> > >>>
> > >>>
> > >>
> > >
> >
>
>
>
> --
> Cheers
> Michael.
>
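
For reference, the per-round timeout mentioned in the original report
(initLimit * tickTime) can be worked out from the values given there. This is
only a rough sketch: it assumes each failed round consumes one full timeout,
and the function names are illustrative, not part of ZooKeeper.

```python
# Values reported in the original message.
TICK_TIME_MS = 5000   # tickTime, in milliseconds
INIT_LIMIT = 20       # initLimit, in ticks


def round_timeout_seconds(tick_time_ms: int, init_limit: int) -> float:
    """Timeout for one round as described in the report: initLimit * tickTime."""
    return tick_time_ms * init_limit / 1000.0


def rounds_in(duration_hours: float, timeout_s: float) -> int:
    """Number of full timeout periods that fit in the given duration.

    Assumption: every failed round takes exactly one timeout period.
    """
    return int(duration_hours * 3600 // timeout_s)


timeout_s = round_timeout_seconds(TICK_TIME_MS, INIT_LIMIT)
print(timeout_s)                  # 100.0 seconds per round
print(rounds_in(3.5, timeout_s))  # 126 rounds in roughly 3.5 hours
```

Under that assumption, tickTime=5000 and initLimit=20 give 100 seconds per
round, so 3.5 hours corresponds to on the order of 126 failed rounds.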
