zookeeper-user mailing list archives

From Michael Han <h...@cloudera.com>
Subject Re: Zookeeper leader election takes a long time.
Date Tue, 11 Oct 2016 22:46:23 GMT
Hi Anand,

>> We have isolated it to a test setup, where we are able
to reproduce this somewhat consistently if we keep a node powered off.

Would you mind sharing your setup and steps to reproduce, if the setup only
involves ZooKeeper without other dependencies?


On Tue, Oct 11, 2016 at 2:56 PM, Anand Parthasarathy <
anpartha@avinetworks.com> wrote:

> Folks,
>
> Sending a quick note again to find out if there is any insight the
> community can offer in terms of a solution or workaround? We use zookeeper
> for service discovery in our product and this issue has surfaced in a large
> customer site a couple of times and we need to figure out a solution soon.
>
> Thanks,
> Anand.
>
> On Mon, Oct 10, 2016 at 10:15 AM, Anand Parthasarathy <
> anpartha@avinetworks.com> wrote:
>
> > Folks,
> >
> > Any insight into this or any workarounds that you can think of to
> > mitigate against this issue? We have isolated it to a test setup, where
> > we are able to reproduce this somewhat consistently if we keep a node
> > powered off.
> >
> > Thanks,
> > Anand.
> >
> > On Sat, Oct 8, 2016 at 10:05 AM, Anand Parthasarathy <
> > anpartha@avinetworks.com> wrote:
> >
> >> Hi Flavio,
> >>
> >> I have attached the logs from node 1 and node 3. Node 2 was powered off
> >> around 10-03 12:36. Leader election kept going until 10-03 15:57:16
> >> when it finally converged.
> >>
> >> Thanks,
> >> Anand.
> >>
> >> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <fpj@apache.org> wrote:
> >>
> >>> Hi Anand,
> >>>
> >>> I don't understand whether 1 and 3 were able or even trying to connect
> >>> to each other. They should be able to elect a leader between them and
> >>> make progress. You might want to upload logs and let us know.
> >>>
> >>> -Flavio
> >>>
> >>> > On 08 Oct 2016, at 02:11, Anand Parthasarathy
> >>> > <anpartha@avinetworks.com> wrote:
> >>> >
> >>> > Hi,
> >>> >
> >>> > We are currently using zookeeper 3.4.6 version and use a 3 node
> >>> > solution in our system. We see that occasionally, when a node is
> >>> > powered off (in this instance, it was actually a leader node), the
> >>> > remaining two nodes do not form a quorum for a really long time.
> >>> > Looking at the logs, it appears the sequence is as follows:
> >>> > - Node 2 is the zookeeper leader
> >>> > - Node 2 is powered off
> >>> > - Node 1 and Node 3 recognize and start the election
> >>> > - Node 3 times out after initLimit * tickTime with "Timeout while
> >>> >   waiting for quorum" for Round N
> >>> > - Node 1 times out after initLimit * tickTime with "Exception while
> >>> >   trying to follow leader" for Round N+1 at the same time.
> >>> > - And the process continues where N is sequentially incrementing.
> >>> > - This happens for a long time.
> >>> > - In one instance, we used tickTime=5000 and initLimit=20 and it took
> >>> > around 3.5 hours to converge.
> >>> > - In a given round, Node 1 tries connecting to Node 2, gets
> >>> >   connection refused, and waits for the notification timeout, which
> >>> >   increases by 2 every iteration until it hits the initLimit. The
> >>> >   connection is refused because node 2 comes up after reboot, but
> >>> >   the zookeeper process is not started (due to a different failure).
> >>> >
> >>> > It looks similar to ZOOKEEPER-2164, but there it is a connection
> >>> > timeout where Node 2 is not reachable.
> >>> >
> >>> > Could you please share whether you have seen this issue and, if so,
> >>> > what workaround can be employed in 3.4.6?
> >>> >
> >>> > Thanks,
> >>> > Anand.
> >>>
> >>>
> >>
> >
>



-- 
Cheers
Michael.
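
For readers following the numbers in the quoted report, here is a rough
sketch of the timing arithmetic, not a model of ZooKeeper's election code.
It takes tickTime=5000, initLimit=20 and the roughly 3.5 hour convergence
time from the report. The function names, the 200 ms starting wait, and the
reading of "increases by 2" as doubling are assumptions made for
illustration only. Under those assumptions each round times out after 100
seconds, so 3.5 hours corresponds to roughly 126 rounds if every round runs
to its full timeout.

```python
# Values quoted in the report; everything else here is illustrative.
TICK_TIME_MS = 5000
INIT_LIMIT = 20
OBSERVED_CONVERGENCE_HOURS = 3.5


def round_timeout_ms(tick_time_ms, init_limit):
    """Per-round timeout as described in the report: initLimit * tickTime."""
    return init_limit * tick_time_ms


def doubling_waits_ms(start_ms, cap_ms):
    """Wait times that double each iteration until they reach cap_ms."""
    waits = []
    wait = start_ms
    while wait < cap_ms:
        waits.append(wait)
        wait *= 2
    waits.append(cap_ms)  # the wait is clamped to the cap from here on
    return waits


timeout_ms = round_timeout_ms(TICK_TIME_MS, INIT_LIMIT)

# Number of rounds if every round ran to its full timeout (an assumption).
rounds = OBSERVED_CONVERGENCE_HOURS * 3600 * 1000 / timeout_ms

# The 200 ms starting wait is an assumed value, not taken from the report.
waits = doubling_waits_ms(200, timeout_ms)

print(f"per-round timeout: {timeout_ms / 1000:.0f} s")
print(f"rounds in {OBSERVED_CONVERGENCE_HOURS} h at full timeout: {rounds:.0f}")
print(f"doubling waits (ms): {waits}")
```

The point of the estimate is only that a 100 second per-round timeout makes
each failed round expensive, so a long run of failed rounds adds up to
hours.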
