Yes, we had noted in the past that if node 2 comes back up, the cluster
converges. In this case, zookeeper could not be started on node 2, and it
took 3 hours to converge.
Thanks,
Anand.
On Sat, Oct 8, 2016 at 9:50 AM, Ben Sherman <bensherman@gmail.com> wrote:
> I had this exact problem but couldn't wait for them to reestablish
> connection; when node 2 came back up, everything recovered. It was with
> 3.4.8 - I posted logs in a previous message.
>
> On Sat, Oct 8, 2016 at 7:55 AM, Flavio Junqueira <fpj@apache.org> wrote:
>
> > Hi Anand,
> >
> > I don't understand whether 1 and 3 were able or even trying to connect to
> > each other. They should be able to elect a leader between them and make
> > progress. You might want to upload logs and let us know.
> >
> > -Flavio
> >
> > > On 08 Oct 2016, at 02:11, Anand Parthasarathy <anpartha@avinetworks.com> wrote:
> > >
> > > Hi,
> > >
> > > > We are currently using ZooKeeper version 3.4.6 and use a 3-node solution
> > > > in our system. We see that occasionally, when a node is powered off (in
> > > > this instance, it was actually the leader node), the remaining two nodes
> > > > do not form a quorum for a really long time. Looking at the logs, it
> > > > appears the sequence is as follows:
> > > > - Node 2 is the zookeeper leader
> > > > - Node 2 is powered off
> > > > - Node 1 and Node 3 recognize this and start the election
> > > > - Node 3 times out after initLimit * tickTime with "Timeout while waiting
> > > >   for quorum" for Round N
> > > > - Node 1 times out after initLimit * tickTime with "Exception while
> > > >   trying to follow leader" for Round N+1 at the same time.
> > > > - And the process continues, with N incrementing sequentially.
> > > > - This happens for a long time.
> > > > - In one instance, we used tickTime=5000 and initLimit=20 and it took
> > > >   around 3.5 hours to converge.
> > > > - In a given round, Node 1 tries connecting to Node 2, gets connection
> > > >   refused, and waits for the notification timeout, which increases by 2
> > > >   every iteration until it hits the initLimit. The connection is refused
> > > >   because node 2 comes up after the reboot, but the zookeeper process is
> > > >   not started (due to a different failure).
> > >
> > > It looks similar to ZOOKEEPER-2164 but there it is a connection timeout
> > > where Node 2 is not reachable.
> > >
> > > > Could you please share whether you have seen this issue and, if so,
> > > > what workaround can be employed in 3.4.6?
> > >
> > > Thanks,
> > > Anand.
> >
> >
>
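
For a rough sense of scale, the figures in the original report
(tickTime=5000, initLimit=20, about 3.5 hours to converge) can be turned
into a per-round timeout and an approximate count of failed rounds. This is
only a back-of-the-envelope sketch: the function names are made up for
illustration, and it assumes every round burns exactly one full
initLimit * tickTime timeout, ignoring the notification waits in between,
so the round count is an upper bound rather than a measurement.

```python
# Figures taken from the report in this thread; the helper names below are
# hypothetical and exist only for this illustration.
TICK_TIME_MS = 5000    # tickTime, in milliseconds
INIT_LIMIT = 20        # initLimit, in ticks
OBSERVED_HOURS = 3.5   # approximate time the ensemble took to converge


def round_timeout_seconds(tick_time_ms, init_limit):
    """Timeout after which a round gives up: initLimit * tickTime."""
    return init_limit * tick_time_ms / 1000


def max_failed_rounds(hours, timeout_s):
    """Upper bound on rounds, assuming each one costs a full timeout."""
    return int(hours * 3600 // timeout_s)


timeout_s = round_timeout_seconds(TICK_TIME_MS, INIT_LIMIT)
rounds = max_failed_rounds(OBSERVED_HOURS, timeout_s)

print(f"per-round timeout: {timeout_s:.0f} s")
print(f"at most about {rounds} failed rounds in {OBSERVED_HOURS} hours")
```

With these settings each failed round costs 100 seconds, so a smaller
initLimit or tickTime would shorten each round, though whether that shortens
the overall convergence time depends on why the rounds keep failing.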