zookeeper-user mailing list archives

From Andor Molnar <an...@cloudera.com.INVALID>
Subject Re: Leader election failing
Date Mon, 03 Sep 2018 14:55:49 GMT
Thanks for testing, Chris.

So, if I understand you correctly, you're running the latest version from
branch-3.5. Could we say that this is a 3.5-only problem?
Have you ever tested the same cluster with 3.4?

Regards,
Andor



On Tue, Aug 21, 2018 at 11:29 AM, Cee Tee <c.turksema@gmail.com> wrote:

> I've tested the patch and let it run for 6 days. It did not help; the
> result is still the same (the remaining ZKs form islands based on the
> datacenter they are in).
>
> I have mitigated it by doing a daily rolling restart.
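A rolling restart of this kind can be sketched as follows; the hostnames (zk1..zk5) and the service name "zookeeper" are assumptions for illustration, not details from this thread. The sketch prints the plan rather than executing it:

```shell
#!/bin/sh
# Dry-run sketch of a daily rolling restart: one server at a time, so a
# quorum of the remaining servers stays up throughout. Hostnames and the
# service name are hypothetical; commands are printed, not executed.
plan=""
for host in zk1 zk2 zk3 zk4 zk5; do
    # After each restart, the 'ruok' four-letter word should answer 'imok'
    # before moving on to the next server.
    plan="${plan}ssh $host 'systemctl restart zookeeper && echo ruok | nc localhost 2181'
"
done
printf '%s' "$plan"
```

Restarting one server at a time keeps a quorum of the remaining four alive, so clients stay connected throughout.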
>
> Regards,
> Chris
>
> On Mon, Aug 13, 2018 at 2:06 PM Andor Molnar <andor@cloudera.com.invalid>
> wrote:
>
> > Hi Chris,
> >
> > Would you mind testing the following patch on your test clusters?
> > I'm not entirely sure, but the issue might be related.
> >
> > https://issues.apache.org/jira/browse/ZOOKEEPER-2930
> >
> > Regards,
> > Andor
> >
> >
> >
> > On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <camille@apache.org>
> > wrote:
> >
> > > If you have the time and inclination, next time you see this problem in
> > > your test clusters, get stack traces and any other diagnostics possible
> > > before restarting. I'm not an expert at network debugging, but if you have
> > > someone who is, you might want them to take a look at the connections and
> > > settings of any switches/firewalls/etc. involved, and see if there are any
> > > unusual configurations or evidence of other long-lived connections failing
> > > (even if their services handle the failures more gracefully). Send us the
> > > stack traces too; it would be interesting to take a look.
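One way to capture such stack traces before restarting is with the JDK's jstack; this is a sketch, assuming jstack is on the PATH and the server was started via the usual QuorumPeerMain class:

```shell
#!/bin/sh
# Thread-dump a running ZooKeeper server before restarting it. The
# process-name pattern and output path are assumptions for illustration.
pid=$(pgrep -f QuorumPeerMain | head -n 1)
if [ -n "$pid" ]; then
    jstack "$pid" > "/tmp/zk-threads-$(date +%Y%m%d-%H%M%S).txt"
else
    echo "no QuorumPeerMain process found"
fi
```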
> > >
> > > C
> > >
> > >
> > > On Wed, Aug 8, 2018, 11:09 AM Chris <c.turksema@gmail.com> wrote:
> > >
> > > > Running 3.5.5
> > > >
> > > > I managed to recreate it on acc and test clusters today, failing on
> > > > shutdown of the leader. Both had been running for over a week. After
> > > > restarting all zookeepers it runs fine no matter how many leader
> > > > shutdowns I throw at it.
> > > >
> > > > On 8 August 2018 5:05:34 pm Andor Molnar <andor@cloudera.com.INVALID>
> > > > wrote:
> > > >
> > > > > Some kind of a network split?
> > > > >
> > > > > It looks like 1-2 and 3-4 were able to communicate with each other,
> > > > > but the connection timed out between these 2 splits. When 5 came back
> > > > > online it started with supporters of (1,2), and later 3 and 4 also
> > > > > joined.
> > > > >
> > > > > There was no such issue the day after.
> > > > >
> > > > > Which version of ZooKeeper is this? 3.5.something?
> > > > >
> > > > > Regards,
> > > > > Andor
> > > > >
> > > > >
> > > > >
> > > > > On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turksema@gmail.com> wrote:
> > > > >
> > > > >> Actually I have similar issues on my test and acceptance clusters,
> > > > >> where leader election fails if the cluster has been running for a
> > > > >> couple of days. If you stop/start the ZooKeepers once, they will work
> > > > >> fine on further disruptions that day. Not sure yet what the
> > > > >> threshold is.
> > > > >>
> > > > >>
> > > > >> On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org>
> > > > >> wrote:
> > > > >>
> > > > >>> Hard to say. It looks like about 15 minutes after your first
> > > > >>> incident where 5 goes down and then comes back up, servers 1 and 2
> > > > >>> get socket errors to their connections with 3, 4, and 6. It's
> > > > >>> possible if you had waited those 15 minutes, once those errors
> > > > >>> cleared the quorum would've formed with the other servers. But as
> > > > >>> for why there were those errors in the first place, it's not clear.
> > > > >>> Could be a network glitch, or an obscure bug in the connection
> > > > >>> logic. Has anyone else ever seen this?
> > > > >>> If you see it again, getting a stack trace of the servers when they
> > > > >>> can't form quorum might be helpful.
> > > > >>>
> > > > >>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com> wrote:
> > > > >>>
> > > > >>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> > > > >>>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> > > > >>>> Yesterday one of the participants (id 5, by chance the leader) was
> > > > >>>> rebooted. Although all other servers were online and not suffering
> > > > >>>> from networking issues, the leader election failed and the cluster
> > > > >>>> remained "looking" until the old leader came back online, after
> > > > >>>> which it was promptly elected as leader again.
> > > > >>>>
> > > > >>>> Today we tried the same exercise on the exact same servers; 5 was
> > > > >>>> still leader and was rebooted, and leader election worked fine with
> > > > >>>> 4 as the new leader.
> > > > >>>>
> > > > >>>> I have included the logs. From the logs I see that yesterday 1,2
> > > > >>>> never received new leader proposals from 3,4 and vice versa.
> > > > >>>> Today all proposals came through. This is not the first time we've
> > > > >>>> seen this type of behavior, where some zookeepers can't seem to
> > > > >>>> find each other after the leader goes down.
> > > > >>>> All servers use dynamic configuration and have the same config
> > > > >>>> node.
> > > > >>>>
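For reference, a ZooKeeper 3.5 dynamic configuration matching the layout described above (1,2,5 in datacenter A; 3,4 as participants and 6 as an observer in datacenter B) might look like the fragment below; the hostnames and ports are hypothetical, not taken from the thread:

```
server.1=zk-a1:2888:3888:participant;2181
server.2=zk-a2:2888:3888:participant;2181
server.3=zk-b1:2888:3888:participant;2181
server.4=zk-b2:2888:3888:participant;2181
server.5=zk-a3:2888:3888:participant;2181
server.6=zk-b3:2888:3888:observer;2181
```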
> > > > >>>> How could this be explained? These servers also host a replicated
> > > > >>>> database cluster and have no history of db replication issues.
> > > > >>>>
> > > > >>>> Thanks,
> > > > >>>> Chris
