zookeeper-user mailing list archives

From Camille Fournier <cami...@apache.org>
Subject Re: Leader election failing
Date Wed, 08 Aug 2018 16:51:58 GMT
If you have the time and inclination, the next time you see this problem in
your test clusters, capture stack traces and any other diagnostics you can
before restarting. I'm not an expert at network debugging, but if you have
someone who is, you might want them to look at the connections and the
settings of any switches, firewalls, etc. involved, and see whether there are
any unusual configurations or evidence of other long-lived connections
failing (even if their services handle the failures more gracefully). Send us
the stack traces as well; it would be interesting to take a look.

C


On Wed, Aug 8, 2018, 11:09 AM Chris <c.turksema@gmail.com> wrote:

> Running 3.5.5
>
> I managed to recreate it on the acceptance and test clusters today, failing
> on shutdown of the leader. Both had been running for over a week. After
> restarting all ZooKeepers it runs fine no matter how many leader shutdowns
> I throw at it.
>
> On 8 August 2018 5:05:34 pm Andor Molnar <andor@cloudera.com.INVALID>
> wrote:
>
> > Some kind of a network split?
> >
> > It looks like 1-2 and 3-4 were able to communicate with each other, but
> > connections timed out between these two groups. When 5 came back online
> > it started with supporters of (1,2) and later 3 and 4 also joined.
> >
> > There was no such issue the day after.
> >
> > Which version of ZooKeeper is this? 3.5.something?
> >
> > Regards,
> > Andor
> >
> >
> >
> > On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turksema@gmail.com> wrote:
> >
> >> Actually I have similar issues on my test and acceptance clusters, where
> >> leader election fails if the cluster has been running for a couple of
> >> days. If you stop/start the ZooKeepers once, they will work fine on
> >> further disruptions that day. Not sure yet what the threshold is.
> >>
> >>
> >> On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org>
> wrote:
> >>
> >>> Hard to say. It looks like about 15 minutes after your first incident,
> >>> where 5 goes down and then comes back up, servers 1 and 2 get socket
> >>> errors on their connections with 3, 4, and 6. It's possible that if you
> >>> had waited those 15 minutes, the quorum would have formed with the other
> >>> servers once those errors cleared. But as for why there were those
> >>> errors in the first place, it's not clear. Could be a network glitch, or
> >>> an obscure bug in the connection logic. Has anyone else ever seen this?
> >>> If you see it again, getting a stack trace of the servers when they
> >>> can't form quorum might be helpful.
> >>>
> >>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com> wrote:
> >>>
> >>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
> >>>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
> >>>> Yesterday one of the participants (id 5, which by chance was the
> >>>> leader) was rebooted. Although all other servers were online and not
> >>>> suffering from networking issues, the leader election failed and the
> >>>> cluster remained "looking" until the old leader came back online, after
> >>>> which it was promptly elected as leader again.
> >>>>
> >>>> Today we tried the same exercise on the exact same servers, 5 was
> >>>> still leader and was rebooted, and leader election worked fine with 4
> >>>> as new leader.
> >>>>
> >>>> I have included the logs. From the logs I see that yesterday 1,2
> >>>> never received new leader proposals from 3,4 and vice versa.
> >>>> Today all proposals came through. This is not the first time we've
> >>>> seen this type of behavior, where some ZooKeepers can't seem to find
> >>>> each other after the leader goes down.
> >>>> All servers use dynamic configuration and have the same config node.
> >>>>
> >>>> How could this be explained? These servers also host a replicated
> >>>> database cluster and have no history of db replication issues.
> >>>>
> >>>> Thanks,
> >>>> Chris
> >>>>
> >>>>
> >>>>
> >>>>
> >>
> >>
> >>
>
>
>
>
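
The network-split hypothesis raised in the thread can be checked with simple
quorum arithmetic. The sketch below is illustrative only and assumes the
cluster uses ZooKeeper's default majority quorums (no hierarchical groups or
weights); the function names are hypothetical and not part of any ZooKeeper
API. With server 5 down and 1-2 unable to reach 3-4, neither side holds 3 of
the 5 votes, which would be consistent with the cluster staying in "looking"
until 5 returned.

```python
def majority(n_voters: int) -> int:
    """Smallest number of voters that forms a strict majority."""
    return n_voters // 2 + 1


def can_elect(group, voters) -> bool:
    """True if the mutually reachable servers in `group` include a majority
    of the voting participants. Observers in `group` are ignored."""
    return len(set(group) & set(voters)) >= majority(len(voters))


# Participants from the original report; server 6 is an observer, no vote.
voters = {1, 2, 3, 4, 5}

# Assumed partitions: datacenter A side, datacenter B side, and the two
# situations after connectivity or server 5 is restored.
for group in ({1, 2}, {3, 4, 6}, {1, 2, 5}, {1, 2, 3, 4}):
    print(sorted(group), "can elect a leader:", can_elect(group, voters))
```

Under these assumptions, (1,2) and (3,4,6) each fall short of the 3 votes
needed, while (1,2,5) or a healed (1,2,3,4) would be enough. This says nothing
about why the connections between the two sides failed.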
