zookeeper-user mailing list archives

From Chris <c.turks...@gmail.com>
Subject Re: Leader election failing
Date Mon, 13 Aug 2018 12:14:00 GMT
Interesting, i will have a look at it.
Thanks
Chris

On 13 August 2018 2:06:55 pm Andor Molnar <andor@cloudera.com.INVALID> wrote:

> Hi Chris,
>
> Would you mind testing the following patch on your test clusters?
> I'm not entirely sure, but the issue might be related.
>
> https://issues.apache.org/jira/browse/ZOOKEEPER-2930
>
> Regards,
> Andor
>
>
>
> On Wed, Aug 8, 2018 at 6:51 PM, Camille Fournier <camille@apache.org> wrote:
>
>> If you have the time and inclination, next time you see this problem in
>> your test clusters get stack traces and any other diagnostics possible
>> before restarting. I'm not an expert at network debugging but if you have
>> someone who is you might want them to take a look at the connections and
>> settings of any switches/firewalls/etc involved, see if there's any unusual
>> configurations or evidence of other long-lived connections failing (even if
>> their services handle the failures more gracefully). Send us the stack
>> traces too; it would be interesting to take a look.
>>
>> C
>>
>>
>> On Wed, Aug 8, 2018, 11:09 AM Chris <c.turksema@gmail.com> wrote:
>>
>> > Running 3.5.5
>> >
>> > I managed to recreate it on acc and test cluster today, failing on shutdown
>> > of leader. Both had been running for over a week. After restarting all
>> > zookeepers it runs fine no matter how many leader shutdowns i throw at it.
>> >
>> > On 8 August 2018 5:05:34 pm Andor Molnar <andor@cloudera.com.INVALID>
>> > wrote:
>> >
>> > > Some kind of a network split?
>> > >
>> > > It looks like 1-2 and 3-4 were able to communicate with each other, but
>> > > connection timed out between these 2 splits. When 5 came back online it
>> > > started with supporters of (1,2) and later 3 and 4 also joined.
>> > >
>> > > There was no such issue the day after.
>> > >
>> > > Which version of ZooKeeper is this? 3.5.something?
>> > >
>> > > Regards,
>> > > Andor
>> > >
>> > >
>> > >
>> > > On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turksema@gmail.com> wrote:
>> > >
>> > >> Actually i have similar issues on my test and acceptance clusters where
>> > >> leader election fails if the cluster has been running for a couple of days.
>> > >> If you stop/start the Zookeepers once they will work fine on further
>> > >> disruptions that day. Not sure yet what the threshold is.
>> > >>
>> > >>
>> > >> On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org> wrote:
>> > >>
>> > >>> Hard to say. It looks like about 15 minutes after your first incident where
>> > >>> 5 goes down and then comes back up, servers 1 and 2 get socket errors to
>> > >>> their connections with 3, 4, and 6. It's possible if you had waited those
>> > >>> 15 minutes, once those errors cleared the quorum would've formed with the
>> > >>> other servers. But as for why there were those errors in the first place
>> > >>> it's not clear. Could be a network glitch, or an obscure bug in the
>> > >>> connection logic. Has anyone else ever seen this?
>> > >>> If you see it again, getting a stack trace of the servers when they can't
>> > >>> form quorum might be helpful.
>> > >>>
>> > >>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com> wrote:
>> > >>>
>> > >>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
>> > >>>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
>> > >>>> Yesterday one of the participants (id5, by chance was the leader) was
>> > >>>> rebooted. Although all other servers were online and not suffering from
>> > >>>> networking issues the leader election failed and the cluster remained
>> > >>>> "looking" until the old leader came back online after which it was
>> > >>>> promptly elected as leader again.
>> > >>>>
>> > >>>> Today we tried the same exercise on the exact same servers, 5 was still
>> > >>>> leader and was rebooted, and leader election worked fine with 4 as new
>> > >>>> leader.
>> > >>>>
>> > >>>> I have included the logs.  From the logs i see that yesterday 1,2 never
>> > >>>> received new leader proposals from 3,4 and vice versa.
>> > >>>> Today all proposals came through. This is not the first time we've seen
>> > >>>> this type of behavior, where some zookeepers can't seem to find each
>> > >>>> other after the leader goes down.
>> > >>>> All servers use dynamic configuration and have the same config node.
>> > >>>>
>> > >>>> How could this be explained? These servers also host a replicated
>> > >>>> database cluster and have no history of db replication issues.
>> > >>>>
>> > >>>> Thanks,
>> > >>>> Chris
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>>>
>> > >>
>> > >>
>> > >>
>> >
>> >
>> >
>> >
>>
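
A quick majority count shows why the network-split reading suggested in the thread would leave the ensemble stuck in "looking". This is only a minimal sketch, assuming the default majority quorum (no weights or hierarchical groups) and that observer 6 never votes. The helper `has_quorum` and the partition groupings are made up for illustration; they are not ZooKeeper code and are not taken from the attached logs.

```python
def has_quorum(group, voters):
    """True if the voting members present in `group` are a strict majority of `voters`."""
    return len(set(group) & set(voters)) > len(voters) // 2


# Participants 1-5 vote; observer 6 is deliberately left out of the voter set.
VOTERS = {1, 2, 3, 4, 5}

# Hypothetical connectivity groups for the situations described in the thread.
scenarios = {
    "5 down, (1,2) cut off from (3,4)": [{1, 2}, {3, 4, 6}],
    "5 back and reachable from (1,2)": [{1, 2, 5}, {3, 4, 6}],
    "5 down, no split": [{1, 2, 3, 4, 6}],
}

for label, groups in scenarios.items():
    # One boolean per connectivity group: can that group elect a leader on its own?
    print(label, [has_quorum(g, VOTERS) for g in groups])
```

Under these assumptions neither half can reach the three votes needed while 5 is down and the halves cannot talk to each other, which would be consistent with the election only completing once 5 returned. It does not explain why the connections between the two halves failed in the first place.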



