zookeeper-user mailing list archives

From Chris <c.turks...@gmail.com>
Subject Re: Leader election failing
Date Wed, 08 Aug 2018 15:09:43 GMT
Running 3.5.5

I managed to reproduce it on both the acceptance and test clusters today;
in each case leader election failed when the leader was shut down. Both
clusters had been running for over a week. After restarting all ZooKeeper
servers, election works fine no matter how many leader shutdowns I throw at it.
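
For anyone following along, here is a minimal sketch of the majority
arithmetic behind the split suggested in the quoted reply below. It assumes
the default majority quorum over the 5 participants (the observer, id 6, does
not vote) and that 1-2 and 3-4 could each only reach their own side while 5
was down. The function name can_form_quorum and the group memberships are
illustrative assumptions, not something taken from the logs.

```python
def can_form_quorum(voters, reachable_group):
    """Return True if the voting members in reachable_group are a strict majority."""
    votes = len(set(voters) & set(reachable_group))
    return votes > len(voters) // 2


# Participants 1-5 vote; observer 6 is deliberately left out of the voter set.
voters = {1, 2, 3, 4, 5}

# Assumed situation while server 5 (the old leader) was down.
groups = [{1, 2}, {3, 4, 6}]
for group in groups:
    print(sorted(group), can_form_quorum(voters, group))

# Assumed situation once server 5 is back and can reach 1 and 2.
print([1, 2, 5], can_form_quorum(voters, {1, 2, 5}))
```

Under those assumptions neither side reaches 3 of 5 votes on its own, which
would match the ensemble staying in "looking" until 5 came back.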

On 8 August 2018 5:05:34 pm Andor Molnar <andor@cloudera.com.INVALID> wrote:

> Some kind of a network split?
>
> It looks like 1-2 and 3-4 were able to communicate with each other, but
> connections timed out between these 2 groups. When 5 came back online it
> started with the support of (1,2), and later 3 and 4 also joined.
>
> There was no such issue the day after.
>
> Which version of ZooKeeper is this? 3.5.something?
>
> Regards,
> Andor
>
>
>
> On Wed, Aug 8, 2018 at 4:52 PM, Chris <c.turksema@gmail.com> wrote:
>
>> Actually I have similar issues on my test and acceptance clusters, where
>> leader election fails if the cluster has been running for a couple of days.
>> If you stop/start the ZooKeepers once, they will work fine on further
>> disruptions that day. Not sure yet what the threshold is.
>>
>>
>> On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org> wrote:
>>
>>> Hard to say. It looks like about 15 minutes after your first incident where
>>> 5 goes down and then comes back up, servers 1 and 2 get socket errors to
>>> their connections with 3, 4, and 6. It's possible if you had waited those
>>> 15 minutes, once those errors cleared the quorum would've formed with the
>>> other servers. But as for why there were those errors in the first place
>>> it's not clear. Could be a network glitch, or an obscure bug in the
>>> connection logic. Has anyone else ever seen this?
>>> If you see it again, getting a stack trace of the servers when they can't
>>> form quorum might be helpful.
>>>
>>> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com> wrote:
>>>
>>>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
>>>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
>>>> Yesterday one of the participants (id 5, which happened to be the leader)
>>>> was rebooted. Although all other servers were online and not suffering
>>>> from networking issues, the leader election failed and the cluster
>>>> remained "looking" until the old leader came back online, after which it
>>>> was promptly elected as leader again.
>>>>
>>>> Today we tried the same exercise on the exact same servers, 5 was still
>>>> leader and was rebooted, and leader election worked fine with 4 as new
>>>> leader.
>>>>
>>>> I have included the logs. From the logs I see that yesterday 1,2 never
>>>> received new leader proposals from 3,4 and vice versa.
>>>> Today all proposals came through. This is not the first time we've seen
>>>> this type of behavior, where some ZooKeepers can't seem to find each
>>>> other after the leader goes down.
>>>> All servers use dynamic configuration and have the same config node.
>>>>
>>>> How could this be explained? These servers also host a replicated
>>>> database cluster and have no history of db replication issues.
>>>>
>>>> Thanks,
>>>> Chris
>>>>
>>>>
>>>>
>>>>
>>
>>
>>



