From Chris <c.turks...@gmail.com>
Subject Re: Leader election failing
Date Wed, 08 Aug 2018 14:52:03 GMT
Actually I have similar issues on my test and acceptance clusters, where
leader election fails if the cluster has been running for a couple of days.
If you stop and start the ZooKeeper servers once, they work fine on further
disruptions that day. Not sure yet what the threshold is.

On 8 August 2018 4:32:56 pm Camille Fournier <camille@apache.org> wrote:

> Hard to say. It looks like about 15 minutes after your first incident where
> 5 goes down and then comes back up, servers 1 and 2 get socket errors to
> their connections with 3, 4, and 6. It's possible if you had waited those
> 15 minutes, once those errors cleared the quorum would've formed with the
> other servers. But as for why there were those errors in the first place
> it's not clear. Could be a network glitch, or an obscure bug in the
> connection logic. Has anyone else ever seen this?
> If you see it again, getting a stack trace of the servers when they can't
> form quorum might be helpful.
>
> On Wed, Aug 8, 2018 at 11:52 AM Cee Tee <c.turksema@gmail.com> wrote:
>
>> I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
>> 1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
>> Yesterday one of the participants (id 5, which happened to be the leader)
>> was rebooted. Although all other servers were online and not suffering from
>> networking issues, the leader election failed and the cluster remained
>> "looking" until the old leader came back online, after which it was promptly
>> elected as leader again.
>> Today we tried the same exercise on the exact same servers: 5 was still
>> leader and was rebooted, and leader election worked fine with 4 as the new
>> leader.
>> I have included the logs. From the logs I see that yesterday 1,2 never
>> received new leader proposals from 3,4 and vice versa.
>> Today all proposals came through. This is not the first time we've seen
>> this type of behavior, where some zookeepers can't seem to find each other
>> after the leader goes down.
>> All servers use dynamic configuration and have the same config node.
>> How could this be explained? These servers also host a replicated database
>> cluster and have no history of db replication issues.
>> Thanks,
>> Chris
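
For reference, a minimal sketch of the quorum arithmetic behind the symptom described above, assuming the default majority quorum (more than half of the voting participants, with observers not voting). The server ids and datacenters are the ones from the thread; the function names are made up for illustration and are not part of ZooKeeper.

```python
# Voting participants from the thread: server id -> datacenter.
# Server 6 is an observer and has no vote, so it is not listed here.
PARTICIPANTS = {1: "A", 2: "A", 5: "A", 3: "B", 4: "B"}


def quorum_size(n_voters):
    """Smallest strict majority of n_voters."""
    return n_voters // 2 + 1


def can_form_quorum(reachable_group):
    """True if the servers that can reach each other hold a majority of votes."""
    voters = [sid for sid in reachable_group if sid in PARTICIPANTS]
    return len(voters) >= quorum_size(len(PARTICIPANTS))


# Server 5 is down. If 1,2 and 3,4 do not receive each other's votes,
# each side only sees two voters, which is below the majority of three.
side_a = {1, 2}
side_b = {3, 4, 6}
everyone_but_5 = {1, 2, 3, 4, 6}

print(quorum_size(len(PARTICIPANTS)))
print(can_form_quorum(side_a), can_form_quorum(side_b))
print(can_form_quorum(everyone_but_5))
```

With 5 down, neither datacenter can elect a leader on its own, so the election depends on 1,2 and 3,4 exchanging votes across datacenters. That matches the behaviour reported: the cluster stayed "looking" until 5 returned and datacenter A had three voters again.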

