From Cee Tee <c.turks...@gmail.com>
Subject Leader election failing
Date Wed, 08 Aug 2018 11:43:24 GMT
I have a cluster of 5 participants (id 1-5) and 1 observer (id 6).
1,2,5 are in datacenter A. 3,4,6 are in datacenter B.
Yesterday one of the participants (id 5, which happened to be the leader)
was rebooted. Although all other servers were online and had no networking
issues, leader election failed and the cluster remained in the "looking"
state until the old leader came back online, after which it was promptly
elected leader again.

Today we tried the same exercise on the exact same servers: 5 was still
leader and was rebooted, and leader election worked fine with 4 as new
leader.

I have included the logs. From the logs I see that yesterday 1,2 never
received new leader proposals from 3,4 and vice versa.
Today all proposals came through. This is not the first time we've seen
this type of behavior, where some zookeepers can't seem to find each other
after the leader goes down.
All servers use dynamic configuration and have the same config node.
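
To make the symptom concrete, here is a small sketch of the vote arithmetic.
It assumes the default majority quorum (no hierarchical groups or weights)
and that the observer does not vote. The helper name has_quorum and the
grouping of servers are only my illustration of what the logs appear to
show, not anything taken from ZooKeeper itself.

```python
def has_quorum(group, voters):
    """True if `group` contains a strict majority of the voting members."""
    return len(set(group) & set(voters)) > len(voters) // 2


voters = {1, 2, 3, 4, 5}  # participants only; observer 6 has no vote

# Yesterday: with 5 down, 1,2 and 3,4 never saw each other's proposals.
yesterday = [{1, 2}, {3, 4}]
# Today: with 5 down, all four remaining participants exchanged proposals.
today = [{1, 2, 3, 4}]

for label, groups in (("yesterday", yesterday), ("today", today)):
    for group in groups:
        status = "quorum" if has_quorum(group, voters) else "no quorum"
        print(label, sorted(group), status)
```

With 5 voters a candidate needs 3 votes, so two isolated pairs can never
elect anyone, while either pair plus the returning server 5 can. That
would match the cluster staying in "looking" until 5 came back, but it does
not explain why the proposals between the pairs were lost in the first place.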

How could this be explained? These servers also host a replicated database
cluster and have no history of db replication issues.


