zookeeper-user mailing list archives

From Kathryn Hogg <Kathryn.H...@oati.net>
Subject RE: About ZooKeeper Dynamic Reconfiguration
Date Wed, 21 Aug 2019 18:34:35 GMT
At my organization we solve that by running a third site, as mentioned in another email. We run
a 5-node ensemble with 2 nodes in each primary data center and 1 node in the co-location facility.
 We try to minimize usage of the 5th node, so we explicitly exclude it from our clients' connection strings.

This way, if there is a network partition between datacenters, whichever one can still talk
to the node at the 3rd datacenter will maintain quorum.
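
The quorum arithmetic behind this layout can be checked with a short sketch. It assumes plain majority quorums and uses hypothetical site names (dc_a, dc_b, colo) for the 2 + 2 + 1 layout described above; it does not talk to ZooKeeper at all.

```python
# Hypothetical site names; the voter counts follow the 2 + 2 + 1 layout above.
SITES = {"dc_a": 2, "dc_b": 2, "colo": 1}


def has_quorum(reachable_voters: int, total_voters: int) -> bool:
    """Majority quorum: strictly more than half of all voting members."""
    return reachable_voters > total_voters // 2


def surviving_sides(partition):
    """For each side of a network partition, report whether it keeps quorum."""
    total = sum(SITES.values())
    return [has_quorum(sum(SITES[site] for site in side), total)
            for side in partition]


# dc_a can still reach the co-location node, dc_b is cut off.
print(surviving_sides([{"dc_a", "colo"}, {"dc_b"}]))    # [True, False]
# All three sites are isolated from each other: nobody has a majority.
print(surviving_sides([{"dc_a"}, {"dc_b"}, {"colo"}]))  # [False, False, False]
```

With only two sites and a 3 + 2 split, the side holding 3 voters would always win, which is why the single tie-breaker node sits at a third location.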

Ideally, if it were possible, we'd like the node at the third datacenter to never be
elected as the leader, and even better would be some way for it to be a voting member only
and not bear any data (similar to MongoDB's arbiter).

-----Original Message-----
From: Cee Tee [mailto:c.turksema@gmail.com] 
Sent: Wednesday, August 21, 2019 1:27 PM
To: Alexander Shraer <shralex@gmail.com>
Cc: user@zookeeper.apache.org
Subject: Re: About ZooKeeper Dynamic Reconfiguration

Yes, one side loses quorum and the other remains active. However, we actively control which
side that is, because our main application is active/passive with 2 datacenters. We need ZooKeeper
to remain active in the application's active datacenter.
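
The planning step of a balancer like the one described in the quoted message below might look roughly like this. It is only a sketch under stated assumptions, not the actual tool: plan_roles and render_config are hypothetical names, liveness is passed in as a plain set instead of being read from each server's admin server, and submitting the result as a reconfiguration is left out entirely. The rendered lines simply mirror the dynamic configuration file posted at the start of the thread.

```python
from typing import Dict, List, Set


def plan_roles(roles: Dict[int, str], alive: Set[int]) -> Dict[int, str]:
    """Return the desired role for every server id (hypothetical helper).

    Dead servers are planned as observers and live servers as participants.
    If the live participants are not a majority of the current participants,
    the plan is left unchanged, because a reconfiguration cannot be committed
    without a quorum of the current configuration.
    """
    participants = {sid for sid, role in roles.items() if role == "participant"}
    if len(participants & alive) <= len(participants) // 2:
        return dict(roles)
    return {sid: ("participant" if sid in alive else "observer") for sid in roles}


def render_config(roles: Dict[int, str]) -> List[str]:
    """Render server lines in the shape of the file posted in this thread.

    Host names zkN and ports 2888/3888 are taken from that file.
    """
    return [f"server.{sid}=zk{sid}:2888:3888:{role};"
            for sid, role in sorted(roles.items())]


# Servers 4 and 5 have died; 1, 2 and 3 still form a majority of 5.
roles = {sid: "participant" for sid in range(1, 6)}
planned = plan_roles(roles, alive={1, 2, 3})
for line in render_config(planned):
    print(line)
```

In this run servers 4 and 5 come out as observers, so the remaining three participants can again tolerate one more failure. Once 4 and 5 are back, calling plan_roles with all five ids alive turns them into participants again.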

On 21 August 2019 17:22:00 Alexander Shraer <shralex@gmail.com> wrote:
> That's great! Thanks for sharing.
>> Added benefit is that we can also control which data center gets the 
>> quorum in case of a network outage between the two.
> Can you explain how this works? In case of a network outage between 
> two DCs, one of them has a quorum of participants and the other doesn't.
> The participants in the smaller set should not be operational at this 
> time, since they can't get quorum, no?
> Thanks,
> Alex
> On Wed, Aug 21, 2019 at 7:55 AM Cee Tee <c.turksema@gmail.com> wrote:
> We have solved this by implementing a 'ZooKeeper cluster balancer'. It 
> calls the admin server API of each ZooKeeper server to get the current 
> status and issues dynamic reconfigure commands to change dead servers 
> into observers so the quorum is not in danger. Once the dead servers 
> reconnect, they take the observer role and are then reconfigured into participants again.
> Added benefit is that we can also control which data center gets the 
> quorum in case of a network outage between the two.
> Regards
> Chris
> On 21 August 2019 16:42:37 Alexander Shraer <shralex@gmail.com> wrote:
>> Hi,
>> Reconfiguration, as implemented, is not automatic. In your case, when 
>> failures happen, this doesn't change the ensemble membership.
>> When 2 of 5 fail, this is still a minority, so everything should work 
>> normally, you just won't be able to handle an additional failure. If 
>> you'd like to remove them from the ensemble, you need to issue an 
>> explicit reconfiguration command to do so.
>> Please see details in the manual:
>> https://zookeeper.apache.org/doc/r3.5.5/zookeeperReconfig.html
>> Alex
>> On Wed, Aug 21, 2019 at 7:29 AM Gao,Wei <Wei.Gao@arcserve.com> wrote:
>>> Hi
>>>    I have encountered a problem that blocks my development of load 
>>> balancing using ZooKeeper 3.5.5.
>>>    I have a ZooKeeper cluster that comprises five zk servers, and 
>>> the dynamic configuration file is as follows:
>>>   server.1=zk1:2888:3888:participant;
>>>   server.2=zk2:2888:3888:participant;
>>>   server.3=zk3:2888:3888:participant;
>>>   server.4=zk4:2888:3888:participant;
>>>   server.5=zk5:2888:3888:participant;
>>>   The zk cluster works fine if every member works normally. 
>>> However, if, say, two of them suddenly go down without prior 
>>> notice, the dynamic configuration file shown above is not updated 
>>> automatically, which causes the zk cluster to fail to work normally.
>>>   I think this is a very common case which may happen at any time. 
>>> If so, how can we resolve it?
>>>   Really look forward to hearing from you!
>>> Thanks
