zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Question on maintaining leader/membership status in zookeeper
Date Fri, 30 Apr 2010 23:46:54 GMT
I believe Lei's concern is that the leader and all slaves can talk to 
ZK, but the slaves cannot talk to the leader. As a result no work can be 
done. However nothing will happen on the ZK side since everyone is 
heartbeating properly.

Mahadev I think you came up with a pretty good solution. However since 
the leader can see the votes from all the slaves it might just want to 
give up the lead itself and "pause" for a while (to give someone else 
the chance to be the leader). This would allow the leader to handle the 
case better where a single slave cannot talk to the leader, but the rest 
of the slaves can communicate fine.
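
A minimal sketch of the step-down idea above, under stated assumptions: the votes are modeled as a plain dict rather than read from ZooKeeper, the "strict majority" threshold and the 30-second pause are illustrative choices not taken from this thread, and the names `should_step_down` and `LeaderCandidate` are hypothetical.

```python
import time


def should_step_down(votes, total_slaves):
    """Return True when a strict majority of slaves report the leader unreachable.

    votes maps slave id -> True if that slave cannot reach the leader.
    The majority threshold is an assumption made for this sketch.
    """
    unreachable = sum(1 for cannot_reach in votes.values() if cannot_reach)
    return unreachable * 2 > total_slaves


class LeaderCandidate:
    """Tracks leadership and the pause taken after voluntarily stepping down."""

    def __init__(self, pause_seconds=30.0, clock=time.monotonic):
        self.pause_seconds = pause_seconds
        self.clock = clock          # injectable so the pause can be tested
        self.is_leader = False
        self.paused_until = None    # None means no pause is in effect

    def may_run_for_election(self):
        # A node that just gave up the lead sits out until the pause expires.
        return self.paused_until is None or self.clock() >= self.paused_until

    def on_votes(self, votes, total_slaves):
        """Call whenever the vote data changes; returns True if we stepped down."""
        if self.is_leader and should_step_down(votes, total_slaves):
            self.is_leader = False
            self.paused_until = self.clock() + self.pause_seconds
            return True
        return False
```

With a majority threshold, a single slave that cannot reach the leader does not trigger a step-down, which is the case the paragraph above wants to handle. In a real deployment, stepping down would also mean releasing whatever znode marks this node as leader.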


On 04/30/2010 04:31 PM, Mahadev Konar wrote:
> Maybe I jumped the gun here but Ted's response to your query is more
> appropriate -
> ------------
> You can then use ZK in your application to pick a lead machine for other
> operations.  In that case, essentially every failure scenario is handled by
> the standard recipe.  In your example where the master and slave are cut
> off, but both still have access to ZK, all that will happen is that the
> master cannot communicate with the slave.  Both will still be clear about
> who is in which role.
> The case where the master is cut off from both ZK and the slave is also
> handled well as is the case where the master is cut off from ZK, but not
> from the slave.  In both cases, the master will get a connection loss event
> and stop trying to act like a master and the slave will be notified that the
> master has dropped out of its role.
> --------------------------
> On 4/30/10 4:14 PM, "Mahadev Konar"<mahadev@yahoo-inc.com>  wrote:
>> Hi Lei,
>>   Sorry, I misinterpreted your question! The scenario you describe could be
>> handled this way:
>> You could have a status node in ZooKeeper which every slave subscribes
>> to and updates. If one of the slave nodes sees that too many slaves have had
>> connections to the Leader refused, that slave could go ahead and
>> delete the Leader znode, forcing the Leader to give up its leadership. I
>> am not describing a detailed way to do it, but it's not very hard to come up
>> with a design for this.
>> Do you intend to have the Leader and Slaves in different network (different
>> ACLs, I mean) protected zones? In that case it is a legitimate concern; otherwise
>> I think an asymmetric network partition would be very unlikely to happen.
>> Do you usually see network partitions in such scenarios?
>> Thanks
>> mahadev
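The status-node design in the message above could be sketched as follows. This is only an illustration under assumptions: the JSON report format, the 60-second freshness window, and the function names are all hypothetical, each slave is assumed to keep a single latest report, and the reports are passed in as byte strings rather than read from ZooKeeper.

```python
import json


def encode_report(slave_id, leader_reachable, now):
    """Build the bytes a slave would store in its status entry (hypothetical format)."""
    report = {"slave": slave_id, "leader_reachable": leader_reachable, "ts": now}
    return json.dumps(report).encode("utf-8")


def count_recent_failures(reports, now, window_seconds=60.0):
    """Count distinct slaves whose recent report says the leader is unreachable."""
    failed = set()
    for raw in reports:
        report = json.loads(raw.decode("utf-8"))
        is_recent = now - report["ts"] <= window_seconds
        if is_recent and not report["leader_reachable"]:
            failed.add(report["slave"])
    return len(failed)


def should_delete_leader_znode(reports, now, threshold, window_seconds=60.0):
    """Decide whether enough slaves are cut off to force the leader out."""
    return count_recent_failures(reports, now, window_seconds) >= threshold
```

Ignoring stale reports keeps an old failure from forcing a new election long after the link has recovered. What counts as "too many" is left as the `threshold` parameter, since the thread does not fix a number.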
>> On 4/30/10 4:05 PM, "Lei Gao"<lgao@linkedin.com>  wrote:
>>> Hi Mahadev,
>>> Why would the leader be disconnected from ZK? ZK is fine communicating with
>>> the leader in this case. We are talking about asymmetric network failure.
>>> Yes, the leader could consider all the slaves to be down if it tracks the status
>>> of all slaves itself. But I guess if ZK is used for membership
>>> management, neither the leader nor the slaves will be considered
>>> disconnected because they can all connect to ZK.
>>> Thanks,
>>> Lei
>>> On 4/30/10 3:47 PM, "Mahadev Konar"<mahadev@yahoo-inc.com>  wrote:
>>>> Hi Lei,
>>>> In this case, the Leader will be disconnected from the ZK cluster and will give
>>>> up its leadership. Since it's disconnected, the ZK cluster will realize that the
>>>> Leader is dead.
>>>> When the ZK cluster realizes that the Leader is dead (because the ZK
>>>> cluster hasn't heard from the Leader for a certain time, the configurable
>>>> session timeout parameter), the slaves will be notified of this via watchers
>>>> in the ZooKeeper cluster. The slaves will realize that the Leader is gone,
>>>> re-elect a new Leader, and start working with the new Leader.
>>>> Does that answer your question?
>>>> You might want to look through the ZK documentation to understand its use
>>>> cases and how it solves these kinds of issues.
>>>> Thanks
>>>> mahadev
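The re-election step in the message above is commonly built on ephemeral sequential znodes, where the candidate with the lowest sequence number leads and every other candidate watches the one just ahead of it. The thread does not say which recipe is used, so this sketch assumes that one; the `n_` name prefix and the `elect` helper are hypothetical, and the child list is passed in rather than fetched from ZooKeeper.

```python
def sequence_of(name):
    """Extract the numeric suffix ZooKeeper appends to a sequential znode name."""
    return int(name.rsplit("_", 1)[1])


def elect(children, my_name):
    """Return (leader, node_to_watch) for the candidate named my_name.

    children is the list of candidate znode names under the election path.
    node_to_watch is None when my_name is itself the leader.
    """
    ordered = sorted(children, key=sequence_of)
    leader = ordered[0]
    my_index = ordered.index(my_name)
    # Watching only the immediate predecessor avoids waking every
    # candidate when a single node disappears.
    node_to_watch = ordered[my_index - 1] if my_index > 0 else None
    return leader, node_to_watch
```

When the leader's session times out, its ephemeral znode is removed, the candidate watching it is notified, and calling `elect` again on the remaining children yields the new leader.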
>>>> On 4/30/10 2:08 PM, "Lei Gao"<lgao@linkedin.com>  wrote:
>>>>> Thank you all for your answers. It clarifies a lot of my confusion about the
>>>>> service guarantees of ZK. I am still struggling with one failure case (I am
>>>>> not trying to be a pain in the neck. But I need to have a full
>>>>> understanding of what ZK can offer before I make a decision on whether
>>>>> to use it in my cluster.)
>>>>> Assume the following topology:
>>>>>           Leader  ==== ZK cluster
>>>>>                \\                    //
>>>>>                 \\                  //
>>>>>                   \\               //
>>>>>                        Slave(s)
>>>>> If I have an asymmetric network failure such that the connections between Leader
>>>>> and Slave(s) are broken while all other connections are still alive, would
>>>>> my system hang after some point? Because no new leader election will be
>>>>> initiated by the slaves and the leader can't get the work to the slave(s).
>>>>> Thanks,
>>>>> Lei
>>>>> On 4/30/10 1:54 PM, "Ted Dunning"<ted.dunning@gmail.com>  wrote:
>>>>>> If one of your user clients can no longer reach one member of the
>>>>>> cluster, then it will try to reach another.  If it succeeds, then it will
>>>>>> continue without any problems as long as the ZK cluster itself is
>>>>>> This applies for all the ZK recipes.  You will have to be a little
>>>>>> careful to handle connection loss, but that should get easier soon (and it
>>>>>> isn't all that difficult anyway).
>>>>>> On Fri, Apr 30, 2010 at 1:26 PM, Lei Gao<lgao@linkedin.com>
>>>>>>> I am not talking about the leader election within the zookeeper cluster. I
>>>>>>> guess I didn't make the discussion context clear. In my case, I run a cluster
>>>>>>> that uses zookeeper for doing the leader election. Yes, nodes in my cluster
>>>>>>> are the clients of zookeeper.  Those nodes depend on zookeeper to elect a new
>>>>>>> leader and figure out what the current leader is. So if the zookeeper cluster
>>>>>>> (think of it as a stand-alone entity) becomes unavailable in the way I
>>>>>>> described earlier, how can I handle such a situation so my cluster can still
>>>>>>> function while a majority of nodes still connect to each other (but not to the
>>>>>>> zookeeper)?
