zookeeper-user mailing list archives

From Jordan Zimmerman <jzimmer...@netflix.com>
Subject Re: Missing session state handling in most Leader Election implementations
Date Fri, 18 Nov 2011 18:04:47 GMT
I just did a quickie test. If the whole cluster goes down, you get the
Disconnect but not a session expiration, so there is no opportunity to
transition from SUSPENDED to LOST (unless the client makes another ZK
call). This brings me back to doing the background sync().
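
A minimal sketch of the staged SUSPENDED / RECONNECTED / LOST idea in plain Java, assuming the `probe` argument stands in for a background sync() and the retry policy is simply a fixed number of attempts. The names `ConnState`, `ZkEvent`, `StagedConnectionTracker` and `probeWhileSuspended` are hypothetical, not Curator's or ZooKeeper's API. It shows why the probe matters: a ZooKeeper client only learns that its session expired after it reconnects to the ensemble, so with the whole cluster down a Disconnected event alone never leads to LOST.

```java
import java.util.function.BooleanSupplier;

// Hypothetical connection states mirroring the staged scheme in this thread.
enum ConnState { CONNECTED, SUSPENDED, RECONNECTED, LOST }

// Hypothetical stand-ins for the ZooKeeper watcher states discussed here.
enum ZkEvent { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

// Sketch only: not Curator's implementation.
final class StagedConnectionTracker {
    private final int maxAttempts;  // assumed retry policy: fixed number of probe attempts
    private ConnState state = ConnState.CONNECTED;

    StagedConnectionTracker(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    ConnState state() {
        return state;
    }

    // Feed a watcher event into the state machine and return the new state.
    ConnState onEvent(ZkEvent event) {
        switch (event) {
            case DISCONNECTED:
                // Don't give up yet: the client may reconnect to another server.
                if (state != ConnState.LOST) state = ConnState.SUSPENDED;
                break;
            case EXPIRED:
                // The session is gone; leadership must be abandoned.
                state = ConnState.LOST;
                break;
            case SYNC_CONNECTED:
                if (state == ConnState.SUSPENDED) state = ConnState.RECONNECTED;
                break;
        }
        return state;
    }

    // While SUSPENDED, actively probe (e.g. a background sync()) because an
    // Expired event may never arrive if the whole ensemble is unreachable.
    ConnState probeWhileSuspended(BooleanSupplier probe) {
        if (state != ConnState.SUSPENDED) return state;
        for (int attempt = 0; attempt < maxAttempts; attempt++) {
            if (probe.getAsBoolean()) {
                state = ConnState.RECONNECTED;
                return state;
            }
            // A real retry policy would sleep/back off here; omitted to stay deterministic.
        }
        state = ConnState.LOST;  // retries exhausted
        return state;
    }
}
```

Here LOST is reached either from an Expired event or from exhausting the probe attempts, which is the difference between the "LOST only on session expiration" view and the background sync() approach.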

-JZ

On 11/18/11 9:52 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:

>Is the background sync even necessary?  The ZK client itself will
>re-establish connection if it can.
>
>I think that LOST should only be sent on session expiration.
>
>On Fri, Nov 18, 2011 at 1:07 AM, Jordan Zimmerman
><jzimmerman@netflix.com> wrote:
>
>> FYI
>>
>> Curator now has a staged connection notification mechanism for dealing
>> with issues like this. When the Curator-managed connection receives a
>> Disconnect, it posts a message to listeners that the connection is
>> SUSPENDED. If the connection can be re-established (via a background
>> sync() using the current retry policy), the listeners receive
>> RECONNECTED; otherwise they receive LOST. Thus, users of the Curator
>> LeaderSelector can know whether they should pause and/or stop their
>> leader activity.
>>
>> -JZ
>> ________________________________________
>> From: Ted Dunning [ted.dunning@gmail.com]
>> Sent: Monday, November 14, 2011 6:24 PM
>> To: user@zookeeper.apache.org
>> Subject: Re: Missing session state handling in most Leader Election
>> implementations
>>
>> On Mon, Nov 14, 2011 at 2:41 PM, Jordan Zimmerman <jzimmerman@netflix.com>
>> wrote:
>>
>> > It turns out that this is tricky to solve. When the server you're
>> > connected to goes down, you get a Watcher.Event.KeeperState.Disconnected.
>> > However, you may be able to reconnect to another server, so the
>> > Disconnected event should be ignored.
>>
>>
>> The event should not be ignored. The master should pause its work as
>> master, but not unload any major data structures. If it reconnects
>> instantly, it should continue as if nothing had happened. You can
>> also set a time limit on how long to wait before deciding to pause
>> operation as master. As you increase that time, you increase the
>> probability of two masters existing at the same time. If the reconnect
>> happens before the timeout, you don't need to bother the master.
>>
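
Ted's suggestion (pause rather than tear down, optionally only after a grace period) can be sketched as a small policy object. Everything here (`LeaderPausePolicy`, `graceMillis`, caller-supplied timestamps) is a hypothetical illustration, not part of ZooKeeper or Curator. Timestamps are passed in by the caller, for example `System.nanoTime() / 1_000_000`, so the logic stays deterministic and testable.

```java
// Hypothetical leader-side policy: tolerate a disconnect for graceMillis before pausing.
final class LeaderPausePolicy {
    private final long graceMillis;
    private boolean disconnected = false;
    private long disconnectedSince;  // only meaningful while disconnected == true

    LeaderPausePolicy(long graceMillis) {
        this.graceMillis = graceMillis;
    }

    // Record the start of a disconnect; repeated Disconnected events keep the first timestamp.
    void onDisconnected(long nowMillis) {
        if (!disconnected) {
            disconnected = true;
            disconnectedSince = nowMillis;
        }
    }

    // Reconnected: resume as if nothing had happened.
    void onReconnected() {
        disconnected = false;
    }

    // True while the leader may keep working; false means pause,
    // but keep major in-memory data structures loaded.
    boolean shouldActAsLeader(long nowMillis) {
        return !disconnected || nowMillis - disconnectedSince < graceMillis;
    }
}
```

A larger `graceMillis` hides brief disconnects from the master but, as Ted points out, widens the window in which two masters may be active at once; session expiration still has to end leadership outright.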

