zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jérémie BORDIER <jeremie.bord...@gmail.com>
Subject Re: Missing session state handling in most Leader Election implementations
Date Mon, 14 Nov 2011 11:25:18 GMT
Hello Guys,

Good to hear. I was really wondering if I was missing something but
took the risk to sound like an idiot on the list, really happy to have
pointed that out.

Jordan, as you said, pinging CuratorZookeeperClient.isConnected would
be a way to do this, but after a quick look at what isConnected does
and how the underlying state does take care of Disconnected / Expired
events, I really think it would be a shame to have another loop
pinging instead of linking the code properly so the events are
propagated up to the Election listener. I'd be really happy to
contribute as we plan on using Curator, but I think this may impact
all the recipes so you're probably the best person to link these bits

Ted, can't add much to what you said, totally agree. We already
planned on having a watchdog on this anyway. Having several master for
a very short time wouldn't be a problem for us, but this must be an
ephemeral state. (We use master election to determine a single host
responsible of generating unique 64 bit IDs with the first 16 bits
dedicated to an incremental "master id", so when a new master gets
elected, the whole range changes. Having 2 masters for a few seconds
wouldn't be an issue as the IDs cannot overlap, but still that's
something we want to avoid as much as possible).


On Mon, Nov 14, 2011 at 7:01 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
> In general all leader elections have to pay attention to disconnection.
>  The action they need to take when disconnection occurs varies according to
> the application.
> Concisely put, the application needs to specify whether it wants (a) at
> most one leader, but possibly zero, (b) at least one leader, but possibly
> more than one or (c) as close to one leader, but possibly more than one or
> zero for (hopefully) short periods of time.  I don't think that it is
> possible to do better than (c) in view of PAC considerations, especially in
> the real world where clocks are not even reliably monotonic, but you can
> definitely weight the odds so that actual behavior is much more like (a) or
> (b).
> For specification (a), a master must go into a waiting state when
> disconnected and immediately stop acting as master because it cannot know
> precisely whether or when somebody else might be elected leader.  Upon
> reconnection, either it will still be master or a session expiration event
> will be received because it is no longer master.  There is no way to know
> whether another node has become master in the meantime, nor is there any
> way for other nodes to know that the erstwhile master is still thinking
> that it might still be master (since it may have perished rather than
> simply partitioned away).  Because there is a short period between being
> disconnected and recognizing that fact, especially if clocks are confused,
> there is a small possibility of more than one master at any given time, but
> this is very unlikely.
> For specification (b), a master should continue acting as master until
> notified by a session expiration that it is not longer such.  This has the
> obvious issue that two masters may exist even after a network partition is
> healed.  It also has the problem that it takes time for an ephemeral to
> vanish which means that there can be a time with no masters.
> For specification (c), a master that is disconnected should try to remain a
> master as long as it thinks that its ephemeral still exists.  Such an
> estimation clearly depends on clock accuracy and is subject to uncertainty
> since the actual partition time cannot be known precisely.
> I don't know of any open source leader election algorithms that deal with
> these differing needs well.  Most don't even deal with strategy (a) very
> well.  Neither Netflix library nor Curator seem to at any rate.
> It is also unfortunate to only see leader election algorithms that depend
> on sequential ephemerals.  The herd effect is not an issue for typical
> leader elections which almost always involve at most a few dozen potential
> leaders and typically involve only a handfull of potential leaders.
> In my own implementations, I typically also try to account for the distinct
> possibility that my reasoning about the correctness of my implementation is
> faulty.  Typically, I do this by including a watchdog thread that inspects
> the state of the world at intervals to detect when something has gone awry.
>  For leader election, this is simply ... if I can get a valid indication
> that there is no leader file, I know that I should immediately start
> another election.  Such a watchdog would have avoided the sort of problem
> that was mentioned in the original posting.
> In general, I make enough errors that regardless of how certain I am of my
> logic, I have a hard time estimating the probability that I am actually
> correct at higher than 99%.  If I want a system that operates correctly at
> a probability of, say, 5 nines, I need to decrease the possibility that I
> have erred by 3 orders of magnitude.  One way to do this is to have
> somebody who appears more reliable than I to work on the code.  Even so,
> however, I have a hard time believing in my conservative and doubting heart
> of hearts that Camille or Ben are more than ten times more reliable than I
> am which leaves two nines left to go.  Likewise, code reviews and unit
> tests are subject to the salesmanship problem and thus cannot give me three
> nines of comfort.  A watchdog is nice that way since it has an independent
> probability of being correct that I can imagine would decrease the
> remaining chance of failure by my original 99%.
> On Sun, Nov 13, 2011 at 4:56 PM, Jordan Zimmerman <jzimmerman@netflix.com>wrote:
>> On 11/13/11 4:45 PM, "Jérémie BORDIER" <jeremie.bordier@gmail.com> wrote:
>> >Hello Jordan,
>> >
>> >Thanks a lot for your answer. I tried to figure out where the handling
>> >of Disconnected / Expired takes place, but so far I understood that to
>> >have notifyClientClosing() called from the Lock, an exception needs to
>> >be raised from somewhere. LockInternal may throw a
>> I think I spoke too soon! You are right that an exception would need to
>> get thrown. Curator will notice the disconnection but the client
>> application would have to do some kind of ZK operation in order to find
>> out about it. This is a good find.
>> I guess I could punt and say that, like checking for an interrupted
>> thread, it's the user's responsibility to periodically check
>> CuratorZookeeperClient.isConnected(). But this might be too much to ask
>> considering my goal with Curator is to alleviate users of doing these
>> kinds of things. I need to think more about thisŠ
>> -JZ

Jérémie 'ahFeel' BORDIER

View raw message