zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Missing session state handling in most Leader Election implementations
Date Mon, 14 Nov 2011 06:01:05 GMT
In general all leader elections have to pay attention to disconnection.
 The action they need to take when disconnection occurs varies according to
the application.

Concisely put, the application needs to specify whether it wants (a) at
most one leader, but possibly zero, (b) at least one leader, but possibly
more than one or (c) as close to one leader, but possibly more than one or
zero for (hopefully) short periods of time.  I don't think that it is
possible to do better than (c) in view of PAC considerations, especially in
the real world where clocks are not even reliably monotonic, but you can
definitely weight the odds so that actual behavior is much more like (a) or

For specification (a), a master must go into a waiting state when
disconnected and immediately stop acting as master because it cannot know
precisely whether or when somebody else might be elected leader.  Upon
reconnection, either it will still be master or a session expiration event
will be received because it is no longer master.  There is no way to know
whether another node has become master in the meantime, nor is there any
way for other nodes to know that the erstwhile master is still thinking
that it might still be master (since it may have perished rather than
simply partitioned away).  Because there is a short period between being
disconnected and recognizing that fact, especially if clocks are confused,
there is a small possibility of more than one master at any given time, but
this is very unlikely.

For specification (b), a master should continue acting as master until
notified by a session expiration that it is not longer such.  This has the
obvious issue that two masters may exist even after a network partition is
healed.  It also has the problem that it takes time for an ephemeral to
vanish which means that there can be a time with no masters.

For specification (c), a master that is disconnected should try to remain a
master as long as it thinks that its ephemeral still exists.  Such an
estimation clearly depends on clock accuracy and is subject to uncertainty
since the actual partition time cannot be known precisely.

I don't know of any open source leader election algorithms that deal with
these differing needs well.  Most don't even deal with strategy (a) very
well.  Neither Netflix library nor Curator seem to at any rate.

It is also unfortunate to only see leader election algorithms that depend
on sequential ephemerals.  The herd effect is not an issue for typical
leader elections which almost always involve at most a few dozen potential
leaders and typically involve only a handfull of potential leaders.

In my own implementations, I typically also try to account for the distinct
possibility that my reasoning about the correctness of my implementation is
faulty.  Typically, I do this by including a watchdog thread that inspects
the state of the world at intervals to detect when something has gone awry.
 For leader election, this is simply ... if I can get a valid indication
that there is no leader file, I know that I should immediately start
another election.  Such a watchdog would have avoided the sort of problem
that was mentioned in the original posting.

In general, I make enough errors that regardless of how certain I am of my
logic, I have a hard time estimating the probability that I am actually
correct at higher than 99%.  If I want a system that operates correctly at
a probability of, say, 5 nines, I need to decrease the possibility that I
have erred by 3 orders of magnitude.  One way to do this is to have
somebody who appears more reliable than I to work on the code.  Even so,
however, I have a hard time believing in my conservative and doubting heart
of hearts that Camille or Ben are more than ten times more reliable than I
am which leaves two nines left to go.  Likewise, code reviews and unit
tests are subject to the salesmanship problem and thus cannot give me three
nines of comfort.  A watchdog is nice that way since it has an independent
probability of being correct that I can imagine would decrease the
remaining chance of failure by my original 99%.

On Sun, Nov 13, 2011 at 4:56 PM, Jordan Zimmerman <jzimmerman@netflix.com>wrote:

> On 11/13/11 4:45 PM, "Jérémie BORDIER" <jeremie.bordier@gmail.com> wrote:
> >Hello Jordan,
> >
> >Thanks a lot for your answer. I tried to figure out where the handling
> >of Disconnected / Expired takes place, but so far I understood that to
> >have notifyClientClosing() called from the Lock, an exception needs to
> >be raised from somewhere. LockInternal may throw a
> I think I spoke too soon! You are right that an exception would need to
> get thrown. Curator will notice the disconnection but the client
> application would have to do some kind of ZK operation in order to find
> out about it. This is a good find.
> I guess I could punt and say that, like checking for an interrupted
> thread, it's the user's responsibility to periodically check
> CuratorZookeeperClient.isConnected(). But this might be too much to ask
> considering my goal with Curator is to alleviate users of doing these
> kinds of things. I need to think more about thisŠ
> -JZ

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message