zookeeper-dev mailing list archives

From Michael Borokhovich <michael...@gmail.com>
Subject Re: Leader election
Date Wed, 12 Dec 2018 07:29:33 GMT
Thanks a lot for sharing the design, Ted. It is very helpful. We will check
what is applicable to our case and let you know if we have questions.

On Mon, Dec 10, 2018 at 23:37 Ted Dunning <ted.dunning@gmail.com> wrote:

> One very useful way to deal with this is the method used in MapR FS. The
> idea is that ZK should only be used rarely, that short periods of two
> leaders must be tolerated, and that the data itself still has to be written
> with absolute consistency.
>
> The method that we chose was to associate an epoch number with every
> write, require all writes to go to all replicas, and require that replicas
> only acknowledge writes that carry their idea of the current epoch for an
> object.
>
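
As an illustration of the epoch rule just described (this is not MapR FS
code; the Replica class and method names are hypothetical), the replica-side
check might look like this in Python:

    class Replica:
        """One replica of an object, holding its own idea of the current epoch."""

        def __init__(self, object_id, epoch):
            self.object_id = object_id
            self.epoch = epoch      # this replica's idea of the current epoch
            self.log = []           # acknowledged writes, in order

        def handle_write(self, write_epoch, payload):
            # Acknowledge only writes tagged with the epoch this replica
            # believes is current; refusing anything else is what fences
            # out a stale leader after a partition.
            if write_epoch != self.epoch:
                return ("NACK", self.epoch)
            self.log.append(payload)
            return ("ACK", self.epoch)
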
> What happens in the event of a partition is that we have a few possible
> cases, but in any case where data replicas are split by a partition, writes
> will fail, triggering a new leader election. Only replicas on the side of
> the new ZK quorum (which may be the old quorum) have a chance of succeeding
> here. If the replicas are split away from the ZK quorum, writes may not be
> possible until the partition heals. If a new leader is elected, it will
> increment the epoch and form a replication chain out of the replicas it can
> find, telling them about the new epoch. Writes can then proceed. During
> partition healing, any pending writes from the old epoch will be ignored by
> the current replicas. None of the writes in the new epoch will be directed
> to the old replicas after partition healing, but such writes should be
> ignored as well.
>
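
Continuing the hypothetical sketch above, a newly elected leader could bump
the epoch (in practice the counter would live in ZK so it stays monotonic
across leaders), tell every replica it can reach, and then demand acks from
the whole chain on each write:

    def become_leader(new_epoch, reachable_replicas):
        # Form a replication chain from the replicas we can reach and tell
        # each of them about the new epoch.
        chain = []
        for replica in reachable_replicas:
            replica.epoch = new_epoch
            chain.append(replica)
        return chain                # writes now go only to this chain

    def replicated_write(epoch, chain, payload):
        # A write succeeds only if every replica in the current chain acks
        # it with the current epoch; any NACK means the chain is stale and
        # another election (with a higher epoch) is needed.
        results = [replica.handle_write(epoch, payload) for replica in chain]
        return all(status == "ACK" for status, _ in results)
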
> In a side process, replicas that have come back after a partition may be
> updated with writes from the new replicas. If the partition lasts long
> enough, a new replica should be formed from the members of the current
> epoch. If a new replica is formed and an old one is resurrected, then the
> old one should probably be deprecated, although data balancing
> considerations may come into play.
>
> In the actual implementation of MapR FS, there is a lot of sophistication
> that goes into the details, of course, and there is actually one more level
> of delegation that happens, but this outline is good enough for a lot of
> systems.
>
> The virtues of this system are multiple:
>
> 1) a partition is detected exactly as soon as it affects a write. Detecting
> a partition sooner than that doesn't serve much purpose, especially since
> the time to recover from a failed write is comparable to the duration of a
> fair number of partitions.
>
> 2) having an old master continue under false pretenses does no harm since
> it cannot write to a more recent replica chain. This is more important than
> it might seem, since there can be situations where clocks don't necessarily
> advance at the expected rate, so what seems like a short time can actually
> be much longer (Rip van Winkle failures).
>
> 3) forcing writes to all live replicas while allowing reorganization is
> actually very fast, and as long as we can retain one live replica we can
> continue writing. This is in contrast to quorum systems, where dropping
> below the quorum stops writes. This is important because the replicas of
> different objects can be arranged so that the portion of the cluster with a
> ZK quorum might not have a majority of replicas for some objects.
>
> 4) electing a new master of a replica chain can be done quite quickly, so
> the duration of any degradation can be quite short (you can set write
> timeouts fairly short, since an unnecessary election takes less time than a
> long timeout).
>
> Anyway, you probably already have a design in mind. If this helps in any
> way, that's great.
>
> On Mon, Dec 10, 2018 at 10:32 PM Michael Borokhovich <michaelbor@gmail.com>
> wrote:
>
> > Makes sense. Thanks, Ted. We will design our system to cope with the short
> > periods where we might have two leaders.
> >
> > On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning <ted.dunning@gmail.com>
> > wrote:
> >
> > > ZK is able to guarantee that there is only one leader for the purposes
> > > of updating ZK data. That is because all commits have to originate with
> > > the current quorum leader and then be acknowledged by a quorum of the
> > > current cluster. If the leader can't get enough acks, then it has de
> > > facto lost leadership.
> > >
> > > The problem comes when there is another system that depends on ZK data.
> > > Such data might record which node is the leader for some other purposes.
> > > That leader will only assume that they have become leader if they
> > > succeed in writing to ZK, but if there is a partition, then the old
> > > leader may not be notified that another leader has established
> > > themselves until some time after it has happened. Of course, if the
> > > erstwhile leader tried to validate its position with a write to ZK, that
> > > write would fail, but you can't spend 100% of your time doing that.
> > >
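
One cheap way to implement that validation write is to treat the version of
a leader znode as a fencing token, so a stale leader's conditional write
fails immediately. A rough Python sketch using the kazoo client; the
/service/leader path and the "claim, then periodically re-validate" scheme
are assumptions for this example, not something ZK prescribes:

    from kazoo.client import KazooClient
    from kazoo.exceptions import BadVersionError, NoNodeError

    LEADER_PATH = "/service/leader"   # hypothetical path for this example

    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181", timeout=5.0)
    zk.start()

    def claim_leadership(node_id):
        # Write our id into the leader znode; the resulting version acts as
        # a fencing token. If another node claims it between the get and the
        # set, the conditional set raises BadVersionError and the claim fails.
        zk.ensure_path(LEADER_PATH)
        _, stat = zk.get(LEADER_PATH)
        new_stat = zk.set(LEADER_PATH, node_id.encode(), version=stat.version)
        return new_stat.version

    def still_leader(expected_version):
        # A conditional write succeeds only if nobody has touched the znode
        # since we last wrote it; after another election the version has
        # moved on, the write fails, and the old leader knows to step down.
        try:
            new_stat = zk.set(LEADER_PATH, b"still here", version=expected_version)
            return new_stat.version   # pass this as expected_version next time
        except (BadVersionError, NoNodeError):
            return None
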
> > > It all comes down to the fact that a ZK client determines that it is
> > > connected to a ZK cluster member by pinging, and that cluster member
> > > sees heartbeats from the leader. The fact is, though, that you can't
> > > tune these pings to be faster than some level because you start to see
> > > lots of false positives for loss of connection. Remember that it isn't
> > > just loss of connection here that is the point. Any kind of delay would
> > > have the same effect. Getting these ping intervals below one second
> > > makes for a very twitchy system.
> > >
> >
>
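
On the client side, that ping/timeout trade-off shows up as the session
timeout and the connection-state events it drives. A small kazoo sketch of
reacting to them conservatively; the five-second timeout is only an
illustrative value, and the server still bounds it via tickTime and
min/maxSessionTimeout in zoo.cfg:

    from kazoo.client import KazooClient, KazooState

    # A session timeout of a few seconds: shorter values notice partitions
    # sooner but, as noted above, make the system twitchy.
    zk = KazooClient(hosts="zk1:2181,zk2:2181,zk3:2181", timeout=5.0)

    def on_state_change(state):
        if state == KazooState.SUSPENDED:
            # Connection to the quorum is in doubt: stop doing leader-only
            # work until the session is confirmed again.
            print("connection suspended; pausing leader work")
        elif state == KazooState.LOST:
            # Session expired: any leadership tied to it is gone for good.
            print("session lost; re-elect before doing leader work")
        elif state == KazooState.CONNECTED:
            print("connected; safe to re-validate leadership")

    zk.add_listener(on_state_change)
    zk.start()
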
