zookeeper-dev mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Leader election
Date Tue, 11 Dec 2018 07:36:50 GMT
One very useful way to deal with this is the method used in MapR FS. The
idea is that ZK should only be used rarely and short periods of two leaders
must be tolerated, but other data has to be written with absolute

The method that we chose was to associate an epoch number with every write,
require all writes to go to all replicas, and require that all replicas only
acknowledge writes that match their idea of the current epoch for an object.
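
As a rough illustration of that rule (not MapR FS code), here is a minimal
Python sketch. The names Replica and replicated_write are hypothetical, and
the replicas are in-memory objects rather than network peers.

```python
from dataclasses import dataclass, field


@dataclass
class Replica:
    """Hypothetical in-memory replica; stands in for a real storage node."""
    name: str
    epoch: int = 1
    log: list = field(default_factory=list)

    def write(self, epoch: int, data: str) -> bool:
        # Acknowledge only writes stamped with this replica's current epoch.
        if epoch != self.epoch:
            return False
        self.log.append((epoch, data))
        return True


def replicated_write(replicas, epoch: int, data: str) -> bool:
    # The write is sent to every replica and counts as successful only if
    # all of them acknowledge it. Cleaning up after a partial failure is
    # out of scope for this sketch.
    acks = [r.write(epoch, data) for r in replicas]
    return all(acks)
```

A writer holding a stale epoch gets no acknowledgement from any replica that
has moved on, so the write as a whole fails.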

What happens in the event of a partition is that we have a few possible cases,
but in any case where data replicas are split by a partition, writes will
fail triggering a new leader election. Only replicas on the side of the new
ZK quorum (which may be the old quorum) have a chance of succeeding here.
If the replicas are split away from the ZK quorum, writes may not be
possible until the partition heals. If a new leader is elected, it will
increment the epoch and form a replication chain out of the replicas it can
find telling them about the new epoch. Writes can then proceed. During
partition healing, any pending writes from the old epoch will be ignored by
the current replicas. None of the writes to the new epoch will be directed
to the old replicas after partition healing, but such writes should be
ignored as well.
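
The sequence above can be sketched as follows. This is a simplified model
under stated assumptions: the Replica class, elect_new_leader, and the
reachable flag are all hypothetical, the ZK-backed election itself is not
modelled, and "reachable" means reachable from the new leader.

```python
class Replica:
    """Hypothetical in-memory replica used only for this illustration."""

    def __init__(self, name, epoch=1):
        self.name = name
        self.epoch = epoch
        self.reachable = True  # assumption: visibility from the new leader
        self.log = []

    def write(self, epoch, data):
        # Unreachable replicas cannot ack; reachable ones ack only writes
        # that carry their current epoch.
        if not self.reachable or epoch != self.epoch:
            return False
        self.log.append((epoch, data))
        return True


def write_all(chain, epoch, data):
    # A write succeeds only if every replica in the chain acknowledges it.
    return all([r.write(epoch, data) for r in chain])


def elect_new_leader(all_replicas, current_epoch):
    # The new leader increments the epoch, builds a chain from the replicas
    # it can find, and tells each of them about the new epoch.
    new_epoch = current_epoch + 1
    chain = [r for r in all_replicas if r.reachable]
    for r in chain:
        r.epoch = new_epoch
    return new_epoch, chain


a, b, c = Replica("a"), Replica("b"), Replica("c")
old_epoch = 1
c.reachable = False                              # partition cuts off c
failed = write_all([a, b, c], old_epoch, "w1")   # fails, triggers election
new_epoch, chain = elect_new_leader([a, b, c], old_epoch)
resumed = write_all(chain, new_epoch, "w2")      # writes proceed on new chain
c.reachable = True                               # partition heals
stale = write_all([a, b], old_epoch, "w3")       # old-epoch write is ignored
```

Note that c still holds the old epoch after healing, so it is not part of the
new chain until some side process brings it up to date.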

In a side process, replicas that have come back after a partition may be
updated with writes from the new replicas. If the partition lasts long
enough, a new replica should be formed from the members of the current
epoch. If a new replica is formed and an old one is resurrected, then the
old one should probably be deprecated, although data balancing
considerations may come into play.

In the actual implementation of MapR FS, there is a lot of sophistication
that goes into the details, of course, and there is actually one more level
of delegation that happens, but this outline is good enough for a lot of

The virtues of this system are multiple:

1) partition is detected exactly as soon as it affects a write. Detecting a
partition sooner than that doesn't serve a lot of purpose, especially since
the time to recover from a failed write is comparable to the duration of a
fair number of partitions.

2) having an old master continue under false pretenses does no harm since
it cannot write to a more recent replica chain. This is more important than
it might seem since there can be situations where clocks don't necessarily
advance at the expected rate so what seems like a short time can actually
be much longer (Rip van Winkle failures).

3) forcing writes to all live replicas while allowing reorganization is
actually very fast and as long as we can retain one live replica we can
continue writing. This is in contrast to quorum systems where dropping
below the quorum stops writes. This is important because the replicas of
different objects can be arranged so that the portion of the cluster with a
ZK quorum might not have a majority of replicas for some objects.

4) electing a new master of a replica chain can be done quite quickly, so
the duration of any degradation can be quite short. You can set write
timeouts fairly short because an unnecessary election takes less time
than a long timeout.
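
To make the contrast in point 3 concrete, here is a small Python sketch of
when writes remain possible under each scheme. Both function names are
hypothetical, and the chain case assumes a master for the chain has already
been elected, which still needs a reachable ZK quorum.

```python
def chain_can_write(live_replicas: int) -> bool:
    # Write-to-all-live-replicas: one surviving replica is enough,
    # assuming the chain has a current master.
    return live_replicas >= 1


def quorum_can_write(live_replicas: int, total_replicas: int) -> bool:
    # Majority quorum: writes stop once fewer than a majority are live.
    return live_replicas >= total_replicas // 2 + 1
```

With three replicas of an object and only one still live,
chain_can_write(1) holds while quorum_can_write(1, 3) does not.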

Anyway, you probably already have a design in mind. If this helps in any way,
that's great.

On Mon, Dec 10, 2018 at 10:32 PM Michael Borokhovich <michaelbor@gmail.com>
wrote:

> Makes sense. Thanks, Ted. We will design our system to cope with the short
> periods where we might have two leaders.
> On Thu, Dec 6, 2018 at 11:03 PM Ted Dunning <ted.dunning@gmail.com> wrote:
> > ZK is able to guarantee that there is only one leader for the purposes of
> > updating ZK data. That is because all commits have to originate with the
> > current quorum leader and then be acknowledged by a quorum of the current
> > cluster. If the leader can't get enough acks, then it has de facto lost
> > leadership.
> >
> > The problem comes when there is another system that depends on ZK data.
> > Such data might record which node is the leader for some other purposes.
> > That leader will only assume that they have become leader if they succeed
> > in writing to ZK, but if there is a partition, then the old leader may not
> > be notified that another leader has established themselves until some time
> > after it has happened. Of course, if the erstwhile leader tried to validate
> > its position with a write to ZK, that write would fail, but you can't spend
> > 100% of your time doing that.
> >
> > It all comes down to the fact that a ZK client determines that it is
> > connected to a ZK cluster member by pinging and that cluster member sees
> > heartbeats from the leader. The fact is, though, that you can't tune these
> > pings to be faster than some level because you start to see lots of false
> > positives for loss of connection. Remember that it isn't just loss of
> > connection here that is the point. Any kind of delay would have the same
> > effect. Getting these ping intervals below one second makes for a very
> > twitchy system.
> >
