zookeeper-dev mailing list archives

From Diogo Becker <di...@se.inf.tu-dresden.de>
Subject Re: Faster Leader Election
Date Tue, 03 May 2011 06:12:41 GMT
Hi André,

André Oriani <ra078686@students.ic.unicamp.br> writes:

> Hi ,
> Where can I find a good description of the fast leader election used
> by ZooKeeper. I would like to understand better how logical clock is
> used and the two termination conditions ( a quorum has voted  for  a
> certain peer  or have received votes from all peers) , if I have
> identified correctly.

I don't think there is such a description anywhere. Here is what I
understand.

The logical clocks are used to identify different instances of the
election. Every time the leader crashes, the peers time out and the
logical clock is incremented. The same happens if the election fails to
finish: a new "election instance" or "epoch" is started.

In FLE each peer proposes its own (zxid,id). The algorithm then
converges by always broadcasting the highest pair. The hope is that the
peers converge before the algorithm times out. If a message is received
from a future epoch, the peer starts participating in that epoch.
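
To make the convergence rule concrete, here is a minimal Python sketch. It
is not the ZooKeeper code: the names Vote and on_notification are made up,
and I assume that pairs are compared zxid first, then id, and that a peer
joining a future epoch carries its current proposal over.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass(frozen=True)
class Vote:
    epoch: int    # logical clock of the election instance
    zxid: int     # zxid of the proposed peer
    peer_id: int  # id of the proposed peer


def on_notification(current: Vote, received: Vote) -> Tuple[Vote, bool]:
    """Return the vote to hold next and whether it must be rebroadcast."""
    if received.epoch < current.epoch:
        # Message from an old election instance: ignore it.
        return current, False
    # Same epoch, or a future epoch that we now join (assumption: we keep
    # our current proposal when joining).
    best = max((current.zxid, current.peer_id),
               (received.zxid, received.peer_id))
    new_vote = Vote(received.epoch, best[0], best[1])
    # Rebroadcast only if our vote actually changed.
    return new_vote, new_vote != current
```

A peer would call on_notification for every incoming message and broadcast
the returned vote whenever the flag is true.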

Because Zab only works when there is a quorum of nodes up and connected,
there is an additional condition to return from lookForLeader: a peer
has to have received messages from a quorum of peers. This avoids
failing the election before enough nodes are up. In fact, the timeout is
only started after a quorum has answered (is up). This timeliness
requirement is, however, strictly stronger than the one required by Zab
itself. Note that if all peers replied in the same epoch, there is no
need to wait for the timeout.
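
As a rough sketch of this waiting rule (in Python, not the actual code):
the function names are hypothetical, and I assume a simple majority quorum
and that the set of responders includes the local peer.

```python
def is_quorum(responders: set, ensemble_size: int) -> bool:
    # Assumption: a quorum is a strict majority of the ensemble.
    return len(responders) > ensemble_size // 2


def can_stop_waiting(responders: set, ensemble_size: int,
                     timeout_expired: bool) -> bool:
    """Decide whether the peer may stop collecting messages for this epoch."""
    if not is_quorum(responders, ensemble_size):
        # Not enough peers are up yet; the timeout has not even started.
        return False
    if len(responders) == ensemble_size:
        # Everybody replied in this epoch: no need to wait for the timeout.
        return True
    # A quorum answered; wait until the timeout runs out.
    return timeout_expired
```

With five peers, three responders are enough to start the timeout, and
five responders end the wait immediately.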

There is a second termination case (the default branch of the switch). That
is the case when a peer starts an election when there is already a
leader.  The peer gets "there-is-already-a-leader-for-epoch-X" messages
back.  If a quorum of peers answered with the same leader and epoch,
lookForLeader returns.  
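
A small Python sketch of this second case, under my own assumptions: the
function name and the shape of the replies are invented for illustration,
and a quorum is taken to be a strict majority.

```python
from collections import Counter
from typing import Dict, Optional, Tuple


def find_established_leader(replies: Dict[int, Tuple[int, int]],
                            ensemble_size: int) -> Optional[Tuple[int, int]]:
    """replies maps a peer id to the (leader_id, epoch) that peer reported.

    Return the (leader_id, epoch) reported by a majority, or None.
    """
    counts = Counter(replies.values())
    for leader_and_epoch, n in counts.items():
        if n > ensemble_size // 2:
            return leader_and_epoch
    return None
```

Replies that name the same leader but a different epoch are counted
separately, so they do not add up to a quorum.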

Important in both termination cases (electing a new leader, and getting
the actual leader) is that the check for a quorum of votes guarantees
that the candidate (or leader) is in this quorum. This avoids the case
where the ensemble repeatedly elects a crashed peer.
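
The property can be spelled out as an explicit check. This Python sketch
only illustrates the idea, it is not how ZooKeeper implements it; the
function name is hypothetical and a quorum is assumed to be a strict
majority.

```python
from typing import Dict


def quorum_includes_candidate(votes: Dict[int, int], candidate: int,
                              ensemble_size: int) -> bool:
    """votes maps a voter id to the id of the peer it currently votes for."""
    supporters = {voter for voter, choice in votes.items()
                  if choice == candidate}
    # The candidate must have voted for itself (so it is alive) and a
    # majority of the ensemble must support it.
    return candidate in supporters and len(supporters) > ensemble_size // 2
```

If peer 3 has crashed, it never sends its own vote, so the check fails
even when three of the five peers still vote for it.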

> The first thing QuorumCnxManager does when connecting to a peer is to
> send its id. Is this done to avoid any problem with concurrent connections?

Yes. Peer A opens a TCP connection to peer B, and vice versa. Both send
their ids, and the higher id wins (the other connection is closed).
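
In Python the tie-break could look like the sketch below. The function
name is made up, and I read "wins" as "the connection opened by the peer
with the higher id is the one that is kept", which is an assumption.

```python
def connection_to_keep(local_id: int, remote_id: int) -> str:
    """Return which of the two concurrent connections this peer keeps.

    "outgoing" is the connection this peer opened, "incoming" is the one
    the remote peer opened.
    """
    if local_id == remote_id:
        raise ValueError("peer ids must be distinct")
    # Assumption: the connection initiated by the higher id survives.
    return "outgoing" if local_id > remote_id else "incoming"
```

Both sides apply the same rule to the same pair of ids, so they agree on
which connection survives without any extra messages.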

Ciao, Diogo

> Thanks and Regards,
> André Oriani
