zookeeper-user mailing list archives

From Zili Chen <wander4...@gmail.com>
Subject Leader election and leader operation based on zookeeper
Date Fri, 20 Sep 2019 08:30:51 GMT
Hi ZooKeepers,

Recently there has been an ongoing refactor[1] in the Flink community
aimed at overcoming several inconsistent-state issues we have met on ZK.
I am writing to share our design for leader election and leader
operations. A leader operation is an operation that should be committed
only if the contender is the leader. I am also CCing the Curator mailing
list because this message explains why we cannot JUST use Curator.

The rule we want to keep is

**Writes on ZK must be committed only if the contender is the leader**

We represent each contender by an individual ZK client. At the moment we
use Curator for leader election, so the algorithm is the same as the
optimized version on this page[2].

The problem is that this algorithm only takes care of leader election
and is indifferent to subsequent operations. Consider the scenario below:

1. contender-1 becomes the leader
2. contender-1 proposes a create, txn-1
3. the sender thread is suspended by a full GC
4. contender-1 loses leadership and contender-2 becomes the leader
5. contender-1 recovers from the full GC; before it reacts to the
leadership-revoked event, txn-1 is retried and sent to ZK.

Without another guard, txn-1 will succeed on ZK, and thus contender-1
commits a write operation even though it is no longer the leader. This
issue is also documented in this note[3].

To overcome this issue, instead of just saying that we're unfortunate,
we have drafted two possible solutions.

The first is documented here[4]. Briefly, when the contender becomes the
leader, we memorize the latch path at that moment. For subsequent
operations, we use a transaction that first checks the existence of the
latch path. Leadership switches only if the latch is gone, and all
operations fail once the latch is gone.
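
To make option 1 concrete, here is a minimal sketch that replays the
scenario above against a toy in-memory store. ToyStore,
createIfLatchHeld, LatchGuardDemo and all paths are hypothetical names
for illustration only; this is not the ZooKeeper or Curator API. Against
a real ensemble I would expect the same idea to be expressed as a
ZooKeeper.multi call whose first op is Op.check(latchPath, -1), but
please treat that mapping as an assumption rather than our final design.

```java
import java.util.HashMap;
import java.util.Map;

/** Toy in-memory stand-in for the ZK data tree; NOT the ZooKeeper API. */
final class ToyStore {
    private final Map<String, String> nodes = new HashMap<>();

    synchronized void create(String path, String data) {
        if (nodes.putIfAbsent(path, data) != null) {
            throw new IllegalStateException("node exists: " + path);
        }
    }

    synchronized void delete(String path) {
        nodes.remove(path);
    }

    synchronized boolean exists(String path) {
        return nodes.containsKey(path);
    }

    /**
     * Models a transaction of [check(latchPath), create(path)]: both steps
     * run under one lock, and nothing is written if either step fails.
     */
    synchronized boolean createIfLatchHeld(String latchPath, String path, String data) {
        if (!nodes.containsKey(latchPath) || nodes.containsKey(path)) {
            return false;
        }
        nodes.put(path, data);
        return true;
    }
}

final class LatchGuardDemo {
    /** Replays the five-step scenario; returns true if stale txn-1 was committed. */
    static boolean staleWriteCommitted() {
        ToyStore zk = new ToyStore();
        zk.create("/latch/contender-1", "");  // step 1: contender-1 is the leader
        // steps 2-3: txn-1 is prepared, then the sender thread stalls
        zk.delete("/latch/contender-1");      // step 4: leadership is lost...
        zk.create("/latch/contender-2", "");  // ...and contender-2 takes over
        // step 5: the delayed txn-1 finally reaches the store
        return zk.createIfLatchHeld("/latch/contender-1", "/jobs/txn-1", "from contender-1");
    }

    public static void main(String[] args) {
        System.out.println("stale write committed: " + staleWriteCommitted());
    }
}
```

In this toy model the delayed txn-1 is rejected because the memorized
latch path no longer exists, which is exactly the guard we want.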

The second is still rough. Basically, it relies on the session
expiration mechanism in ZK. We will adopt the unoptimized version of the
recipe[2], given that in our scenario there are only a few contenders
at the same time. Thus we create the /leader node as an ephemeral znode
holding the leader information, and when the session expires we consider
leadership revoked and terminate the contender. Asynchronous write
operations should not succeed because they will all fail on session
expiration.
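
A rough sketch of why option 2 should hold, again as a toy model and not
real ZooKeeper code: ToySessionStore and its methods are hypothetical
names. The model only encodes two properties we rely on: an ephemeral
znode disappears when its owning session ends, and requests from an
expired session are rejected.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Toy model of ZK sessions and an ephemeral /leader znode; NOT the ZooKeeper API. */
final class ToySessionStore {
    private final Set<Long> liveSessions = new HashSet<>();
    private final Map<String, Long> ephemeralOwner = new HashMap<>(); // path -> session id
    private final Map<String, String> data = new HashMap<>();
    private long nextSessionId = 1;

    synchronized long openSession() {
        long id = nextSessionId++;
        liveSessions.add(id);
        return id;
    }

    /** Expiring a session removes every ephemeral znode it owns. */
    synchronized void expire(long sessionId) {
        liveSessions.remove(sessionId);
        ephemeralOwner.values().removeIf(owner -> owner == sessionId);
    }

    /** Unoptimized election: whoever creates the ephemeral /leader node is the leader. */
    synchronized boolean tryAcquireLeader(long sessionId) {
        requireLive(sessionId);
        return ephemeralOwner.putIfAbsent("/leader", sessionId) == null;
    }

    /** Any write from an expired session is rejected. */
    synchronized void write(long sessionId, String path, String value) {
        requireLive(sessionId);
        data.put(path, value);
    }

    synchronized String read(String path) {
        return data.get(path);
    }

    private void requireLive(long sessionId) {
        if (!liveSessions.contains(sessionId)) {
            throw new IllegalStateException("session expired: " + sessionId);
        }
    }
}
```

Because /leader can only vanish when the old leader's session ends, any
write the old leader still has in flight belongs to a dead session and
is rejected, provided the contender never transparently reconnects with
a fresh session.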

We cannot adopt option 1 using Curator because it doesn't expose the
latch path (this was added recently, but not in the version we use); we
cannot adopt option 2 using Curator because, although we have to retry
on connection loss, we don't want to retry on session expiration.
Curator always creates a new client on session expiration and retries
the operation.
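
The retry behaviour we would want for option 2 looks roughly like the
sketch below. SessionAwareRetry, ConnectionLoss and SessionExpired are
hypothetical stand-ins (for the corresponding KeeperException
subclasses), and maxRetries is an assumed parameter; this is not how
Curator behaves, which is the point.

```java
import java.util.concurrent.Callable;

final class SessionAwareRetry {
    /** Hypothetical stand-in for a transient connection-loss error. */
    static final class ConnectionLoss extends Exception {}

    /** Hypothetical stand-in for a fatal session-expired error. */
    static final class SessionExpired extends Exception {}

    /**
     * Runs op, retrying only on ConnectionLoss (the session may still be
     * valid). SessionExpired and every other exception propagate at once,
     * so the caller can give up leadership and terminate the contender.
     */
    static <T> T call(Callable<T> op, int maxRetries) throws Exception {
        for (int attempt = 0; ; attempt++) {
            try {
                return op.call();
            } catch (ConnectionLoss e) {
                if (attempt >= maxRetries) {
                    throw e; // retry budget exhausted
                }
                // otherwise loop and try again on the same session
            }
        }
    }
}
```

A caller would wrap each leader operation in SessionAwareRetry.call and
treat a SessionExpired escaping from it as the signal that leadership is
revoked.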

I'd like to learn from the ZooKeeper community:
1. Is there any potential risk if we eventually adopt option 1 or
option 2?
2. Is there any other solution we can adopt?


[1] https://issues.apache.org/jira/browse/FLINK-10333
[2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
[3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
