zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Zili Chen <wander4...@gmail.com>
Subject Re: Leader election and leader operation based on zookeeper
Date Mon, 30 Sep 2019 04:41:23 GMT
Hi Jordan,

Here is a possible edge case of coordination node way.

   - When an instance becomes leader it:
      - Gets the version of the coordination ZNode
      - Sets the data for that ZNode (the contents don't matter) using the
      retrieved version number
      - If the set succeeds you can be assured you are currently leader
      (otherwise release leadership and re-contend)
      - Save the new version

Actually, it is NOT atomic that an instance becomes leader and it gets the
version of the coordination znode. So an edge case is,

1. instance-1 becomes leader, trying to get the version of the coordination
2. instance-2 becomes leader, update the coordination znode.
3. instance-1 gets the newer version and re-update the coordination znode.

Generally speaking instance-1 suffers session expire but since Curator
retries on session expire that cases above is possible. Although
instance-2 will be mislead that itself not the leader and give up
leadership so that the algorithm can proceed and instance-1 will be
asynchronously notified it is not the leader, before the notification
instance-1 possibly performs some operations already.

Curator should ensure that instance-1 will not regard itself as the leader
with some synchronize logic. Or just use a cached leader latch path
for checking because the leader latch path when it becomes leader is
synchronized to be the exact one. To be more clear, for leader latch
path, I don't mean the volatile field, but the one cached when it becomes


Zili Chen <wander4096@gmail.com> 于2019年9月22日周日 上午2:43写道:

> >the Curator recipes delete and recreate their paths
> However, as mentioned above, we do a one-shot election(doesn't reuse the
> curator recipe) so that
> we check the latch path is always the path in the epoch the contender
> becomes leader. You can check
> out an implementation of the design here[1]. Even we want to enable
> re-contending we can set a guard
> (change state -> track latch path)
> and check the state in LEADING && path existence. ( so we don't misleading
> and check a wrong path )
> Checking version and a coordinate znode sounds another valid solution. I'm
> glad to see it in the future
> Curator version and if there is a valid ticket I can help to dig out a bit
> :-)
> Best,
> tison.
> [1]
> https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
> Jordan Zimmerman <jordan@jordanzimmerman.com> 于2019年9月22日周日 上午2:31写道:
>> The issue is that the leader path doesn't stay constant. Every time there
>> is a network partition, etc. the Curator recipes delete and recreate their
>> paths. So, I'm concerned that client code trying to keep track of the
>> leader path would be error prone (it's one reason that they aren't public -
>> it's volatile internal state).
>> -Jordan
>> On Sep 21, 2019, at 1:26 PM, Zili Chen <wander4096@gmail.com> wrote:
>> Hi Jordan,
>> >I think using the leader path may not work
>> could you share a situation where this strategy does not work? For the
>> design we do leader contending
>> one-shot and when perform a transaction, checking the existence of latch
>> path && in state LEADING.
>> Given the election algorithm works, state transited to LEADING when its
>> latch path once became
>> the smallest sequential znode. So the existence of latch path guarding
>> that nobody else becoming leader.
>> Jordan Zimmerman <jordan@jordanzimmerman.com> 于2019年9月22日周日 上午12:58写道:
>>> Yeah, Ted - I think this is basically the same thing. We should all try
>>> to poke holes in this.
>>> -JZ
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>> I would suggest that using an epoch number stored in ZK might be
>>> helpful. Every operation that the master takes could be made conditional on
>>> the epoch number using a multi-transaction.
>>> Unfortunately, as you say, you have to have the update of the epoch be
>>> atomic with becoming leader.
>>> The natural way to do this is to have an update of an epoch file be part
>>> of the leader election, but that probably isn't possible using Curator. The
>>> way I would tend to do it would be have a persistent file that is updated
>>> atomically as part of leader election. The version of that persistent file
>>> could then be used as the epoch number. All updates to files that are gated
>>> on the epoch number would only proceed if no other master has been elected,
>>> at least if you use the sync option.
>>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4096@gmail.com> wrote:
>>>> Hi ZooKeepers,
>>>> Recently there is an ongoing refactor[1] in Flink community aimed at
>>>> overcoming several inconsistent state issues on ZK we have met. I come
>>>> here to share our design of leader election and leader operation. For
>>>> leader operation, it is operation that should be committed only if the
>>>> contender is the leader. Also CC Curator mailing list because it also
>>>> contains the reason why we cannot JUST use Curator.
>>>> The rule we want to keep is
>>>> **Writes on ZK must be committed only if the contender is the leader**
>>>> We represent contender by an individual ZK client. At the moment we use
>>>> Curator for leader election so the algorithm is the same as the
>>>> optimized version in this page[2].
>>>> The problem is that this algorithm only take care of leader election but
>>>> is indifferent to subsequent operations. Consider the scenario below:
>>>> 1. contender-1 becomes the leader
>>>> 2. contender-1 proposes a create txn-1
>>>> 3. sender thread suspended for full gc
>>>> 4. contender-1 lost leadership and contender-2 becomes the leader
>>>> 5. contender-1 recovers from full gc, before it reacts to revoke
>>>> leadership event, txn-1 retried and sent to ZK.
>>>> Without other guard txn will success on ZK and thus contender-1 commit
>>>> a write operation even if it is no longer the leader. This issue is
>>>> also documented in this note[3].
>>>> To overcome this issue instead of just saying that we're unfortunate,
>>>> we draft two possible solution.
>>>> The first is document here[4]. Briefly, when the contender becomes the
>>>> leader, we memorize the latch path at that moment. And for
>>>> subsequent operations, we do in a transaction first checking the
>>>> existence of the latch path. Leadership is only switched if the latch
>>>> gone, and all operations will fail if the latch gone.
>>>> The second is still rough. Basically it relies on session expire
>>>> mechanism in ZK. We will adopt the unoptimized version in the
>>>> recipe[2] given that in our scenario there are only few contenders
>>>> at the same time. Thus we create /leader node as ephemeral znode with
>>>> leader information and when session expired we think leadership is
>>>> revoked and terminate the contender. Asynchronous write operations
>>>> should not succeed because they will all fail on session expire.
>>>> We cannot adopt 1 using Curator because it doesn't expose the latch
>>>> path(which is added recently, but not in the version we use); we
>>>> cannot adopt 2 using Curator because although we have to retry on
>>>> connection loss but we don't want to retry on session expire. Curator
>>>> always creates a new client on session expire and retry the operation.
>>>> I'd like to learn from ZooKeeper community that 1. is there any
>>>> potential risk if we eventually adopt option 1 or option 2? 2. is
>>>> there any other solution we can adopt?
>>>> Best,
>>>> tison.
>>>> [1] https://issues.apache.org/jira/browse/FLINK-10333
>>>> [2]
>>>> https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
>>>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
>>>> [4]
>>>> https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message