MIME-Version: 1.0
References: <CALL9TYLWPz-OtQuFZnLQCpXi2cBO3Fd_mRLGF+RKa5pUWAK6oA@mail.gmail.com>
 <CAJwFCa1gTO_Xq0g9Qs0pR=9fbyMh+oBy5D2WPKgYWKLUiVEC1Q@mail.gmail.com>
 <3D69F15F-9756-4FC8-8FB2-6BAEBC5CCF8A@jordanzimmerman.com>
 <CALL9TYJ9kwT_qj8V2A2W1kS81KOhCwNHSJ44ja91Zu33+iMRag@mail.gmail.com> <9935CD66-7652-4809-AD0D-0F6ED62F5673@jordanzimmerman.com>
In-Reply-To: <9935CD66-7652-4809-AD0D-0F6ED62F5673@jordanzimmerman.com>
From: Zili Chen <wander4096@gmail.com>
Date: Sun, 22 Sep 2019 02:43:56 +0800
Message-ID: <CALL9TYLW5+7bGtmBfwK2orwGK3c+SHA-As8ayQSAF3aYDkxanA@mail.gmail.com>
Subject: Re: Leader election and leader operation based on zookeeper
To: Jordan Zimmerman <jordan@jordanzimmerman.com>
Cc: user@curator.apache.org, user@zookeeper.apache.org
Content-Type: multipart/alternative; boundary="000000000000f4e0270593149374"

--000000000000f4e0270593149374
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

>the Curator recipes delete and recreate their paths

However, as mentioned above, we do a one-shot election (we don't reuse the
Curator recipe), so the latch path we check is always the path from the
epoch in which the contender became leader. You can check out an
implementation of the design here[1]. Even if we wanted to enable
re-contending, we could set a guard

(change state -> track latch path)

and check that the state is LEADING && the path exists (so that we are not
misled into checking the wrong path).
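
To make the guard concrete, here is a minimal sketch in Python. It is not
the Flink or Curator code: FakeStore, Contender, on_granted, on_revoked and
guarded_create are hypothetical names, and the store is only an in-memory
stand-in for a ZooKeeper-style all-or-nothing transaction. The write is
attempted only in state LEADING and commits only if the tracked latch path
still exists in the same transaction.

```python
# Hypothetical in-memory stand-in for a ZooKeeper-like store; not a real client API.
class TxnFailed(Exception):
    """Raised when any operation in a transaction fails; nothing is applied."""


class FakeStore:
    def __init__(self):
        self.nodes = {}  # path -> data

    def multi(self, ops):
        """Apply every op atomically, or none of them."""
        staged = dict(self.nodes)  # work on a copy so a failure changes nothing
        for kind, path, data in ops:
            if kind == "check_exists":
                if path not in staged:
                    raise TxnFailed("missing " + path)
            elif kind == "create":
                if path in staged:
                    raise TxnFailed("already exists " + path)
                staged[path] = data
            else:
                raise ValueError("unknown op " + kind)
        self.nodes = staged


class Contender:
    def __init__(self, store):
        self.store = store
        self.state = "NOT_LEADING"
        self.latch_path = None

    def on_granted(self, latch_path):
        # One reading of the guard above: change state, then track the latch path.
        self.state = "LEADING"
        self.latch_path = latch_path

    def on_revoked(self):
        self.state = "NOT_LEADING"
        self.latch_path = None

    def guarded_create(self, path, data):
        """Return True only if the write committed while the latch still existed."""
        if self.state != "LEADING" or self.latch_path is None:
            return False
        try:
            self.store.multi([
                ("check_exists", self.latch_path, None),
                ("create", path, data),
            ])
            return True
        except TxnFailed:
            return False


store = FakeStore()
store.nodes["/latch/c1"] = b""
c1 = Contender(store)
c1.on_granted("/latch/c1")
first = c1.guarded_create("/jobs/a", b"x")   # latch present

# contender-1 loses its latch (e.g. while paused) and contender-2 takes over
del store.nodes["/latch/c1"]
store.nodes["/latch/c2"] = b""
second = c1.guarded_create("/jobs/b", b"y")  # local state is still LEADING
```

In this sketch the second write is rejected by the store even though
contender-1 has not yet reacted to losing leadership, because the check on
the latch path and the create are applied together.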

Checking the version of a coordination znode sounds like another valid
solution. I'd be glad to see it in a future Curator version, and if there
is a valid ticket I can help dig into it a bit :-)
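
As I understand the version-based idea, it could look roughly like the
sketch below. This is only an in-memory illustration in Python: EpochStore,
become_leader and write_if_epoch are hypothetical names, and the sketch
simply assumes that bumping the coordination znode is atomic with winning
the election. In ZooKeeper itself the version check and the write would
have to go into one multi-transaction.

```python
class BadVersion(Exception):
    """Raised when a write carries an epoch that is no longer current."""


class EpochStore:
    """Hypothetical model: one persistent coordination znode plus gated data."""

    def __init__(self):
        self.epoch_version = 0  # version of the persistent coordination znode
        self.data = {}

    def become_leader(self):
        # Assumed to happen atomically with winning the election.
        self.epoch_version += 1
        return self.epoch_version

    def write_if_epoch(self, expected_version, key, value):
        """Commit the write only if no newer leader has bumped the epoch."""
        if expected_version != self.epoch_version:
            raise BadVersion(
                "expected %d, current %d" % (expected_version, self.epoch_version)
            )
        self.data[key] = value


zk = EpochStore()
epoch_1 = zk.become_leader()             # contender-1 is elected
zk.write_if_epoch(epoch_1, "job", "v1")  # accepted: epoch is current

epoch_2 = zk.become_leader()             # contender-2 is elected
try:
    zk.write_if_epoch(epoch_1, "job", "stale")  # contender-1's delayed write
    stale_write_rejected = False
except BadVersion:
    stale_write_rejected = True
```

The delayed write from the old leader fails the version check, so the data
written under the first epoch is left untouched.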

Best,
tison.

[1] https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java


On Sun, Sep 22, 2019 at 2:31 AM Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:

> The issue is that the leader path doesn't stay constant. Every time there
> is a network partition, etc. the Curator recipes delete and recreate their
> paths. So, I'm concerned that client code trying to keep track of the
> leader path would be error prone (it's one reason that they aren't public -
> it's volatile internal state).
>
> -Jordan
>
> On Sep 21, 2019, at 1:26 PM, Zili Chen <wander4096@gmail.com> wrote:
>
> Hi Jordan,
>
> >I think using the leader path may not work
>
> Could you share a situation where this strategy does not work? In our
> design, leader contending is one-shot, and when performing a transaction
> we check that the latch path exists && that the state is LEADING.
>
> Given that the election algorithm works, the state transitions to LEADING
> once the latch path becomes the smallest sequential znode. So the
> existence of the latch path guarantees that nobody else has become leader.
>
>
> On Sun, Sep 22, 2019 at 12:58 AM Jordan Zimmerman <jordan@jordanzimmerman.com> wrote:
>
>> Yeah, Ted - I think this is basically the same thing. We should all try
>> to poke holes in this.
>>
>> -JZ
>>
>> On Sep 21, 2019, at 11:54 AM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>
>>
>> I would suggest that using an epoch number stored in ZK might be helpful.
>> Every operation that the master takes could be made conditional on the
>> epoch number using a multi-transaction.
>>
>> Unfortunately, as you say, you have to have the update of the epoch be
>> atomic with becoming leader.
>>
>> The natural way to do this is to have an update of an epoch file be part
>> of the leader election, but that probably isn't possible using Curator.
>> The way I would tend to do it would be to have a persistent file that is
>> updated atomically as part of leader election. The version of that
>> persistent file could then be used as the epoch number. All updates to
>> files that are gated on the epoch number would only proceed if no other
>> master has been elected, at least if you use the sync option.
>>
>>
>>
>>
>>
>> On Fri, Sep 20, 2019 at 1:31 AM Zili Chen <wander4096@gmail.com> wrote:
>>
>>> Hi ZooKeepers,
>>>
>>> There is an ongoing refactor[1] in the Flink community aimed at
>>> overcoming several inconsistent-state issues on ZK that we have run
>>> into. I'm here to share our design for leader election and leader
>>> operations. A leader operation is an operation that should be committed
>>> only if the contender is the leader. I'm also CCing the Curator mailing
>>> list because this message explains why we cannot JUST use Curator.
>>>
>>> The rule we want to keep is
>>>
>>> **Writes on ZK must be committed only if the contender is the leader**
>>>
>>> We represent each contender by an individual ZK client. At the moment we
>>> use Curator for leader election, so the algorithm is the same as the
>>> optimized version on this page[2].
>>>
>>> The problem is that this algorithm only takes care of leader election
>>> but is indifferent to subsequent operations. Consider the scenario below:
>>>
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. the sender thread is suspended by a full GC
>>> 4. contender-1 loses leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from the full GC; before it reacts to the
>>> leadership-revoked event, txn-1 is retried and sent to ZK.
>>>
>>> Without another guard, the txn will succeed on ZK, and thus contender-1
>>> commits a write operation even though it is no longer the leader. This
>>> issue is also documented in this note[3].
>>>
>>> To overcome this issue, instead of just saying that we're unfortunate,
>>> we drafted two possible solutions.
>>>
>>> The first is documented here[4]. Briefly, when the contender becomes the
>>> leader, we memorize the latch path at that moment. For subsequent
>>> operations, we use a transaction that first checks the existence of the
>>> latch path. Leadership is only switched if the latch is gone, and all
>>> operations will fail if the latch is gone.
>>>
>>> The second is still rough. Basically, it relies on the session
>>> expiration mechanism in ZK. We will adopt the unoptimized version in the
>>> recipe[2], given that in our scenario there are only a few contenders
>>> at the same time. Thus we create the /leader node as an ephemeral znode
>>> with leader information, and when the session expires we consider
>>> leadership revoked and terminate the contender. Asynchronous write
>>> operations should not succeed because they will all fail on session
>>> expiration.
>>>
>>> We cannot adopt 1 using Curator because it doesn't expose the latch
>>> path (this was added recently, but not in the version we use); we
>>> cannot adopt 2 using Curator because, although we have to retry on
>>> connection loss, we don't want to retry on session expiration. Curator
>>> always creates a new client on session expiration and retries the
>>> operation.
>>>
>>> I'd like to learn from the ZooKeeper community: 1. is there any
>>> potential risk if we eventually adopt option 1 or option 2? 2. is
>>> there any other solution we can adopt?
>>>
>>> Best,
>>> tison.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10333
>>> [2]
>>> https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
>>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
>>> [4] https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
>>>
>>>
>>>
>>
>
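
The retry policy described for the second option in the original message
(retry on connection loss, but terminate on session expiration) could be
sketched as below. This is a plain Python illustration, not Curator or
ZooKeeper client code: ConnectionLoss, SessionExpired and run_as_leader are
hypothetical names, and the retry limit is an arbitrary assumption.

```python
class ConnectionLoss(Exception):
    """Transient failure: the session may still be alive, so retrying is safe."""


class SessionExpired(Exception):
    """The session is gone, so the ephemeral /leader znode is gone too."""


def run_as_leader(op, max_retries=3):
    """Retry op on connection loss; stop for good on session expiration."""
    for _ in range(max_retries + 1):
        try:
            return ("ok", op())
        except ConnectionLoss:
            continue  # same session, try again
        except SessionExpired:
            return ("terminated", None)  # leadership revoked, do not retry
    return ("gave_up", None)


calls = {"n": 0}


def flaky_write():
    # Fails twice with a transient error, then succeeds.
    calls["n"] += 1
    if calls["n"] < 3:
        raise ConnectionLoss()
    return "committed"


def expired_write():
    raise SessionExpired()


retried = run_as_leader(flaky_write)
stopped = run_as_leader(expired_write)
```

The point of the sketch is only that the two failure kinds are handled
differently: a write is never re-sent under a new session.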

--000000000000f4e0270593149374--