From: Jordan Zimmerman <jordan@jordanzimmerman.com>
To: Zili Chen
Cc: user@curator.apache.org, user@zookeeper.apache.org
Subject: Re: Leader election and leader operation based on zookeeper
Date: Tue, 1 Oct 2019 12:06:15 -0500

Yes, I think this is a hole. As I've thought more about it, I think the
method you described using the lock node in the transaction is actually
the best.

-JZ

> On Sep 29, 2019, at 11:41 PM, Zili Chen wrote:
>
> Hi Jordan,
>
> Here is a possible edge case of the coordination-node approach.
>
> When an instance becomes leader it:
> 1. Gets the version of the coordination ZNode.
> 2. Sets the data for that ZNode (the contents don't matter) using the
>    retrieved version number.
> 3. If the set succeeds, you can be assured you are currently leader
>    (otherwise release leadership and re-contend).
> 4. Saves the new version.
>
> However, an instance becoming leader and then getting the version of
> the coordination znode is NOT atomic. So an edge case is:
>
> 1. instance-1 becomes leader and tries to get the version of the
>    coordination znode.
> 2. instance-2 becomes leader and updates the coordination znode.
> 3. instance-1 gets the newer version and re-updates the coordination
>    znode.
>
> Generally speaking, this happens when instance-1 suffers a session
> expiry; since Curator retries on session expiry, the case above is
> possible. instance-2 will be misled into believing it is not the
> leader and will give up leadership, so the algorithm can proceed, but
> instance-1 is only notified asynchronously that it is not the leader,
> and before that notification arrives instance-1 may already have
> performed some operations.
>
> Curator should ensure, with some synchronization logic, that
> instance-1 does not regard itself as the leader. Or just use a cached
> leader latch path for the check, because the latch path observed at
> the moment of becoming leader is guaranteed to be the right one. To be
> clear, by leader latch path I don't mean the volatile field, but the
> one cached when the instance becomes leader.
>
> Best,
> tison.
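[For reference, the coordination-node protocol quoted above could be
sketched against the raw ZooKeeper Java API roughly as follows. This is
a minimal sketch, not the Curator API; the class name and znode path are
made up for illustration.]

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Illustrative only: COORD_PATH is a hypothetical, pre-created
    // persistent znode.
    class CoordinationNodeFence {
        private static final String COORD_PATH = "/leader/coordination";
        private final ZooKeeper zk;
        private int savedVersion = -1;  // version saved on becoming leader

        CoordinationNodeFence(ZooKeeper zk) { this.zk = zk; }

        // Called when the election recipe reports leadership. Returns
        // false if the conditional set loses, i.e. release leadership
        // and re-contend.
        boolean claimLeadership() throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            zk.getData(COORD_PATH, false, stat);            // step 1: read version
            try {
                Stat after = zk.setData(COORD_PATH, new byte[0],
                                        stat.getVersion()); // step 2: conditional set
                savedVersion = after.getVersion();          // step 4: save new version
                return true;                                // step 3: set succeeded
            } catch (KeeperException.BadVersionException e) {
                return false;
            }
        }
        // Note tison's edge case: if the getData above runs only after a
        // new leader's setData (e.g. this instance was paused across a
        // session expiry), it reads the *newer* version and the
        // conditional set still succeeds.
    }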
> On Sun, Sep 22, 2019 at 2:43 AM, Zili Chen wrote:
>
> > the Curator recipes delete and recreate their paths
>
> However, as mentioned above, we do a one-shot election (it doesn't
> reuse the Curator recipe), so we can check that the latch path is
> always the path from the epoch in which the contender became leader.
> You can check out an implementation of the design here[1]. Even if we
> want to enable re-contending, we can set a guard
>
> (change state -> track latch path)
>
> and check both that the state is LEADING and that the path exists (so
> we aren't misled into checking a wrong path).
>
> Checking the version of a coordination znode sounds like another valid
> solution. I'd be glad to see it in a future Curator version, and if
> there is a ticket I can help dig into it a bit :-)
>
> Best,
> tison.
>
> [1] https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
>
> On Sun, Sep 22, 2019 at 2:31 AM, Jordan Zimmerman wrote:
>
> The issue is that the leader path doesn't stay constant. Every time
> there is a network partition, etc., the Curator recipes delete and
> recreate their paths. So I'm concerned that client code trying to keep
> track of the leader path would be error-prone (it's one reason the
> paths aren't public - they're volatile internal state).
>
> -Jordan
>
>> On Sep 21, 2019, at 1:26 PM, Zili Chen wrote:
>>
>> Hi Jordan,
>>
>> > I think using the leader path may not work
>>
>> Could you share a situation where this strategy does not work? In the
>> design we do the leader contending one-shot, and when performing a
>> transaction we check both the existence of the latch path and that we
>> are in state LEADING.
>>
>> Given that the election algorithm works, the state transitions to
>> LEADING once the contender's latch path has become the smallest
>> sequential znode. So the existence of the latch path guards against
>> anybody else having become leader.
>>
>> On Sun, Sep 22, 2019 at 12:58 AM, Jordan Zimmerman wrote:
>>
>> Yeah, Ted - I think this is basically the same thing. We should all
>> try to poke holes in this.
>>
>> -JZ
>>
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning wrote:
>>>
>>> I would suggest that using an epoch number stored in ZK might be
>>> helpful. Every operation that the master takes could be made
>>> conditional on the epoch number using a multi-transaction.
>>>
>>> Unfortunately, as you say, you have to make the update of the epoch
>>> atomic with becoming leader.
>>>
>>> The natural way to do this is to have an update of an epoch file be
>>> part of the leader election, but that probably isn't possible using
>>> Curator. The way I would tend to do it would be to have a persistent
>>> file that is updated atomically as part of leader election. The
>>> version of that persistent file could then be used as the epoch
>>> number. All updates to files that are gated on the epoch number
>>> would only proceed if no other master has been elected, at least if
>>> you use the sync option.
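[Ted's epoch-gated write could be sketched with a ZooKeeper
multi-transaction roughly as below. This assumes `epochVersion` was
somehow captured atomically at election time - the hard part he points
out - and a pre-created persistent epoch znode; names are illustrative.]

    import java.util.Arrays;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class EpochGatedWrites {
        private static final String EPOCH_PATH = "/leader/epoch"; // illustrative
        private final ZooKeeper zk;
        private final int epochVersion; // version of EPOCH_PATH at election time

        EpochGatedWrites(ZooKeeper zk, int epochVersion) {
            this.zk = zk;
            this.epochVersion = epochVersion;
        }

        // Every master write is bundled with a version check on the epoch
        // node; if another master was elected (bumping the epoch), the
        // whole transaction fails with BadVersion and nothing is committed.
        void gatedCreate(String path, byte[] data)
                throws KeeperException, InterruptedException {
            zk.multi(Arrays.asList(
                    Op.check(EPOCH_PATH, epochVersion),
                    Op.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT)));
        }
    }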
>>> On Fri, Sep 20, 2019 at 1:31 AM, Zili Chen wrote:
>>>
>>> Hi ZooKeepers,
>>>
>>> Recently there has been an ongoing refactor[1] in the Flink
>>> community aimed at overcoming several inconsistent-state issues on
>>> ZK that we have met. I'm writing to share our design for leader
>>> election and leader operation. A leader operation is an operation
>>> that should be committed only if the contender is the leader. I also
>>> CC the Curator mailing list because this mail contains the reason
>>> why we cannot JUST use Curator.
>>>
>>> The rule we want to keep is:
>>>
>>> **Writes on ZK must be committed only if the contender is the leader**
>>>
>>> We represent a contender by an individual ZK client. At the moment
>>> we use Curator for leader election, so the algorithm is the same as
>>> the optimized version on this page[2].
>>>
>>> The problem is that this algorithm only takes care of leader
>>> election and is indifferent to subsequent operations. Consider the
>>> scenario below:
>>>
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. the sender thread is suspended by a full GC
>>> 4. contender-1 loses leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from the full GC; before it reacts to the
>>>    revoke-leadership event, txn-1 is retried and sent to ZK
>>>
>>> Without another guard, the txn will succeed on ZK, so contender-1
>>> commits a write operation even though it is no longer the leader.
>>> This issue is also documented in this note[3].
>>>
>>> To overcome this issue, instead of just saying that we're
>>> unfortunate, we drafted two possible solutions.
>>>
>>> The first is documented here[4]. Briefly, when the contender becomes
>>> the leader, we memorize the latch path at that moment. For
>>> subsequent operations, we issue a transaction that first checks the
>>> existence of the latch path. Leadership is only switched if the
>>> latch is gone, and all operations will fail once the latch is gone.
>>>
>>> The second is still rough. Basically it relies on the session-expiry
>>> mechanism in ZK. We would adopt the unoptimized version of the
>>> recipe[2], given that in our scenario there are only a few
>>> contenders at a time. We create the /leader node as an ephemeral
>>> znode holding the leader information, and when the session expires
>>> we consider leadership revoked and terminate the contender.
>>> Asynchronous write operations should not succeed because they will
>>> all fail on session expiry.
>>>
>>> We cannot adopt 1 using Curator because it doesn't expose the latch
>>> path (that was added recently, but not in the version we use); we
>>> cannot adopt 2 using Curator because, although we have to retry on
>>> connection loss, we don't want to retry on session expiry. Curator
>>> always creates a new client on session expiry and retries the
>>> operation.
>>>
>>> I'd like to learn from the ZooKeeper community: 1. is there any
>>> potential risk if we eventually adopt option 1 or option 2? 2. is
>>> there any other solution we can adopt?
>>>
>>> Best,
>>> tison.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10333
>>> [2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
>>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
>>> [4] https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
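[For completeness, the latch-path guard of option 1 above might look
like the following minimal sketch. This is not the actual Flink
implementation in [4]; `latchPath` stands for the ephemeral sequential
znode memorized at the moment this contender became leader.]

    import java.util.Arrays;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class LatchGuardedWrites {
        private final ZooKeeper zk;
        private final String latchPath; // memorized on becoming leader

        LatchGuardedWrites(ZooKeeper zk, String latchPath) {
            this.zk = zk;
            this.latchPath = latchPath;
        }

        // Op.check with version -1 matches any version, so it acts as a
        // pure existence check: if the ephemeral latch is gone, the multi
        // fails with NoNode and the write is never committed. This covers
        // the full-GC retry scenario (steps 1-5) in the message above.
        void guardedCreate(String path, byte[] data)
                throws KeeperException, InterruptedException {
            zk.multi(Arrays.asList(
                    Op.check(latchPath, -1),
                    Op.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT)));
        }
    }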