From: Jordan Zimmerman <jordan@jordanzimmerman.com>
To: Zili Chen
Cc: user@curator.apache.org, user@zookeeper.apache.org
Subject: Re: Leader election and leader operation based on zookeeper
Date: Tue, 1 Oct 2019 12:06:15 -0500

Yes, I think this is a hole. As I've thought more about it, I think the
method you described using the lock node in the transaction is actually
the best.

-JZ

> On Sep 29, 2019, at 11:41 PM, Zili Chen wrote:
>
> Hi Jordan,
>
> Here is a possible edge case of the coordination-node approach.
>
> When an instance becomes leader it:
> 1. Gets the version of the coordination ZNode.
> 2. Sets the data for that ZNode (the contents don't matter) using the
>    retrieved version number.
> 3. If the set succeeds, you can be assured you are currently leader
>    (otherwise release leadership and re-contend).
> 4. Saves the new version.
>
> However, an instance becoming leader and then getting the version of
> the coordination znode is NOT atomic. So an edge case is:
>
> 1. instance-1 becomes leader and tries to get the version of the
>    coordination znode.
> 2. instance-2 becomes leader and updates the coordination znode.
> 3. instance-1 gets the newer version and re-updates the coordination
>    znode.
>
> Generally speaking, this happens when instance-1 suffers a session
> expiry; since Curator retries on session expiry, the case above is
> possible. instance-2 will be misled into believing it is not the
> leader and will give up leadership, so the algorithm can proceed, but
> instance-1 is only notified asynchronously that it is not the leader,
> and before that notification arrives instance-1 may already have
> performed some operations.
>
> Curator should ensure, with some synchronization logic, that
> instance-1 does not regard itself as the leader. Or just use a cached
> leader latch path for the check, because the latch path observed at
> the moment of becoming leader is guaranteed to be the right one. To be
> clear, by leader latch path I don't mean the volatile field, but the
> one cached when the instance becomes leader.
>
> Best,
> tison.
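[For reference, the coordination-node protocol quoted above could be
sketched against the raw ZooKeeper Java API roughly as follows. This is
a minimal sketch, not the Curator API; the class name and znode path are
made up for illustration.]

    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.ZooKeeper;
    import org.apache.zookeeper.data.Stat;

    // Illustrative only: COORD_PATH is a hypothetical, pre-created
    // persistent znode.
    class CoordinationNodeFence {
        private static final String COORD_PATH = "/leader/coordination";
        private final ZooKeeper zk;
        private int savedVersion = -1;  // version saved on becoming leader

        CoordinationNodeFence(ZooKeeper zk) { this.zk = zk; }

        // Called when the election recipe reports leadership. Returns
        // false if the conditional set loses, i.e. release leadership
        // and re-contend.
        boolean claimLeadership() throws KeeperException, InterruptedException {
            Stat stat = new Stat();
            zk.getData(COORD_PATH, false, stat);            // step 1: read version
            try {
                Stat after = zk.setData(COORD_PATH, new byte[0],
                                        stat.getVersion()); // step 2: conditional set
                savedVersion = after.getVersion();          // step 4: save new version
                return true;                                // step 3: set succeeded
            } catch (KeeperException.BadVersionException e) {
                return false;
            }
        }
        // Note tison's edge case: if the getData above runs only after a
        // new leader's setData (e.g. this instance was paused across a
        // session expiry), it reads the *newer* version and the
        // conditional set still succeeds.
    }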
> On Sun, Sep 22, 2019 at 2:43 AM, Zili Chen wrote:
>
> > the Curator recipes delete and recreate their paths
>
> However, as mentioned above, we do a one-shot election (it doesn't
> reuse the Curator recipe), so we can check that the latch path is
> always the path from the epoch in which the contender became leader.
> You can check out an implementation of the design here[1]. Even if we
> want to enable re-contending, we can set a guard
>
> (change state -> track latch path)
>
> and check both that the state is LEADING and that the path exists (so
> we aren't misled into checking a wrong path).
>
> Checking the version of a coordination znode sounds like another valid
> solution. I'd be glad to see it in a future Curator version, and if
> there is a ticket I can help dig into it a bit :-)
>
> Best,
> tison.
>
> [1] https://github.com/TisonKun/flink/blob/ad51edbfccd417be1b5a1f136e81b0b77401c43a/flink-runtime/src/main/java/org/apache/flink/runtime/leaderelection/ZooKeeperLeaderElectionServiceNG.java
>
> On Sun, Sep 22, 2019 at 2:31 AM, Jordan Zimmerman wrote:
>
> The issue is that the leader path doesn't stay constant. Every time
> there is a network partition, etc., the Curator recipes delete and
> recreate their paths. So I'm concerned that client code trying to keep
> track of the leader path would be error-prone (it's one reason the
> paths aren't public - they're volatile internal state).
>
> -Jordan
>
>> On Sep 21, 2019, at 1:26 PM, Zili Chen wrote:
>>
>> Hi Jordan,
>>
>> > I think using the leader path may not work
>>
>> Could you share a situation where this strategy does not work? In the
>> design we do the leader contending one-shot, and when performing a
>> transaction we check both the existence of the latch path and that we
>> are in state LEADING.
>>
>> Given that the election algorithm works, the state transitions to
>> LEADING once the contender's latch path has become the smallest
>> sequential znode. So the existence of the latch path guards against
>> anybody else having become leader.
>>
>> On Sun, Sep 22, 2019 at 12:58 AM, Jordan Zimmerman wrote:
>>
>> Yeah, Ted - I think this is basically the same thing. We should all
>> try to poke holes in this.
>>
>> -JZ
>>
>>> On Sep 21, 2019, at 11:54 AM, Ted Dunning wrote:
>>>
>>> I would suggest that using an epoch number stored in ZK might be
>>> helpful. Every operation that the master takes could be made
>>> conditional on the epoch number using a multi-transaction.
>>>
>>> Unfortunately, as you say, you have to make the update of the epoch
>>> atomic with becoming leader.
>>>
>>> The natural way to do this is to have an update of an epoch file be
>>> part of the leader election, but that probably isn't possible using
>>> Curator. The way I would tend to do it would be to have a persistent
>>> file that is updated atomically as part of leader election. The
>>> version of that persistent file could then be used as the epoch
>>> number. All updates to files that are gated on the epoch number
>>> would only proceed if no other master has been elected, at least if
>>> you use the sync option.
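[Ted's epoch-gated write could be sketched with a ZooKeeper
multi-transaction roughly as below. This assumes `epochVersion` was
somehow captured atomically at election time - the hard part he points
out - and a pre-created persistent epoch znode; names are illustrative.]

    import java.util.Arrays;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class EpochGatedWrites {
        private static final String EPOCH_PATH = "/leader/epoch"; // illustrative
        private final ZooKeeper zk;
        private final int epochVersion; // version of EPOCH_PATH at election time

        EpochGatedWrites(ZooKeeper zk, int epochVersion) {
            this.zk = zk;
            this.epochVersion = epochVersion;
        }

        // Every master write is bundled with a version check on the epoch
        // node; if another master was elected (bumping the epoch), the
        // whole transaction fails with BadVersion and nothing is committed.
        void gatedCreate(String path, byte[] data)
                throws KeeperException, InterruptedException {
            zk.multi(Arrays.asList(
                    Op.check(EPOCH_PATH, epochVersion),
                    Op.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT)));
        }
    }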
>>> On Fri, Sep 20, 2019 at 1:31 AM, Zili Chen wrote:
>>>
>>> Hi ZooKeepers,
>>>
>>> Recently there has been an ongoing refactor[1] in the Flink
>>> community aimed at overcoming several inconsistent-state issues on
>>> ZK that we have met. I'm writing to share our design for leader
>>> election and leader operation. A leader operation is an operation
>>> that should be committed only if the contender is the leader. I also
>>> CC the Curator mailing list because this mail contains the reason
>>> why we cannot JUST use Curator.
>>>
>>> The rule we want to keep is:
>>>
>>> **Writes on ZK must be committed only if the contender is the leader**
>>>
>>> We represent a contender by an individual ZK client. At the moment
>>> we use Curator for leader election, so the algorithm is the same as
>>> the optimized version on this page[2].
>>>
>>> The problem is that this algorithm only takes care of leader
>>> election and is indifferent to subsequent operations. Consider the
>>> scenario below:
>>>
>>> 1. contender-1 becomes the leader
>>> 2. contender-1 proposes a create txn-1
>>> 3. the sender thread is suspended by a full GC
>>> 4. contender-1 loses leadership and contender-2 becomes the leader
>>> 5. contender-1 recovers from the full GC; before it reacts to the
>>>    revoke-leadership event, txn-1 is retried and sent to ZK
>>>
>>> Without another guard, the txn will succeed on ZK, so contender-1
>>> commits a write operation even though it is no longer the leader.
>>> This issue is also documented in this note[3].
>>>
>>> To overcome this issue, instead of just saying that we're
>>> unfortunate, we drafted two possible solutions.
>>>
>>> The first is documented here[4]. Briefly, when the contender becomes
>>> the leader, we memorize the latch path at that moment. For
>>> subsequent operations, we issue a transaction that first checks the
>>> existence of the latch path. Leadership is only switched if the
>>> latch is gone, and all operations will fail once the latch is gone.
>>>
>>> The second is still rough. Basically it relies on the session-expiry
>>> mechanism in ZK. We would adopt the unoptimized version of the
>>> recipe[2], given that in our scenario there are only a few
>>> contenders at a time. We create the /leader node as an ephemeral
>>> znode holding the leader information, and when the session expires
>>> we consider leadership revoked and terminate the contender.
>>> Asynchronous write operations should not succeed because they will
>>> all fail on session expiry.
>>>
>>> We cannot adopt 1 using Curator because it doesn't expose the latch
>>> path (that was added recently, but not in the version we use); we
>>> cannot adopt 2 using Curator because, although we have to retry on
>>> connection loss, we don't want to retry on session expiry. Curator
>>> always creates a new client on session expiry and retries the
>>> operation.
>>>
>>> I'd like to learn from the ZooKeeper community: 1. is there any
>>> potential risk if we eventually adopt option 1 or option 2? 2. is
>>> there any other solution we can adopt?
>>>
>>> Best,
>>> tison.
>>>
>>> [1] https://issues.apache.org/jira/browse/FLINK-10333
>>> [2] https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection
>>> [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10
>>> [4] https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-nmWrCo/edit?usp=sharing
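[For completeness, the latch-path guard of option 1 above might look
like the following minimal sketch. This is not the actual Flink
implementation in [4]; `latchPath` stands for the ephemeral sequential
znode memorized at the moment this contender became leader.]

    import java.util.Arrays;

    import org.apache.zookeeper.CreateMode;
    import org.apache.zookeeper.KeeperException;
    import org.apache.zookeeper.Op;
    import org.apache.zookeeper.ZooDefs;
    import org.apache.zookeeper.ZooKeeper;

    class LatchGuardedWrites {
        private final ZooKeeper zk;
        private final String latchPath; // memorized on becoming leader

        LatchGuardedWrites(ZooKeeper zk, String latchPath) {
            this.zk = zk;
            this.latchPath = latchPath;
        }

        // Op.check with version -1 matches any version, so it acts as a
        // pure existence check: if the ephemeral latch is gone, the multi
        // fails with NoNode and the write is never committed. This covers
        // the full-GC retry scenario (steps 1-5) in the message above.
        void guardedCreate(String path, byte[] data)
                throws KeeperException, InterruptedException {
            zk.multi(Arrays.asList(
                    Op.check(latchPath, -1),
                    Op.create(path, data, ZooDefs.Ids.OPEN_ACL_UNSAFE,
                              CreateMode.PERSISTENT)));
        }
    }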