From user-return-12145-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org Sat Sep 21 16:58:40 2019 Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id B511C180642 for ; Sat, 21 Sep 2019 18:58:39 +0200 (CEST) Received: (qmail 37322 invoked by uid 500); 21 Sep 2019 16:58:38 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 37310 invoked by uid 99); 21 Sep 2019 16:58:38 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd4-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Sat, 21 Sep 2019 16:58:38 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd4-us-west.apache.org (ASF Mail Server at spamd4-us-west.apache.org) with ESMTP id D12E4C22B7 for ; Sat, 21 Sep 2019 16:58:37 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd4-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: 2.002 X-Spam-Level: ** X-Spam-Status: No, score=2.002 tagged_above=-999 required=6.31 tests=[DKIM_SIGNED=0.1, DKIM_VALID=-0.1, HTML_MESSAGE=2, RCVD_IN_DNSWL_NONE=-0.0001, RCVD_IN_MSPIKE_H2=-0.001, SPF_HELO_NONE=0.001, SPF_NONE=0.001, URIBL_BLOCKED=0.001] autolearn=disabled Authentication-Results: spamd4-us-west.apache.org (amavisd-new); dkim=pass (2048-bit key) header.d=jordanzimmerman-com.20150623.gappssmtp.com Received: from mx1-ec2-va.apache.org ([10.40.0.8]) by localhost (spamd4-us-west.apache.org [10.40.0.11]) (amavisd-new, port 10024) with ESMTP id tVNvVCpHYC3S for ; Sat, 21 Sep 2019 16:58:35 +0000 (UTC) Received-SPF: None (mailfrom) identity=mailfrom; client-ip=209.85.222.54; helo=mail-ua1-f54.google.com; envelope-from=jordan@jordanzimmerman.com; receiver= Received: from mail-ua1-f54.google.com (mail-ua1-f54.google.com [209.85.222.54]) by mx1-ec2-va.apache.org (ASF Mail Server at mx1-ec2-va.apache.org) with ESMTPS id 5E5AABC509 for ; Sat, 21 Sep 2019 16:58:35 +0000 (UTC) Received: by mail-ua1-f54.google.com with SMTP id n63so3185617uan.2 for ; Sat, 21 Sep 2019 09:58:35 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=jordanzimmerman-com.20150623.gappssmtp.com; s=20150623; h=from:message-id:mime-version:subject:date:in-reply-to:cc:to :references; bh=S74tMLS5Vf/eucvUspQLOgggAOpo4k6SrNYlVqFELZA=; b=S9PUElsL99JqB1E7DNnzI1UmMJxtdHxnNSFYE7KDHvU6np4PEDQDwJe0f1YHiaXjPD Yj9BEseGVd8gaVPgBLUwwIYmPnAbfYPSpmEjode3JkdGmFTP64hbfBpQ7akK6JmExF9e UMA0ByZbguxRf84PRiljuBPO+NKid5bl+z8gmElua9LzYiLhKW14bqqiBBWyMO9Ck9+Z JU+fl8ucDbRiRZTF2dS2VgRL8JjFQbuE2GVJG/bF9OtsW27egrpk44eolGnciXuRgMNS ROmENGJmIgcbLrvfT2BmeaYGSkdPm2IU2ut3BsfTGbRrRmYHcSRQjxZoSMj/ZD3h6jar BHyg== X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=1e100.net; s=20161025; h=x-gm-message-state:from:message-id:mime-version:subject:date :in-reply-to:cc:to:references; bh=S74tMLS5Vf/eucvUspQLOgggAOpo4k6SrNYlVqFELZA=; b=F9m5GHlXOK7C2k9YRPzKqCTJdx3rCJbAAlg+NOW58yoUfRuOWc4U2dYm7CS5XPRWQ8 Hq1uqFMkd+bnQTvJCPdARBAZF6IFmmTEYWXOX+6zWBN+W7ibYj7ACwVABc+zXGNYrFil VqI4mtidM5H87vvdTLdgwSTA0DnPMQkzH+WDFnXmaelQEMadm+i262iM7jDsj02oL0ki qrWgXpnPGNhOoGUChLjB1knZyPYsPaDuVGxLC8lerGJKrrPsq6Y93COOEDyNZvR1+dAL WxBH08DlrDyX+8/elJ58dI2g7knVC0r1sv6zJ2mbX8Jt3naujRg5QukKRgJPhXSWzXQM gWWw== X-Gm-Message-State: APjAAAX57mnQh3hHMjIivqWT6KQHyO6Ss6z7oY59mQ69uOPC/IRbWLfA X4n2//8hBvdp7O9+ZsYIZ9HR2ZTCMDsfjw== X-Google-Smtp-Source: APXvYqyQeNIp2A/zRYKO9jGpYXLSb/ixbfwewZpvlxBO8dhwkvoj6qSqH2NO394rNM/cDsNSb2pT+w== X-Received: by 2002:ab0:14c4:: with SMTP id f4mr12499238uae.46.1569085114689; Sat, 21 Sep 2019 09:58:34 -0700 (PDT) Received: from [10.0.1.5] ([186.72.212.38]) by smtp.gmail.com with ESMTPSA id i3sm1400888uai.20.2019.09.21.09.58.33 (version=TLS1_2 cipher=ECDHE-RSA-AES128-GCM-SHA256 bits=128/128); Sat, 21 Sep 2019 09:58:34 -0700 (PDT) From: Jordan Zimmerman Message-Id: <3D69F15F-9756-4FC8-8FB2-6BAEBC5CCF8A@jordanzimmerman.com> Content-Type: multipart/alternative; boundary="Apple-Mail=_0D915C13-CF53-41FD-B932-A73F4A6BC011" Mime-Version: 1.0 (Mac OS X Mail 12.4 \(3445.104.11\)) Subject: Re: Leader election and leader operation based on zookeeper Date: Sat, 21 Sep 2019 11:58:32 -0500 In-Reply-To: Cc: Zili Chen , user@zookeeper.apache.org To: user@curator.apache.org References: X-Mailer: Apple Mail (2.3445.104.11) --Apple-Mail=_0D915C13-CF53-41FD-B932-A73F4A6BC011 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=us-ascii Yeah, Ted - I think this is basically the same thing. We should all try = to poke holes in this. -JZ > On Sep 21, 2019, at 11:54 AM, Ted Dunning = wrote: >=20 >=20 > I would suggest that using an epoch number stored in ZK might be = helpful. Every operation that the master takes could be made conditional = on the epoch number using a multi-transaction. >=20 > Unfortunately, as you say, you have to have the update of the epoch be = atomic with becoming leader.=20 >=20 > The natural way to do this is to have an update of an epoch file be = part of the leader election, but that probably isn't possible using = Curator. The way I would tend to do it would be have a persistent file = that is updated atomically as part of leader election. The version of = that persistent file could then be used as the epoch number. All updates = to files that are gated on the epoch number would only proceed if no = other master has been elected, at least if you use the sync option. >=20 >=20 >=20 >=20 >=20 > On Fri, Sep 20, 2019 at 1:31 AM Zili Chen > wrote: > Hi ZooKeepers, >=20 > Recently there is an ongoing refactor[1] in Flink community aimed at > overcoming several inconsistent state issues on ZK we have met. I come > here to share our design of leader election and leader operation. For > leader operation, it is operation that should be committed only if the > contender is the leader. Also CC Curator mailing list because it also > contains the reason why we cannot JUST use Curator. >=20 > The rule we want to keep is >=20 > **Writes on ZK must be committed only if the contender is the leader** >=20 > We represent contender by an individual ZK client. At the moment we = use > Curator for leader election so the algorithm is the same as the > optimized version in this page[2]. >=20 > The problem is that this algorithm only take care of leader election = but > is indifferent to subsequent operations. Consider the scenario below: >=20 > 1. contender-1 becomes the leader > 2. contender-1 proposes a create txn-1 > 3. sender thread suspended for full gc > 4. contender-1 lost leadership and contender-2 becomes the leader > 5. contender-1 recovers from full gc, before it reacts to revoke > leadership event, txn-1 retried and sent to ZK. >=20 > Without other guard txn will success on ZK and thus contender-1 commit > a write operation even if it is no longer the leader. This issue is > also documented in this note[3]. >=20 > To overcome this issue instead of just saying that we're unfortunate, > we draft two possible solution. >=20 > The first is document here[4]. Briefly, when the contender becomes the > leader, we memorize the latch path at that moment. And for > subsequent operations, we do in a transaction first checking the > existence of the latch path. Leadership is only switched if the latch > gone, and all operations will fail if the latch gone. >=20 > The second is still rough. Basically it relies on session expire > mechanism in ZK. We will adopt the unoptimized version in the > recipe[2] given that in our scenario there are only few contenders > at the same time. Thus we create /leader node as ephemeral znode with > leader information and when session expired we think leadership is > revoked and terminate the contender. Asynchronous write operations > should not succeed because they will all fail on session expire. >=20 > We cannot adopt 1 using Curator because it doesn't expose the latch > path(which is added recently, but not in the version we use); we > cannot adopt 2 using Curator because although we have to retry on > connection loss but we don't want to retry on session expire. Curator > always creates a new client on session expire and retry the operation. >=20 > I'd like to learn from ZooKeeper community that 1. is there any > potential risk if we eventually adopt option 1 or option 2? 2. is > there any other solution we can adopt? >=20 > Best, > tison. >=20 > [1] https://issues.apache.org/jira/browse/FLINK-10333 = > [2] = https://zookeeper.apache.org/doc/current/recipes.html#sc_leaderElection = > [3] https://cwiki.apache.org/confluence/display/CURATOR/TN10 = > [4] = https://docs.google.com/document/d/1cBY1t0k5g1xNqzyfZby3LcPu4t-wpx57G1xf-n= mWrCo/edit?usp=3Dsharing = >=20 --Apple-Mail=_0D915C13-CF53-41FD-B932-A73F4A6BC011--