Date: Thu, 22 May 2014 07:39:48 -0500
From: Jordan Zimmerman
To: stibi, user@curator.apache.org
Subject: Re: Sometimes leader election ends up in two leaders

> What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client #2 to reacquire the mutex using zNode 241?

Because that's how the lock works. As long as 241 exists, no other client will consider itself as having the mutex.

> reacquire the mutex using zNode 241?

This is not what happens. The client will try to acquire using a _different_ znode. Are you thinking that 241 is re-used? It's not.

-JZ

From: stibi sulyan.tibor@gmail.com
Reply: stibi sulyan.tibor@gmail.com
Date: May 22, 2014 at 7:26:57 AM
To: Jordan Zimmerman jordan@jordanzimmerman.com, user@curator.apache.org
Subject: Re: Sometimes leader election ends up in two leaders

Hi!

Thanks for the quick response. About this step:

-- Time N + D2 --
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again.
At this point, there can be 2, 3 or 4 nodes depending.
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243

Neither of the instances will achieve leadership until the nodes 240/241 are deleted.

What guarantees that zNode 241 will be deleted prior to the (successful) attempt of client #2 to reacquire the mutex using zNode 241?

AFAIK, node deletion is a background operation and a retry policy controls how often a deletion attempt will occur (even for guaranteed deletes). Unlucky timing can lead to a situation where deletion of zNode 241 happens after the mutex acquisition. In this case the mutex is not released by the leader, but since the zNodes are deleted, the other client will also be elected as leader.

Thanks,
Tibor

On Thu, May 15, 2014 at 3:37 AM, Jordan Zimmerman wrote:

I don't think the situation you describe can happen. Let's walk through this:

-- Time N --
We have a single, correct leader and 2 nodes:
lock-0000000240
lock-0000000241

-- Time N + D1 --
ZooKeeper leader instance is restarted. Shortly thereafter, both Curator clients will exit their doWork() loops and mark their nodes for deletion. Due to the failed connection, though, there are still the 2 nodes:
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)

-- Time N + D2 --
The ZooKeeper quorum is repaired and the nodes start a doWork() loop again. At this point, there can be 2, 3 or 4 nodes depending.
lock-0000000240 (waiting to be deleted)
lock-0000000241 (waiting to be deleted)
lock-0000000242
lock-0000000243

Neither of the instances will achieve leadership until the nodes 240/241 are deleted.

Of course, there may be something else that's causing you to see 2 leaders.
A while back I discovered that rolling config changes can do it (http://zookeeper-user.578899.n2.nabble.com/Rolling-config-change-considered-harmful-td7578761.html). Or, there's something else going on in Curator.

-Jordan

From: stibi sulyan.tibor@gmail.com
Reply: user@curator.apache.org
Date: May 14, 2014 at 11:39:48 AM
To: user@curator.apache.org
Subject: Sometimes leader election ends up in two leaders

Hi!

I'm using Curator's Leader Election recipe (2.4.2) and found a very hard-to-reproduce issue which could lead to a situation where both clients become leader.

Let's say 2 clients are competing for leadership: client #1 is currently the leader, and ZooKeeper maintains the following structure under the leaderPath:

/leaderPath
  |- _c_a8524f0b-3bd7-4df3-ae19-cef11159a7a6-lock-0000000240 (client #1)
  |- _c_b5bdc75f-d2c9-4432-9d58-1f7fe699e125-lock-0000000241 (client #2)

The autoRequeue flag is set to true for both clients.

Let's trigger a leader election by restarting the ZooKeeper leader.
When this happens, both clients will lose the connection to the ZooKeeper ensemble and will try to re-acquire the LeaderSelector's mutex. Eventually (after the negotiated session timeout) the ephemeral zNodes under /leaderPath will be deleted.

The problem occurs when ephemeral zNode deletions interleave with mutex acquisition.

Client #1 can observe that both zNodes (240 and 241) are already deleted and that /leaderPath has no children, so it acquires the mutex successfully.
On the other hand, client #2 can observe that both zNodes still exist, so it starts to watch zNode #240 (LockInternals.internalLockLoop():315). In a short period of time the watcher will be notified about the zNode's deletion, so client #2 reenters LockInternals.internalLockLoop().
What is really strange is that the getSortedChildren() call in LockInternals:284 can still return zNode #241, so it will succeed in acquiring the mutex (LockInternals:287).

The result is two clients, both leaders, but /leaderPath contains only one zNode for client #1.

Did you encounter similar problems before? Do you have any ideas on how to prevent such race conditions? I can think of a solution: the leader should watch its zNode under /leaderPath and interrupt leadership when the zNode gets deleted.

Thank you,
Tibor
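
For readers following the thread, the ownership rule Jordan relies on ("As long as 241 exists, no other client will consider itself as having the mutex") can be modelled in a few lines of plain Java. This is only a sketch under assumptions, not Curator's code: the class LockOrderSketch and its methods are hypothetical names, the prefixes "c1"/"c2" stand in for the _c_<uuid> prefixes shown above, and the list of children is an in-memory stand-in for what ZooKeeper would return. It assumes the lock goes to whichever node has the lowest sequence number and that a waiting client watches the node just ahead of its own, which matches the description of client #2 watching zNode #240.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical model of the "lowest sequence number wins" rule from this thread.
// It does not talk to ZooKeeper and is not taken from Curator's source.
final class LockOrderSketch {

    private static final String MARKER = "-lock-";

    // Extracts the numeric suffix of a name such as "c1-lock-0000000240".
    static int sequenceOf(String nodeName) {
        int idx = nodeName.lastIndexOf(MARKER);
        if (idx < 0) {
            throw new IllegalArgumentException("not a lock node: " + nodeName);
        }
        return Integer.parseInt(nodeName.substring(idx + MARKER.length()));
    }

    // Returns a copy of the children ordered by sequence number, ignoring the prefix.
    static List<String> sortedChildren(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(LockOrderSketch::sequenceOf));
        return sorted;
    }

    // A client holds the mutex only if its own node is first in the sorted list.
    static boolean holdsLock(List<String> children, String ourNode) {
        List<String> sorted = sortedChildren(children);
        return !sorted.isEmpty() && sorted.get(0).equals(ourNode);
    }

    // The node a waiting client would watch: the one just ahead of its own.
    // Returns null when our node is first in line or is not in the list at all.
    static String nodeToWatch(List<String> children, String ourNode) {
        List<String> sorted = sortedChildren(children);
        int i = sorted.indexOf(ourNode);
        return i > 0 ? sorted.get(i - 1) : null;
    }

    public static void main(String[] args) {
        // State at "Time N + D2": the old nodes 240/241 are still present.
        List<String> children = new ArrayList<>(Arrays.asList(
                "c1-lock-0000000240", "c2-lock-0000000241",
                "c1-lock-0000000242", "c2-lock-0000000243"));
        System.out.println("242 holds lock: "
                + holdsLock(children, "c1-lock-0000000242"));

        // Once the pending deletes of 240/241 go through, 242 is first in line.
        children.remove("c1-lock-0000000240");
        children.remove("c2-lock-0000000241");
        System.out.println("242 holds lock after cleanup: "
                + holdsLock(children, "c1-lock-0000000242"));
    }
}
```

In this model a client that creates a fresh node (242 or 243) cannot win while 240 or 241 is still listed, which is the point of Jordan's walk-through. The model does not cover the case Tibor reports, where a client's own node has already been deleted but still appears in the child list it reads.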