From: Jordan Zimmerman <jordan@jordanzimmerman.com>
Subject: Re: Leader Latch question
Date: Wed, 17 Aug 2016 15:21:55 -0500
To: user@curator.apache.org

No - notLeader() will not get called automatically when there’s a network partition. Please see:

http://curator.apache.org/errors.html
and
http://curator.apache.org/curator-recipes/leader-latch.html - Error Handling
-Jordan
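As a rough sketch of what this means in practice (illustrative only; "client" is assumed to be an already-started CuratorFramework alongside the LeaderLatch, and becomeStandby() is a hypothetical application hook, not Curator API):

// ConnectionStateListener and ConnectionState live in org.apache.curator.framework.state
client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
    @Override
    public void stateChanged(CuratorFramework cf, ConnectionState newState) {
        if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
            // Per the note above, do not wait for notLeader() during a partition;
            // treat SUSPENDED/LOST itself as loss of leadership until RECONNECTED.
            becomeStandby();   // hypothetical application hook
        }
    }
});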

On Aug 17, 2016, at 3:14 PM, Steve Boyle <sboyle@connexity.com> wrote:

I should note that we are using version 2.9.1. I believe we rely on Curator to handle the Lost and Suspended cases; it looks like we’d expect calls to leaderLatchListener.isLeader and leaderLatchListener.notLeader. We’ve never seen long GCs with this app; I’ll start logging that.

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:23 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question

* How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST?
* Was there possibly a very long gc? See https://cwiki.apache.org/confluence/display/CURATOR/TN10

-Jordan
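A first step toward answering both questions is simply to log every connection state transition; a long silent gap right before a SUSPENDED or LOST entry is a good hint that the JVM was paused (compare against the GC log per TN10). A minimal sketch, assuming "client" is the app's started CuratorFramework and "log" is an SLF4J logger:

client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
    @Override
    public void stateChanged(CuratorFramework cf, ConnectionState newState) {
        // The logging framework timestamps this line; correlate SUSPENDED/LOST
        // entries with pauses in the GC log to confirm or rule out long GCs.
        log.warn("Curator connection state changed to {}", newState);
    }
});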
 
On Aug 17, 2016, at 1:07 PM, Steve Boyle <sboyle@connexity.com> wrote:

I appreciate your response. Any thoughts on how the issue may have occurred in production? Or thoughts on how to reproduce that scenario?

In the production case, there were two instances of the app – both configured for a list of 5 zookeepers.

Thanks,
Steve

From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:03 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question

Manual removal of the latch node isn’t supported. It would require the latch to add a watch on its own node and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens.

-Jordan
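A sketch of that "exit your latch" approach: relinquish the active role as soon as the connection is suspended or lost, then leave and re-join the election once the connection comes back. Everything here other than the Curator calls (the latch field, relinquishActiveDuties(), appLatchListener, LATCH_PATH, log) is hypothetical application code, and this is only one possible arrangement:

// Assumes a "volatile LeaderLatch latch" field plus a started CuratorFramework "client".
client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
    @Override
    public void stateChanged(CuratorFramework cf, ConnectionState newState) {
        try {
            if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                relinquishActiveDuties();                 // hypothetical: stop acting as leader immediately
            } else if (newState == ConnectionState.RECONNECTED) {
                latch.close();                            // exit the old latch...
                latch = new LeaderLatch(cf, LATCH_PATH);  // ...and re-enter the election
                latch.addListener(appLatchListener);      // hypothetical LeaderLatchListener instance
                latch.start();
            }
        } catch (Exception e) {
            log.error("failed to restart leader latch", e);  // hypothetical logger
        }
    }
});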
 
On Aug 17, 2016, at 12:43 PM, Steve Boyle <sboyle@connexity.com> wrote:

I’m using the Leader Latch recipe. I can successfully bring up two instances of my app and have one become ‘active’ and one become ‘standby’. Most everything works as expected. We had an issue in production: when adding a zookeeper to our existing quorum, both instances of the app became ‘active’. Unfortunately, the log files rolled over before we could check for exceptions.

I’ve been trying to reproduce this issue in a test environment. In my test environment, I have two instances of my app configured to use a single zookeeper – this zookeeper is part of a 5-node quorum and is not currently the leader. I can trigger both instances of the app to become ‘active’ if I use zkCli and manually delete the latch path from the single zookeeper to which my apps are connected. When I manually delete the latch path, I can see via debug logging that the instance that was previously ‘standby’ gets a notification from zookeeper: “Got WatchedEvent state:SyncConnected type:NodeDeleted”. However, the instance that had already been active gets no notification at all. Is it expected that manually removing the latch path would only generate notifications to some instances of my app?

Thanks,
Steve Boyle
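For reference, a minimal, self-contained sketch of the kind of setup described in this thread (Curator 2.x API; the connect string, latch path, and println placeholders are made up for illustration):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LeaderLatchExample {
    public static void main(String[] args) throws Exception {
        // Placeholder connect string; the thread describes pointing at a single
        // member of a 5-node quorum, but listing all members is the usual setup.
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        LeaderLatch latch = new LeaderLatch(client, "/myapp/leader-latch");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                System.out.println("became active");   // placeholder for real work
            }

            @Override
            public void notLeader() {
                System.out.println("became standby");  // placeholder for real work
            }
        });
        latch.start();

        Thread.sleep(Long.MAX_VALUE);  // keep the process alive for the demo
    }
}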