From: Jordan Zimmerman <jordan@jordanzimmerman.com>
Subject: Re: Leader Latch question
Date: Wed, 17 Aug 2016 15:23:17 -0500
To: user@curator.apache.org

I apologize - I was thinking of a different recipe. LeaderLatch does handle partitions internally. Maybe it's a GC pause?

On Aug 17, 2016, at 3:14 PM, Steve Boyle <sboyle@connexity.com> wrote:

I should note that we are using version 2.9.1. I believe we rely on Curator to handle the Lost and Suspended cases; it looks like we'd expect calls to leaderLatchListener.isLeader and leaderLatchListener.notLeader. We've never seen long GCs with this app; I'll start logging that.
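A minimal sketch of wiring up those callbacks, assuming Curator 2.x (the latch path and log messages below are illustrative, not taken from this thread):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.recipes.leader.LeaderLatchListener;

class LatchCallbacks {
    // 'client' must already be started; the path is illustrative
    static LeaderLatch startLatch(CuratorFramework client) throws Exception {
        LeaderLatch latch = new LeaderLatch(client, "/myapp/leader");
        latch.addListener(new LeaderLatchListener() {
            @Override
            public void isLeader() {
                // invoked when this instance is elected - switch the app to 'active'
                System.out.println("became leader");
            }

            @Override
            public void notLeader() {
                // invoked when leadership is lost - switch the app back to 'standby'
                System.out.println("lost leadership");
            }
        });
        latch.start();
        return latch;
    }
}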
 
Thanks,
Steve
 
From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:23 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question
 
* How do you handle CONNECTION_SUSPENDED and CONNECTION_LOST?
* Was there possibly a very long GC? See https://cwiki.apache.org/confluence/display/CURATOR/TN10
 
-Jordan
 
On Aug 17, 2016, at 1:07 PM, Steve Boyle <sboyle@connexity.com> wrote:
 
I appreciate your response. Any thoughts on how the issue may have occurred in production? Or thoughts on how to reproduce that scenario?
 
In the production case, there were two instances of the app – both configured with a list of 5 ZooKeeper servers.
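For reference, creating a client against a multi-server ensemble and starting a latch looks roughly like the sketch below (the host names, latch path, and retry policy are illustrative assumptions, not taken from this thread):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class LatchSetup {
    public static void main(String[] args) throws Exception {
        // Connect string listing all five members of the ensemble
        CuratorFramework client = CuratorFrameworkFactory.newClient(
                "zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181",
                new ExponentialBackoffRetry(1000, 3));
        client.start();

        // Both app instances use the same latch path; only one becomes leader
        LeaderLatch latch = new LeaderLatch(client, "/myapp/leader");
        latch.start();

        latch.await();   // blocks until this instance is elected leader
        System.out.println("hasLeadership=" + latch.hasLeadership());

        latch.close();   // step down / clean up
        client.close();
    }
}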
 
Thanks,
Steve
 
From: Jordan Zimmerman [mailto:jordan@jordanzimmerman.com]
Sent: Wednesday, August 17, 2016 11:03 AM
To: user@curator.apache.org
Subject: Re: Leader Latch question
 
Manual removal of the latch node isn't supported. It would require the latch to add a watch on its own node, and that has performance/runtime overhead. The recommended behavior is to watch for connection loss/suspended events and exit your latch when that happens.
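A minimal sketch of that pattern, assuming Curator 2.x (whether you close the latch or merely demote the app on SUSPENDED is an application choice; the names here are illustrative):

import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.recipes.leader.LeaderLatch;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;
import org.apache.curator.utils.CloseableUtils;

class LossAwareLatch {
    static void watchConnection(CuratorFramework client, final LeaderLatch latch) {
        client.getConnectionStateListenable().addListener(new ConnectionStateListener() {
            @Override
            public void stateChanged(CuratorFramework c, ConnectionState newState) {
                if (newState == ConnectionState.SUSPENDED || newState == ConnectionState.LOST) {
                    // Stop acting as leader right away; here we simply close the latch
                    CloseableUtils.closeQuietly(latch);
                }
            }
        });
    }
}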
 
-Jordan
 
On Aug 17, 2016, at 12:43 PM, Steve Boyle <sboyle@connexity.com> wrote:
 
I'm using the Leader Latch recipe. I can successfully bring up two instances of my app and have one become 'active' and one become 'standby'. Almost everything works as expected. We had an issue in production: when adding a ZooKeeper to our existing quorum, both instances of the app became 'active'. Unfortunately, the log files rolled over before we could check for exceptions. I've been trying to reproduce this issue in a test environment.

In my test environment, I have two instances of my app configured to use a single ZooKeeper – this ZooKeeper is part of a 5-node quorum and is not currently the leader. I can trigger both instances of the app to become 'active' if I use zkCli and manually delete the latch path from the single ZooKeeper to which my apps are connected. When I manually delete the latch path, I can see via debug logging that the previously 'standby' instance gets a notification from ZooKeeper: "Got WatchedEvent state:SyncConnected type:NodeDeleted". However, the instance that had already been active gets no notification at all. Is it expected that manually removing the latch path would only generate notifications to some instances of my app?
 
Thanks,
Steve Boyle