Date: Mon, 28 Dec 2015 15:14:49 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: dev@curator.apache.org
Reply-To: dev@curator.apache.org
Subject: [jira] [Commented] (CURATOR-209) Background retry falls into infinite loop of reconnection after connection loss

    [ https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15072800#comment-15072800 ]

ASF GitHub Bot commented on CURATOR-209:
----------------------------------------

GitHub user Randgalt opened a pull request:

    https://github.com/apache/curator/pull/120

    [CURATOR-209] Better handling of background errors

    1. Don't queue background operation if the client is closed
    2. Moved findAndDeleteProtectedNodeInBackground code into separate operation that is processed through the standard Curator background code. This way, retries are applied (with sleep), etc. In the previous implementation, errors caused the background check to be run immediately and infinitely.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/apache/curator CURATOR-209

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/curator/pull/120.patch

To close this pull request, make a commit to your master/trunk branch with (at least) the following in the commit message:

    This closes #120

----
commit 9b68e19a278e025fa5884445a2b2519463b57445
Author: randgalt
Date:   2015-12-28T15:08:51Z

    Moved findAndDeleteProtectedNodeInBackground code into separate operation that is processed through the standard Curator background code. This way, retries are applied (with sleep), etc. In the previous implementation, errors caused the background check to be run immediately and infinitely.
commit 8dff2d7cf69f21fdc42e31fb33feed990915fcc7
Author: randgalt
Date:   2015-12-28T15:11:55Z

    Don't queue background operation if the client is closed

----


> Background retry falls into infinite loop of reconnection after connection loss
> --------------------------------------------------------------------------------
>
>                 Key: CURATOR-209
>                 URL: https://issues.apache.org/jira/browse/CURATOR-209
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.6.0
>         Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on AWS EC2 in a 3 box ensemble
>            Reporter: Ryan Anderson
>            Priority: Critical
>              Labels: connectionloss, loop, reconnect
>
> We've been unable to replicate this in our test environments, but approximately once a week in production (~50 machine cluster using curator/zk for service discovery) we will get a machine falling into a loop and spewing tens of thousands of errors that look like:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
> {code}
> The rate at which we get these errors seems to increase linearly until we stop the process (starts at 10-20/sec, when we kill the box it's typically generating 1,000+/sec)
> When the error first occurs, there's a slightly different stack trace:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> followed very closely by:
> {code}
> Background retry gave up
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> 	at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> 	at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> After which it begins spewing the stack trace I first posted above. We're assuming that some sort of networking hiccup is occurring in EC2 that's causing the ConnectionLoss, which seems entirely momentary (none of our other boxes see it, and when we check the box it can connect to all the zk servers without any issues.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
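
[Editor's note] The approach described in the pull request above can be pictured with a small, self-contained Java sketch. This is not Curator's actual code; the class and member names (BackgroundRetrySketch, queueOperation, maxRetries, sleepBetweenRetriesMs) are illustrative assumptions. It only shows the two ideas from the PR: refusing to queue background work once the client is closed, and routing the protected-node cleanup through the same bounded retry-with-sleep loop as any other background operation instead of retrying immediately and forever on error.

{code}
import java.util.concurrent.BlockingQueue;
import java.util.concurrent.LinkedBlockingQueue;
import java.util.concurrent.TimeUnit;
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical sketch; names and structure are illustrative, not Curator internals.
public class BackgroundRetrySketch {
    interface BackgroundOperation {
        void perform() throws Exception;
    }

    private final AtomicBoolean closed = new AtomicBoolean(false);
    private final BlockingQueue<BackgroundOperation> queue = new LinkedBlockingQueue<>();
    private final int maxRetries = 3;
    private final long sleepBetweenRetriesMs = 1000;

    // Idea #1: refuse to enqueue work after close() -- previously an error
    // callback arriving after shutdown could keep re-queuing work forever.
    void queueOperation(BackgroundOperation op) {
        if (closed.get()) {
            return; // client closed: drop the operation instead of queuing it
        }
        queue.offer(op);
    }

    // Idea #2: the protected-node cleanup is just another queued operation,
    // so its failures go through the same bounded retry-with-sleep path.
    void findAndDeleteProtectedNodeInBackground(String protectedNodePath) {
        queueOperation(() -> deleteIfExists(protectedNodePath));
    }

    // Background loop: each operation gets a bounded number of retries,
    // sleeping between attempts rather than retrying in a tight loop.
    void backgroundOperationsLoop() throws InterruptedException {
        while (!closed.get()) {
            BackgroundOperation op = queue.poll(1, TimeUnit.SECONDS);
            if (op == null) {
                continue;
            }
            for (int attempt = 0; attempt <= maxRetries; attempt++) {
                try {
                    op.perform();
                    break;
                } catch (Exception e) {
                    if (attempt == maxRetries) {
                        System.err.println("Background operation retry gave up: " + e);
                    } else {
                        Thread.sleep(sleepBetweenRetriesMs);
                    }
                }
            }
        }
    }

    void close() {
        closed.set(true);
    }

    private void deleteIfExists(String path) throws Exception {
        // placeholder for the actual ZooKeeper delete of the protected node
    }
}
{code}

Because a failed cleanup is re-queued through the retry policy (with sleep) instead of being re-invoked directly from the error callback, a momentary ConnectionLoss can no longer turn into the ever-accelerating error storm described in the report above.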