Return-Path: X-Original-To: apmail-curator-dev-archive@minotaur.apache.org Delivered-To: apmail-curator-dev-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 2982E186EC for ; Sun, 10 Jan 2016 23:39:40 +0000 (UTC) Received: (qmail 11039 invoked by uid 500); 10 Jan 2016 23:39:40 -0000 Delivered-To: apmail-curator-dev-archive@curator.apache.org Received: (qmail 10970 invoked by uid 500); 10 Jan 2016 23:39:40 -0000 Mailing-List: contact dev-help@curator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@curator.apache.org Delivered-To: mailing list dev@curator.apache.org Received: (qmail 10945 invoked by uid 99); 10 Jan 2016 23:39:39 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Sun, 10 Jan 2016 23:39:39 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id C66162C14F0 for ; Sun, 10 Jan 2016 23:39:39 +0000 (UTC) Date: Sun, 10 Jan 2016 23:39:39 +0000 (UTC) From: "ASF GitHub Bot (JIRA)" To: dev@curator.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (CURATOR-209) Background retry falls into infinite loop of reconnection after connection loss MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 [ https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15091289#comment-15091289 ] ASF GitHub Bot commented on CURATOR-209: ---------------------------------------- Github user cammckenzie commented on the pull request: https://github.com/apache/curator/pull/120#issuecomment-170406393 Looks good to me > Background retry falls into infinite loop of reconnection after connection loss > ------------------------------------------------------------------------------- > > Key: CURATOR-209 > URL: https://issues.apache.org/jira/browse/CURATOR-209 > Project: Apache Curator > Issue Type: Bug > Components: Framework > Affects Versions: 2.6.0 > Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on AWS EC2 in a 3 box ensemble > Reporter: Ryan Anderson > Priority: Critical > Labels: connectionloss, loop, reconnect > > We've been unable to replicate this in our test environments, but approximately once a week in production (~50 machine cluster using curator/zk for service discovery) we will get a machine falling into a loop and spewing tens of thousands of errors that look like: > {code} > Background operation retry gave uporg.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na] > at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965] > at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965] > {code} > The rate at which we get these errors seems to increase linearly until we stop the process (starts at 10-20/sec, when we kill the box it's typically generating 1,000+/sec) > When the error first occurs, there's a slightly different stack trace: > {code} > Background operation retry gave uporg.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss > at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na] > at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55] > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55] > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55] > {code} > followed very closely by: > {code} > Background retry gave uporg.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss > at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na] > at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na] > at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55] > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55] > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55] > at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55] > {code} > After which it begins spewing the stack trace I first posted above. We're assuming that some sort of networking hiccup is occurring in EC2 that's causing the ConnectionLoss, which seems entirely momentary (none of our other boxes see it, and when we check the box it can connect to all the zk servers without any issues.) -- This message was sent by Atlassian JIRA (v6.3.4#6332)