curator-dev mailing list archives

From "Vamsi Subhash Achanta (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-209) Background retry falls into infinite loop of reconnection after connection loss
Date Tue, 10 Nov 2015 09:33:11 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-209?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14998304#comment-14998304 ]

Vamsi Subhash Achanta commented on CURATOR-209:
-----------------------------------------------

We have been noticing this for a long time as well.

Scenario and sequence of events:
- If one of the ZooKeeper nodes (out of 5) goes down or the connection is lost (e.g. due to a network blip), exceptions start in Curator and the retry policy (ExponentialBackoffRetry) keeps retrying forever, even after the connection is back.
- We log the ConnectionState from a Curator ConnectionStateListener (a sketch of this listener follows right after this list), and the state changes back to RECONNECTED only a handful of times.
- The retries then print a flood of error logs very quickly ("Background operation retry gave up").
- The connection-state queue fills up and begins to drop events ("ConnectionStateManager queue full - dropping events to make room"). We have counted the state changes - there are only 20-40 of them.
- The only way to recover is to restart the process.
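
The listener we use is essentially the following. This is a minimal sketch rather than our exact code; the class name and log message are illustrative:

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.state.ConnectionState;
import org.apache.curator.framework.state.ConnectionStateListener;

// Minimal sketch: logs every connection-state transition Curator reports.
// In the scenario above we see only a handful of RECONNECTED transitions
// before the "queue full - dropping events" warnings start.
public class LoggingConnectionStateListener implements ConnectionStateListener
{
    @Override
    public void stateChanged(CuratorFramework client, ConnectionState newState)
    {
        System.out.println("Curator connection state changed to: " + newState);
    }
}
{code}

It is registered once via client.getConnectionStateListenable().addListener(...), as shown in the client sketch further below.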

SessionTimeout: 6000
ConnectionTimeout: 6000
SyncTime: 6000
TickTime: 1000
waitTimeToConnect: 6000
maxRetriesToConnect: 4
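
For reference, the client is built roughly as follows. This is a minimal sketch, not our exact code: the connect string is a placeholder, and we assume the waitTimeToConnect and maxRetriesToConnect values above feed the ExponentialBackoffRetry base sleep time and max retries; SyncTime and TickTime are server-side ZooKeeper settings and are not set on the client.

{code}
import org.apache.curator.framework.CuratorFramework;
import org.apache.curator.framework.CuratorFrameworkFactory;
import org.apache.curator.retry.ExponentialBackoffRetry;

public class CuratorClientFactory
{
    public static CuratorFramework newClient()
    {
        CuratorFramework client = CuratorFrameworkFactory.builder()
                .connectString("zk1:2181,zk2:2181,zk3:2181,zk4:2181,zk5:2181") // placeholder ensemble
                .sessionTimeoutMs(6000)    // SessionTimeout above
                .connectionTimeoutMs(6000) // ConnectionTimeout above
                // assumption: waitTimeToConnect/maxRetriesToConnect map to the retry policy
                .retryPolicy(new ExponentialBackoffRetry(6000, 4))
                .build();
        // register the state-change logger from the sketch above
        client.getConnectionStateListenable().addListener(new LoggingConnectionStateListener());
        client.start();
        return client;
    }
}
{code}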

This can be easily reproduced with our setup (100+ app nodes connecting to a 5-node ZooKeeper cluster).
Please help us fix this. Thanks.

> Background retry falls into infinite loop of reconnection after connection loss
> -------------------------------------------------------------------------------
>
>                 Key: CURATOR-209
>                 URL: https://issues.apache.org/jira/browse/CURATOR-209
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.6.0
>         Environment: sun java jdk 1.7.0_55, curator 2.6.0, zookeeper 3.3.6 on AWS EC2 in a 3 box ensemble
>            Reporter: Ryan Anderson
>            Priority: Critical
>              Labels: connectionloss, loop, reconnect
>
> We've been unable to replicate this in our test environments, but approximately once a week in production (~50 machine cluster using curator/zk for service discovery) we will get a machine falling into a loop and spewing tens of thousands of errors that look like:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:496) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl.sendBackgroundResponse(CreateBuilderImpl.java:538) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl.access$700(CreateBuilderImpl.java:44) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CreateBuilderImpl$6.processResult(CreateBuilderImpl.java:497) [curator-framework-2.6.0.jar:na]
> at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:605) [zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:498) [zookeeper-3.4.6.jar:3.4.6-1569965]
> {code}
> The rate at which we get these errors seems to increase linearly until we stop the process (starts at 10-20/sec, when we kill the box it's typically generating 1,000+/sec)
> When the error first occurs, there's a slightly different stack trace:
> {code}
> Background operation retry gave up
> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.zookeeper.KeeperException.create(KeeperException.java:99) ~[zookeeper-3.4.6.jar:3.4.6-1569965]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.checkBackgroundRetry(CuratorFrameworkImpl.java:695) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:813) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> followed very closely by:
> {code}
> Background retry gave up
> org.apache.curator.CuratorConnectionLossException: KeeperErrorCode = ConnectionLoss
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.performBackgroundOperation(CuratorFrameworkImpl.java:796) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.backgroundOperationsLoop(CuratorFrameworkImpl.java:779) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl.access$400(CuratorFrameworkImpl.java:58) [curator-framework-2.6.0.jar:na]
> at org.apache.curator.framework.imps.CuratorFrameworkImpl$4.call(CuratorFrameworkImpl.java:265) [curator-framework-2.6.0.jar:na]
> at java.util.concurrent.FutureTask.run(FutureTask.java:262) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145) [na:1.7.0_55]
> at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615) [na:1.7.0_55]
> at java.lang.Thread.run(Thread.java:745) [na:1.7.0_55]
> {code}
> After which it begins spewing the stack trace I first posted above. We're assuming that some sort of networking hiccup is occurring in EC2 that's causing the ConnectionLoss, which seems entirely momentary (none of our other boxes see it, and when we check the box it can connect to all the zk servers without any issues.)



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
