curator-dev mailing list archives

From "Cameron McKenzie (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-134) Curator sends a connection LOST event before sessionTimeout
Date Tue, 19 Aug 2014 06:48:18 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14101930#comment-14101930 ]

Cameron McKenzie commented on CURATOR-134:
------------------------------------------

I think I've tracked down what the problem is. It can occur when a connection loss is
followed by connection reestablishment and then by connection loss again. Something along
the lines of the following occurs.

Assume a retry policy of 3 retries with a 10 second sleep between retries.

-Connected to ZK
-Connection is lost
-Start the background sync that occurs on connection loss. This initially fails because there's
no connection, and gets put on the retry queue to occur in 10 seconds.

Before those 10 seconds have passed, the next two events occur:
-Connection is reestablished
-Connection is lost

After the 10 seconds have passed:
-The background sync from the previous connection loss is retried. It fails again, gets
requeued, etc.

The problem is that this sync process has already used one of its configured retries, so
if the connection does not come back before the rest of the retries have expired, a LOST
event is generated. This is why the LOST event is generated more quickly than expected. In
the worst case, the sync process could be on its last retry, with only a small amount of time
left before that retry occurs, when the connection is reestablished and then lost again.
This would cause the LOST event to happen essentially immediately after the RECONNECTED event.
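
To make the arithmetic concrete, here is a minimal sketch of the timeline described above. It
is a toy model, not Curator code: the class and method names are hypothetical, the connection
windows are made-up numbers, and it assumes LOST is raised as soon as the attempt following
the last allowed retry fails.

```java
// Hypothetical model of the timeline above; this is not Curator code.
final class StaleSyncRetryDemo {

    /** True if t falls inside any [from, to) window during which the connection is up. */
    static boolean isConnected(int t, int[][] connectedWindows) {
        for (int[] w : connectedWindows) {
            if (t >= w[0] && t < w[1]) {
                return true;
            }
        }
        return false;
    }

    /**
     * A sync starts at startSec and is then retried every sleepSec, at most maxRetries times.
     * Returns the second at which the retries are exhausted (LOST), or -1 if an attempt
     * happens to run while the connection is up.
     */
    static int lostAt(int startSec, int maxRetries, int sleepSec, int[][] connectedWindows) {
        for (int attempt = 0; attempt <= maxRetries; attempt++) {
            int t = startSec + attempt * sleepSec;
            if (isConnected(t, connectedWindows)) {
                return -1;
            }
        }
        return startSec + maxRetries * sleepSec;
    }

    public static void main(String[] args) {
        // Assumed timeline: lost at t=0, reconnected at t=12, lost again at t=14.
        int[][] up = {{12, 14}};
        int firstLoss = 0;
        int secondLoss = 14;

        // The sync started by the first loss keeps its retry count across the reconnect.
        System.out.println("LOST from stale sync at t=" + lostAt(firstLoss, 3, 10, up));
        // What a sync started at the second loss would give on its own.
        System.out.println("LOST from fresh sync at t=" + lostAt(secondLoss, 3, 10, up));
    }
}
```

With these assumed numbers the stale sync reports LOST at t=30, only 16 seconds after the
second connection loss, whereas a sync counted from the second loss would not give up until
t=44.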

I'm not sure what the best way to fix this is yet. Ideally, we want to cancel this sync
process when a connection is reestablished, because if the connection is lost again, a new
sync process gets generated regardless of whether one is already running. I'm not sure how
practical that is though, so I will have a bit more of a dig.
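
One possible shape for that cancellation, purely as an illustrative sketch: tag each sync with
a generation token and bump the generation on reconnect, so that a queued retry from an older
connection loss can detect that it is stale and drop itself. The class and method names below
are hypothetical and are not part of Curator.

```java
import java.util.concurrent.atomic.AtomicLong;

// Hypothetical helper, not a Curator class: invalidates syncs from earlier connection losses.
final class SyncGeneration {
    private final AtomicLong generation = new AtomicLong();

    /** Called on connection loss; returns the token the newly started sync should carry. */
    long onConnectionLost() {
        return generation.incrementAndGet();
    }

    /** Called on reconnect; any sync holding an older token becomes stale. */
    void onReconnected() {
        generation.incrementAndGet();
    }

    /** A queued retry would check this first and give up quietly if it returns false. */
    boolean isCurrent(long token) {
        return generation.get() == token;
    }
}
```

A retry that finds `isCurrent(token)` false would simply not be requeued, so only the sync
started by the most recent connection loss could ever lead to a LOST event.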

Any thoughts [~randgalt] (or any of the other devs)?


> Curator sends a connection LOST event before sessionTimeout
> -----------------------------------------------------------
>
>                 Key: CURATOR-134
>                 URL: https://issues.apache.org/jira/browse/CURATOR-134
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 2.6.0
>         Environment: Ubuntu 12.04
>            Reporter: Benjamin Jaton
>            Priority: Critical
>         Attachments: Test.java
>
>
> Created a Curator client with:
> - connection timeout: 10 seconds
> - session timeout: 30 seconds
> - retry policy: RetryNTimes(3, 10000)
> A scenario where the ensemble is lost causes the Curator client to send a LOST event in
> less than the expected 30 seconds:
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> The client code is attached. This is the complete output:
> Fri Aug 01 11:16:53 PDT 2014 - CURATOR STATE: CONNECTED
> Fri Aug 01 11:16:54 PDT 2014 - Creating ZK client...
> Fri Aug 01 11:16:54 PDT 2014 - ZK client created...
> Fri Aug 01 11:16:54 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:16:58 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:16:58 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:16 PDT 2014 - CURATOR STATE: RECONNECTED
> Fri Aug 01 11:17:17 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:17:19 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> I think that the LOST event is actually 30 seconds away from the very first SUSPENDED
> event, whereas it should be 30 seconds away from the last one.
> To reproduce it, I started only 2 ZK servers in a 3-node ensemble, then I stopped one
> of them (-> 1st SUSPENDED), waited for 10-20 seconds, then started it and stopped it again.



--
This message was sent by Atlassian JIRA
(v6.2#6252)
