Date: Tue, 12 Aug 2014 03:43:12 +0000 (UTC)
From: "Cameron McKenzie (JIRA)"
To: dev@curator.apache.org
Reply-To: dev@curator.apache.org
Subject: [jira] [Commented] (CURATOR-134) Curator sends a connection LOST event before sessionTimeout

    [ https://issues.apache.org/jira/browse/CURATOR-134?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14093689#comment-14093689 ]

Cameron McKenzie commented on CURATOR-134:
------------------------------------------

I've had a look into this and can certainly reproduce it. It appears that the 'LOST' state will be published if no connection to ZK can be established, regardless of how long the session timeout is.
The amount of time it takes for the 'LOST' state to be published depends upon the retry policy, but due to the way that the RetryLoop is implemented, it will block for at least the specified connection timeout on each iteration of the retry loop. Once retries have been exhausted, the LOST state is published.

This can be incorrect. If the client can't connect to ZK for some period of time due to a network glitch, but the ZK cluster is still alive, then the session is not LOST and will be reestablished on reconnection (should that occur before session timeout).

So, I guess Curator should keep retrying until the configured session timeout expires.

> Curator sends a connection LOST event before sessionTimeout
> -----------------------------------------------------------
>
>                 Key: CURATOR-134
>                 URL: https://issues.apache.org/jira/browse/CURATOR-134
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Client
>   Affects Versions: 2.6.0
>         Environment: Ubuntu 12.04
>            Reporter: Benjamin Jaton
>            Priority: Critical
>         Attachments: Test.java
>
>
> Created a Curator client with:
> - connection timeout: 10 seconds
> - session timeout: 30 seconds
> - retry policy: RetryNTimes(3, 10000)
> A scenario where the ensemble is lost causes the Curator client to send a LOST event in less than the expected 30 seconds:
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> The client code is attached; this is the complete output:
> Fri Aug 01 11:16:53 PDT 2014 - CURATOR STATE: CONNECTED
> Fri Aug 01 11:16:54 PDT 2014 - Creating ZK client...
> Fri Aug 01 11:16:54 PDT 2014 - ZK client created...
> Fri Aug 01 11:16:54 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:16:58 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:16:58 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:16 PDT 2014 - CURATOR STATE: RECONNECTED
> Fri Aug 01 11:17:17 PDT 2014 - ZOOKEEPER STATE: SyncConnected
> Fri Aug 01 11:17:19 PDT 2014 - ZOOKEEPER STATE: Disconnected
> Fri Aug 01 11:17:19 PDT 2014 - CURATOR STATE: SUSPENDED
> Fri Aug 01 11:17:29 PDT 2014 - CURATOR STATE: LOST
> I think that the LOST event is actually 30 seconds away from the very first SUSPENDED event, whereas it should be 30 seconds away from the last one.
> To reproduce it, I started only 2 ZK servers in a 3-node ensemble, then I stopped one of them (-> 1st SUSPENDED), waited for 10-20 seconds, then started it and stopped it again.

--
This message was sent by Atlassian JIRA
(v6.2#6252)
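
The suggestion in the comment above (keep retrying until the configured session timeout expires, measured from the most recent disconnect) can be sketched roughly as follows. This is only an illustration under stated assumptions: the class SessionAwareRetry and its method shouldKeepRetrying are hypothetical names, not part of Curator, and the client-side disconnect time is only an approximation of when the server will actually expire the session.

```java
import java.util.concurrent.TimeUnit;

/**
 * Hypothetical helper (not Curator API): decides whether a client should
 * keep retrying, based on time elapsed since the most recent disconnect.
 */
public class SessionAwareRetry {
    private final long sessionTimeoutMs;

    public SessionAwareRetry(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    /**
     * @param disconnectedAtMs time of the most recent disconnect (the last SUSPENDED)
     * @param nowMs            current time, on the same clock
     * @return true while less than the session timeout has elapsed
     */
    public boolean shouldKeepRetrying(long disconnectedAtMs, long nowMs) {
        return (nowMs - disconnectedAtMs) < sessionTimeoutMs;
    }

    public static void main(String[] args) {
        // Session timeout of 30 seconds, as in the reported configuration.
        SessionAwareRetry retry = new SessionAwareRetry(TimeUnit.SECONDS.toMillis(30));
        long lastSuspendedMs = 0L;

        // 10 seconds after the last SUSPENDED: the session may still be alive.
        System.out.println(retry.shouldKeepRetrying(lastSuspendedMs, 10_000L)); // prints "true"
        // 30 seconds after the last SUSPENDED: stop retrying and report LOST.
        System.out.println(retry.shouldKeepRetrying(lastSuspendedMs, 30_000L)); // prints "false"
    }
}
```

With a check of this kind, the LOST event in the log above would not be reported 10 seconds after the second SUSPENDED, because the decision is tied to elapsed time since the last disconnect rather than to the number of retries.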