Return-Path: X-Original-To: apmail-curator-user-archive@minotaur.apache.org Delivered-To: apmail-curator-user-archive@minotaur.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 5710F10F26 for ; Tue, 18 Jun 2013 18:56:07 +0000 (UTC) Received: (qmail 88799 invoked by uid 500); 18 Jun 2013 18:56:07 -0000 Delivered-To: apmail-curator-user-archive@curator.apache.org Received: (qmail 88771 invoked by uid 500); 18 Jun 2013 18:56:06 -0000 Mailing-List: contact user-help@curator.incubator.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@curator.incubator.apache.org Delivered-To: mailing list user@curator.incubator.apache.org Received: (qmail 88763 invoked by uid 99); 18 Jun 2013 18:56:06 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 18 Jun 2013 18:56:06 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=HTML_MESSAGE,RCVD_IN_DNSWL_LOW X-Spam-Check-By: apache.org Received-SPF: error (nike.apache.org: local policy) Received: from [209.85.160.50] (HELO mail-pb0-f50.google.com) (209.85.160.50) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 18 Jun 2013 18:55:59 +0000 Received: by mail-pb0-f50.google.com with SMTP id wz7so4147006pbc.23 for ; Tue, 18 Jun 2013 11:55:17 -0700 (PDT) X-Google-DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=google.com; s=20120113; h=from:content-type:message-id:mime-version:subject:date:references :to:in-reply-to:x-mailer:x-gm-message-state; bh=pjDgJdF0QPuH6eo8Kb9q9XJrYTWe8Hv2+vS+bemiFl8=; b=D8DHt/YT4SRHAn+KD2oY5YdDY1mtoHOMWPAf09YvlkMEvy3siBOF+RS8KMts4+RAjX bjsaxJoJ7O5nH5OvdQyJLzA4v7Vy4UsZeX5Yp6vW/jNV4UNHc5qeYEdkGuA4otSq47le ZrTpkWT3EU5XLaC6gUsKUcSTvdFmmMKa/oJKZuf1PSITUre8bPkBXPKKBjHi6DqwujLR RMbodCBSg6iSHVTTxxB6kV4ufTdJgo2mX7vB94qjGxb8Vh2dcpoTyCbBlswkwSiMYWta FPqh69oQ9jWck9C4hiTmGPdPzr2aPSqXu6739tCGMtmddW0XGijFNAKxXxc3R0lg9oSs TbtA== X-Received: by 10.68.137.195 with SMTP id qk3mr18559011pbb.100.1371581717201; Tue, 18 Jun 2013 11:55:17 -0700 (PDT) Received: from ?IPv6:2607:fb10:2:244:89fe:359e:b496:bb85? ([2607:fb10:2:244:89fe:359e:b496:bb85]) by mx.google.com with ESMTPSA id g8sm15438821pbq.6.2013.06.18.11.55.15 for (version=TLSv1 cipher=ECDHE-RSA-RC4-SHA bits=128/128); Tue, 18 Jun 2013 11:55:16 -0700 (PDT) From: Jordan Zimmerman Content-Type: multipart/alternative; boundary="Apple-Mail=_D1EF45B3-416E-4BB8-9D5C-79458D6C9001" Message-Id: <3452C15F-67F3-4F11-A3D5-1DA2A6327B9D@jordanzimmerman.com> Mime-Version: 1.0 (Mac OS X Mail 6.5 \(1508\)) Subject: Re: CuratorFramework is not recovering connecion with ZK Date: Tue, 18 Jun 2013 11:55:15 -0700 References: <57AC621B-9933-4FE3-ADD5-220EFB28E0D9@jordanzimmerman.com> <1371219772.97240.YahooMailNeo@web172503.mail.ir2.yahoo.com> <1371236951.63729.YahooMailNeo@web172504.mail.ir2.yahoo.com> <0F81B064-AA0B-4338-AC00-4D61B2829162@jordanzimmerman.com> <1371239274.32290.YahooMailNeo@web172505.mail.ir2.yahoo.com> <56235476-7793-46B7-8EC3-D9525E4284B3@jordanzimmerman.com> <1371249314.59596.YahooMailNeo@web172505.mail.ir2.yahoo.com> <6CAE0EC9-23E1-4F09-9DBA-32CB1FA9F3C8@jordanzimmerman.com> <1371559927.68317.YahooMailNeo@web172504.mail.ir2.yahoo.com> To: user@curator.incubator.apache.org, =?iso-8859-1?Q?Evaristo_Jos=E9_Camarero?= In-Reply-To: <1371559927.68317.YahooMailNeo@web172504.mail.ir2.yahoo.com> X-Mailer: Apple Mail (2.1508) X-Gm-Message-State: ALoCoQmkOIVXKNU8i90eKvokSsqToqKYHGuK9LdjUu19wPdROZdPPRVZj7/bJruZmYhOj3zHg66w X-Virus-Checked: Checked by ClamAV on apache.org --Apple-Mail=_D1EF45B3-416E-4BB8-9D5C-79458D6C9001 Content-Transfer-Encoding: quoted-printable Content-Type: text/plain; charset=iso-8859-1 Interesting - I wonder if there's anything Curator can notice so that it = can reset the connection. -JZ On Jun 18, 2013, at 5:52 AM, Evaristo Jos=E9 Camarero = wrote: >=20 > After some investigation I think the problem is related with this = Zookeeper bug > https://issues.apache.org/jira/browse/ZOOKEEPER-832 >=20 > The ZK client is in an infinite loop trying to reconnect >=20 > De: Jordan Zimmerman > Para: Evaristo Jos=E9 Camarero =20 > CC: "user@curator.incubator.apache.org" = =20 > Enviado: S=E1bado 15 de junio de 2013 1:40 > Asunto: Re: CuratorFramework is not recovering connecion with ZK >=20 > I don't think it's a bug. If the connection is Lost there's not much = that can be done. Curator recovers if the connection can be restored on = the next method call. What were you expecting? >=20 > -Jordan >=20 > On Jun 14, 2013, at 3:35 PM, Evaristo Jos=E9 Camarero = wrote: >=20 >>=20 >> Should I open a JIRA? I am not really sure if the problem is in the = Curator or in the ZK client/server. >>=20 >> In this case according to the logs I have attached the ZK client it = is trying to reconnect but the server rejects forever, and Curator is = not aware of the issue (otherwise I assume that a new ZK client will be = used) or at least it is not exposing the problem in the API with some = kind of notification. >>=20 >> Is the workaround to this problem to have a background thread trying = to perform a TX when the connection has been notified as LOST to really = wake up the client again? I assume that API was not designed with this = approach in mind. >>=20 >> Thanks for the help, >>=20 >> Evaristo >>=20 >>=20 >>=20 >> De: Jordan Zimmerman >> Para: user@curator.incubator.apache.org; Evaristo Jos=E9 Camarero = =20 >> Enviado: Viernes 14 de junio de 2013 22:25 >> Asunto: Re: CuratorFramework is not recovering connecion with ZK >>=20 >> I think there needs to be some ZK-active methods to get the = reconnect. I just tried again and didn't always get the reconnect. = However, if I add this line after the second server start() it always = works: >>=20 >> zkClient.create().forPath("/hey"); >>=20 >> -JZ >>=20 >> On Jun 14, 2013, at 12:47 PM, Evaristo Jos=E9 Camarero = wrote: >>=20 >>>=20 >>> I have tested with Curator 2.0.1 , curator 2.0.2 branch, master. = Same result. Tested different timeouts and the result is the same. Let = me know the branch you are using and I can try again. >>>=20 >>> It looks that the issue is different in my opinion: >>> I have checked Zookeeper server logs, and this is what I have found = after starting again the server (Notice that is a server fresh start). >>>=20 >>> Looks that zk client is trying to reconnect but ZK client last seen = TX is zxid =3D 2, and last server zxid =3D0 because server has been = restarted (and TX log is lost). >>>=20 >>> 2013-06-14 21:37:46 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:46 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50436 >>> 2013-06-14 21:37:46 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50436 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:46 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50436 (no session established for = client) >>> 2013-06-14 21:37:46 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:47 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:47 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:47 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50437 >>> 2013-06-14 21:37:47 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50437 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:47 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50437 (no session established for = client) >>> 2013-06-14 21:37:47 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:49 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:49 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:49 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50438 >>> 2013-06-14 21:37:49 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50438 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:49 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50438 (no session established for = client) >>> 2013-06-14 21:37:49 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:50 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:50 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:50 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50439 >>> 2013-06-14 21:37:50 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50439 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:50 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50439 (no session established for = client) >>> 2013-06-14 21:37:50 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:52 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:52 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50440 >>> 2013-06-14 21:37:52 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:52 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50440 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:52 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50440 (no session established for = client) >>> 2013-06-14 21:37:52 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:54 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:54 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:54 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50441 >>> 2013-06-14 21:37:54 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50441 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:54 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50441 (no session established for = client) >>> 2013-06-14 21:37:54 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:55 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:55 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50442 >>> 2013-06-14 21:37:55 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:55 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50442 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:55 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50442 (no session established for = client) >>> 2013-06-14 21:37:55 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:56 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:56 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50443 >>> 2013-06-14 21:37:56 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:56 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50443 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:56 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50443 (no session established for = client) >>> 2013-06-14 21:37:56 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:37:58 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:37:58 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50444 >>> 2013-06-14 21:37:58 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:37:58 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50444 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:37:58 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50444 (no session established for = client) >>> 2013-06-14 21:37:58 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:38:00 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:38:00 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50445 >>> 2013-06-14 21:38:00 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:38:00 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50445 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:38:00 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50445 (no session established for = client) >>> 2013-06-14 21:38:00 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:38:02 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:38:02 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50446 >>> 2013-06-14 21:38:02 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:38:02 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50446 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:38:02 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50446 (no session established for = client) >>> 2013-06-14 21:38:02 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:38:04 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:38:04 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50447 >>> 2013-06-14 21:38:04 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:38:04 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50447 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:38:04 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50447 (no session established for = client) >>> 2013-06-14 21:38:04 INFO ClientCnxn:1085 - Unable to read = additional data from server sessionid 0x13f4430e2930000, likely server = has closed socket, closing socket connection and attempting reconnect >>> 2013-06-14 21:38:05 INFO ClientCnxn:966 - Opening socket connection = to server localhost.localdomain/127.0.0.1:40000. Will not attempt to = authenticate using SASL (unknown error) >>> 2013-06-14 21:38:05 INFO ClientCnxn:849 - Socket connection = established to localhost.localdomain/127.0.0.1:40000, initiating session >>> 2013-06-14 21:38:05 INFO NIOServerCnxnFactory:197 - Accepted socket = connection from /127.0.0.1:50448 >>> 2013-06-14 21:38:05 INFO ZooKeeperServer:812 - Refusing session = request for client /127.0.0.1:50448 as it has seen zxid 0x2 our last = zxid is 0x0 client must try another server >>> 2013-06-14 21:38:05 INFO NIOServerCnxn:1001 - Closed socket = connection for client /127.0.0.1:50448 (no session established for = client) >>> 2013-06 >>>=20 >>> The same result >>>=20 >>> Timing are not relevant in the tests. The ZK server is providing = this logs >>>=20 >>>=20 >>>=20 >>> De: Jordan Zimmerman >>> Para: Evaristo Jos=E9 Camarero =20 >>> CC: "user@curator.incubator.apache.org" = =20 >>> Enviado: Viernes 14 de junio de 2013 21:13 >>> Asunto: Re: CuratorFramework is not recovering connecion with ZK >>>=20 >>> Play with the timings a bit. I consistently see RECONNECTED. Also, = make sure you're using the latest Curator. There's a bug fix in 2.0.1 = that can affect this. >>>=20 >>> -JZ >>>=20 >>> On Jun 14, 2013, at 12:09 PM, Evaristo Jos=E9 Camarero = wrote: >>>=20 >>>> Hi there, >>>>=20 >>>> Thanks for the fast response. >>>>=20 >>>> This is the output of the program with your modifications: >>>>=20 >>>> 2013-06-14 21:05:01 INFO CuratorFrameworkImpl:221 - Starting >>>> 2013-06-14 21:05:01 INFO ConnectionStateManager:151 - State = change: CONNECTED >>>> NEw state CONNECTED >>>> 2013-06-14 21:05:13 INFO ConnectionStateManager:151 - State = change: SUSPENDED >>>> NEw state SUSPENDED >>>> 2013-06-14 21:05:16 ERROR CuratorFrameworkImpl:530 - Background = operation retry gave up >>>> org.apache.zookeeper.KeeperException$ConnectionLossException: = KeeperErrorCode =3D ConnectionLoss >>>> at = org.apache.zookeeper.KeeperException.create(KeeperException.java:99) >>>> at = org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOp= eration(CuratorFrameworkImpl.java:487) >>>> at = org.apache.curator.framework.imps.BackgroundSyncImpl$1.processResult(Backg= roundSyncImpl.java:50) >>>> at = org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:6= 06) >>>> at = org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495) >>>> 2013-06-14 21:05:16 INFO ConnectionStateManager:151 - State = change: LOST >>>> NEw state LOST >>>>=20 >>>>=20 >>>> You can notice that: >>>> "NEW state RECONNECTED never appears" >>>>=20 >>>> what basically means that CuratorFramework listener is not called, = and therefore I assume that Curator does not reconnect. >>>>=20 >>>>=20 >>>>=20 >>>> Something very interesting is the fact that if you remove = "zkClient.create().forPath("/pedro", "juan".getBytes());" from the = program (basically you are not doing any TX, Curator is able to recover. >>>>=20 >>>> Regards, >>>>=20 >>>> Evaristo >>>>=20 >>>>=20 >>>>=20 >>>> De: Jordan Zimmerman >>>> Para: user@curator.incubator.apache.org; Evaristo Jos=E9 Camarero = =20 >>>> Enviado: Viernes 14 de junio de 2013 19:06 >>>> Asunto: Re: CuratorFramework is not recovering connecion with ZK >>>>=20 >>>> Curator uses the timeouts given when the CuratorFramework object is = created to manage the connection. In this instance, after LOST has been = received Curator may wait until session timeout before trying to = re-create the ZooKeeper object. I'm enclosing a modified version of your = test that sets the Curator timeouts to half the time that the test's = sleeps wait and you can see that Curator recovers correctly. >>>>=20 >>>> -Jordan >>>>=20 >>>>=20 >>>>=20 >>>> On Jun 14, 2013, at 7:22 AM, Evaristo Jos=E9 Camarero = wrote: >>>>=20 >>>>>=20 >>>>> Hi there, >>>>>=20 >>>>> I have found a case in which CuratorFramework is not recovering = connection with ZK servers. >>>>>=20 >>>>> The use case is the following: >>>>> - Start ZK servers >>>>> - Start application with CuratorFramework >>>>> - ZK servers goes down. >>>>> - ZK servers start again. >>>>> - CuratorFramework app it is not notified that connection is = reconnected , but neither notifies that can not recover connection in = any way, so application can not recover. Only option is to restart the = app. >>>>> Notice that if CuratorFRamework client is not making any = transaction (just comment create() TX), the client is able to reconnect. >>>>>=20 >>>>> I attach a program able to reproduce the problem >>>>>=20 >>>>> Regards, >>>>>=20 >>>>> Evaristo >>>>>=20 >>>>>=20 >>>>>=20 >>>>> >>>>=20 >>>>=20 >>>>=20 >>>=20 >>>=20 >>>=20 >>=20 >>=20 >>=20 >=20 >=20 >=20 --Apple-Mail=_D1EF45B3-416E-4BB8-9D5C-79458D6C9001 Content-Transfer-Encoding: quoted-printable Content-Type: text/html; charset=iso-8859-1 evaristo.camarero@yahoo.es&= gt; wrote:

After some = investigation I think the problem is related with this Zookeeper = bug
https://issues.apache.org/jira/browse/ZOOKEEPER-832
<= /div>
The ZK client is in an = infinite loop trying to reconnect


= De: Jordan Zimmerman <jordan@jordanzimmerman.com&= gt;
Para: Evaristo = Jos=E9 Camarero <evaristo.camarero@yahoo.es&= gt;
CC: "user@curator.incubator.a= pache.org" <user@curator.incubator.a= pache.org>
Enviado: S=E1bado 15 de junio de 2013 1:40
= Asunto: Re: = CuratorFramework is not recovering connecion with ZK
=

I don't = think it's a bug. If the connection is Lost there's not much that can be = done. Curator recovers if the connection can be restored on the next = method call. What were you = expecting?

-Jordan

On Jun 14, = 2013, at 3:35 PM, Evaristo Jos=E9 Camarero <evaristo.camarero@yahoo.es&= gt; wrote:


Should I open = a JIRA? I am not really sure if the problem is in the Curator or in the = ZK client/server.

In this case = according to the logs I have attached the ZK client it is trying to = reconnect  but the server rejects forever, and Curator is not aware = of the issue (otherwise I assume that a new ZK client will be used) or at least it is not exposing the problem = in the API with some kind of notification.

Is the = workaround to this problem to have a background thread trying to perform = a TX when the connection has been notified as LOST to really wake up the = client again? I assume that API was not designed with this approach in = mind.

Thanks for the help,

Evaristo




De: = Jordan Zimmerman <jordan@jordanzimmerman.com&= gt;
Para: user@curator.incubator.a= pache.org; Evaristo Jos=E9 Camarero <evaristo.camarero@yahoo.es&= gt;
Enviado: = Viernes 14 de junio de 2013 22:25
Asunto: Re: CuratorFramework is = not recovering connecion with ZK

I = think there needs to be some ZK-active methods to get the reconnect. I = just tried again and didn't always get the reconnect. However, if I add = this line after the second server start() it always = works:

        = zkClient.create().forPath("/hey");

-JZ
=
On Jun 14, 2013, at 12:47 PM, Evaristo Jos=E9 = Camarero <evaristo.camarero@yahoo.es&= gt; wrote:


I have tested = with Curator 2.0.1 , curator 2.0.2 branch, master. Same result. Tested = different timeouts and the result is the same. Let me know the branch = you are using and I can try again.

It looks that the issue is different in my opinion:
I have checked = Zookeeper server logs, and this is what I have found after starting = again the server (Notice that is a server fresh = start).

Looks that zk client = is trying to reconnect but ZK client last seen TX is zxid =3D 2, and = last server zxid =3D0 because server has been restarted (and TX log is = lost).

2013-06-14 21:37:46 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:46 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50436
2013-06-14 21:37:46 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50436 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:46 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50436 (no session established for client)
2013-06-14 21:37:46 INFO  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:37:47 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:47 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:47 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50437
2013-06-14 21:37:47 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50437 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:47 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50437 (no session established for client)
2013-06-14 21:37:47 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:37:49 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:49 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:49 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50438
2013-06-14 21:37:49 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50438 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:49 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50438 (no session established for client)
2013-06-14 21:37:49 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:37:50 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:50 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:50 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50439
2013-06-14 21:37:50 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50439 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:50 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50439 (no session established for client)
2013-06-14 21:37:50 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and attempting = reconnect
2013-06-14 = 21:37:52 INFO  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:52 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50440
2013-06-14 21:37:52 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:52 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50440 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:52 INFO  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50440 (no session established for client)
2013-06-14 21:37:52 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:37:54 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:54 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:54 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50441
2013-06-14 21:37:54 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50441 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:54 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50441 (no session established for client)
2013-06-14 21:37:54 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:37:55 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:55 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50442
2013-06-14 21:37:55 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:55 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50442 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:55 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50442 (no session established for client)
2013-06-14 21:37:55 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:37:56 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:56 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50443
2013-06-14 21:37:56 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:56 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50443 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:37:56 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50443 (no session established for client)
2013-06-14 21:37:56 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:37:58 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:37:58 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50444
2013-06-14 21:37:58 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:37:58 INFO =  ZooKeeperServer:812 - Refusing session request for client /127.0.0.1:50444 as it has seen zxid 0x2 our = last zxid is 0x0 client must try another server
2013-06-14 21:37:58 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50444 (no session established for client)
2013-06-14 21:37:58 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:38:00 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:38:00 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50445
2013-06-14 21:38:00 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:38:00 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50445 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:38:00 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50445 (no session established for client)
2013-06-14 21:38:00 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:38:02 INFO =  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000. Will not = attempt to authenticate using SASL (unknown error)
2013-06-14 21:38:02 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50446
2013-06-14 21:38:02 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:38:02 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50446 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:38:02 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50446 (no session established for client)
2013-06-14 21:38:02 INFO  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:38:04 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:38:04 INFO =  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50447
2013-06-14 21:38:04 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:38:04 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50447 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:38:04 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50447 (no session established for client)
2013-06-14 21:38:04 INFO =  ClientCnxn:1085 - Unable to read additional data from server = sessionid 0x13f4430e2930000, likely server has closed socket, closing = socket connection and attempting reconnect
2013-06-14 21:38:05 INFO =  ClientCnxn:966 - Opening socket connection to server = localhost.localdomain/127.0.0.1:40000. Will not attempt to authenticate = using SASL (unknown error)
2013-06-14 21:38:05 INFO =  ClientCnxn:849 - Socket connection established to = localhost.localdomain/127.0.0.1:40000, initiating session
2013-06-14 21:38:05 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from = /127.0.0.1:50448
2013-06-14 21:38:05 INFO =  ZooKeeperServer:812 - Refusing session request for client = /127.0.0.1:50448 as it has seen zxid 0x2 our last zxid is 0x0 client = must try another server
2013-06-14 21:38:05 INFO =  NIOServerCnxn:1001 - Closed socket connection for client = /127.0.0.1:50448 (no session established for client)
2013-06

The same result

Timing are not relevant in the = tests. The ZK server is providing this logs




= De: Jordan Zimmerman <jordan@jordanzimmerman.com&= gt;
Para: Evaristo = Jos=E9 Camarero <evaristo.camarero@yahoo.es&= gt;
CC: "user@curator.incubator.a= pache.org" <user@curator.incubator.a= pache.org>
Enviado: Viernes = 14 de junio de 2013 21:13
Asunto: Re: CuratorFramework is = not recovering connecion with ZK

Play = with the timings a bit. I consistently see RECONNECTED. Also, make sure = you're using the latest Curator. There's a bug fix in 2.0.1 that can = affect this.

-JZ

On Jun 14, = 2013, at 12:09 PM, Evaristo Jos=E9 Camarero <evaristo.camarero@yahoo.es&= gt; wrote:

Hi there,

Thanks for the fast = response.

This is the output of the = program with your modifications:

2013-06-14 21:05:01 INFO =  CuratorFrameworkImpl:221 - Starting
2013-06-14 21:05:01 INFO =  ConnectionStateManager:151 - State change: CONNECTED
NEw state CONNECTED
2013-06-14 21:05:13 INFO =  ConnectionStateManager:151 - State change: SUSPENDED
NEw state SUSPENDED
2013-06-14 = 21:05:16 ERROR CuratorFrameworkImpl:530 - Background operation retry = gave up
org.apache.zookeeper.KeeperExcepti= on$ConnectionLossException: KeeperErrorCode =3D ConnectionLoss
= at = org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
=
= at = org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOp= eration(CuratorFrameworkImpl.java:487)
= at = org.apache.curator.framework.imps.BackgroundSyncImpl$1.processResult(Backg= roundSyncImpl.java:50)
= at = org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:6= 06)
= at = org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
=
2013-06-14 21:05:16 INFO =  ConnectionStateManager:151 - State change: LOST
NEw state LOST


You can notice that:
"NEW state = RECONNECTED never appears"

what basically means that = CuratorFramework listener is not called, and therefore I assume that = Curator does not reconnect.



Something very interesting is = the fact that if you remove "zkClient.create().forPath("/pedro", = "juan".getBytes());" from the program (basically you are not doing any = TX, Curator is able to recover.

Regards,

Evaristo



=
De: Jordan Zimmerman <jordan@jordanzimmerman.com&= gt;
Para: user@curator.incubator.a= pache.org; Evaristo Jos=E9 Camarero <evaristo.camarero@yahoo.es&= gt;
Enviado: = Viernes 14 de junio de 2013 19:06
Asunto: Re: CuratorFramework is = not recovering connecion with ZK

Curator uses the timeouts given when the CuratorFramework object is created to manage = the connection. In this instance, after LOST has been received Curator = may wait until session timeout before trying to re-create the ZooKeeper = object. I'm enclosing a modified version of your test that sets the = Curator timeouts to half the time that the test's sleeps wait and you = can see that Curator recovers = correctly.

-Jordan



On Jun 14, = 2013, at 7:22 AM, Evaristo Jos=E9 Camarero <evaristo.camarero@yahoo.es&= gt; wrote:


Hi there,

I have = found a case in which CuratorFramework is not recovering connection with = ZK servers.

The use case is the = following:
- Start ZK servers
- = Start application with CuratorFramework
- ZK servers goes = down.
- ZK servers start again.
- = CuratorFramework app it is not notified that connection is reconnected , = but neither notifies that can not recover connection in any way, so = application can not recover. Only option is to restart the = app.
Notice that if = CuratorFRamework client is not making any transaction (just comment = create() TX), the client is able to reconnect.

I attach a program able to reproduce the problem

Regards,

Evaristo



=
<CuratorFails.java>

<= /div>

=



=



=



=

= --Apple-Mail=_D1EF45B3-416E-4BB8-9D5C-79458D6C9001--