curator-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: CuratorFramework is not recovering connecion with ZK
Date Fri, 14 Jun 2013 23:40:11 GMT
I don't think it's a bug. If the connection is Lost there's not much that can be done. Curator
recovers if the connection can be restored on the next method call. What were you expecting?

-Jordan

On Jun 14, 2013, at 3:35 PM, Evaristo José Camarero <evaristo.camarero@yahoo.es> wrote:

> 
> Should I open a JIRA? I am not really sure if the problem is in the Curator or in the
ZK client/server.
> 
> In this case according to the logs I have attached the ZK client it is trying to reconnect
 but the server rejects forever, and Curator is not aware of the issue (otherwise I assume
that a new ZK client will be used) or at least it is not exposing the problem in the API with
some kind of notification.
> 
> Is the workaround to this problem to have a background thread trying to perform a TX
when the connection has been notified as LOST to really wake up the client again? I assume
that API was not designed with this approach in mind.
> 
> Thanks for the help,
> 
> Evaristo
> 
> 
> 
> De: Jordan Zimmerman <jordan@jordanzimmerman.com>
> Para: user@curator.incubator.apache.org; Evaristo José Camarero <evaristo.camarero@yahoo.es>

> Enviado: Viernes 14 de junio de 2013 22:25
> Asunto: Re: CuratorFramework is not recovering connecion with ZK
> 
> I think there needs to be some ZK-active methods to get the reconnect. I just tried again
and didn't always get the reconnect. However, if I add this line after the second server start()
it always works:
> 
>         zkClient.create().forPath("/hey");
> 
> -JZ
> 
> On Jun 14, 2013, at 12:47 PM, Evaristo José Camarero <evaristo.camarero@yahoo.es>
wrote:
> 
>> 
>> I have tested with Curator 2.0.1 , curator 2.0.2 branch, master. Same result. Tested
different timeouts and the result is the same. Let me know the branch you are using and I
can try again.
>> 
>> It looks that the issue is different in my opinion:
>> I have checked Zookeeper server logs, and this is what I have found after starting
again the server (Notice that is a server fresh start).
>> 
>> Looks that zk client is trying to reconnect but ZK client last seen TX is zxid =
2, and last server zxid =0 because server has been restarted (and TX log is lost).
>> 
>> 2013-06-14 21:37:46 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:46 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50436
>> 2013-06-14 21:37:46 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50436 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:46 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50436 (no session established for client)
>> 2013-06-14 21:37:46 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:47 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:47 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:47 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50437
>> 2013-06-14 21:37:47 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50437 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:47 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50437 (no session established for client)
>> 2013-06-14 21:37:47 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:49 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:49 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:49 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50438
>> 2013-06-14 21:37:49 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50438 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:49 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50438 (no session established for client)
>> 2013-06-14 21:37:49 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:50 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:50 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:50 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50439
>> 2013-06-14 21:37:50 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50439 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:50 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50439 (no session established for client)
>> 2013-06-14 21:37:50 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:52 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:52 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50440
>> 2013-06-14 21:37:52 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:52 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50440 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:52 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50440 (no session established for client)
>> 2013-06-14 21:37:52 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:54 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:54 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:54 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50441
>> 2013-06-14 21:37:54 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50441 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:54 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50441 (no session established for client)
>> 2013-06-14 21:37:54 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:55 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:55 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50442
>> 2013-06-14 21:37:55 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:55 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50442 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:55 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50442 (no session established for client)
>> 2013-06-14 21:37:55 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:56 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:56 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50443
>> 2013-06-14 21:37:56 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:56 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50443 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:56 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50443 (no session established for client)
>> 2013-06-14 21:37:56 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:37:58 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:37:58 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50444
>> 2013-06-14 21:37:58 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:37:58 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50444 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:37:58 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50444 (no session established for client)
>> 2013-06-14 21:37:58 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:38:00 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:38:00 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50445
>> 2013-06-14 21:38:00 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:38:00 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50445 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:38:00 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50445 (no session established for client)
>> 2013-06-14 21:38:00 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:38:02 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:38:02 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50446
>> 2013-06-14 21:38:02 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:38:02 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50446 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:38:02 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50446 (no session established for client)
>> 2013-06-14 21:38:02 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:38:04 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:38:04 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50447
>> 2013-06-14 21:38:04 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:38:04 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50447 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:38:04 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50447 (no session established for client)
>> 2013-06-14 21:38:04 INFO  ClientCnxn:1085 - Unable to read additional data from server
sessionid 0x13f4430e2930000, likely server has closed socket, closing socket connection and
attempting reconnect
>> 2013-06-14 21:38:05 INFO  ClientCnxn:966 - Opening socket connection to server localhost.localdomain/127.0.0.1:40000.
Will not attempt to authenticate using SASL (unknown error)
>> 2013-06-14 21:38:05 INFO  ClientCnxn:849 - Socket connection established to localhost.localdomain/127.0.0.1:40000,
initiating session
>> 2013-06-14 21:38:05 INFO  NIOServerCnxnFactory:197 - Accepted socket connection from
/127.0.0.1:50448
>> 2013-06-14 21:38:05 INFO  ZooKeeperServer:812 - Refusing session request for client
/127.0.0.1:50448 as it has seen zxid 0x2 our last zxid is 0x0 client must try another server
>> 2013-06-14 21:38:05 INFO  NIOServerCnxn:1001 - Closed socket connection for client
/127.0.0.1:50448 (no session established for client)
>> 2013-06
>> 
>> The same result
>> 
>> Timing are not relevant in the tests. The ZK server is providing this logs
>> 
>> 
>> 
>> De: Jordan Zimmerman <jordan@jordanzimmerman.com>
>> Para: Evaristo José Camarero <evaristo.camarero@yahoo.es> 
>> CC: "user@curator.incubator.apache.org" <user@curator.incubator.apache.org>

>> Enviado: Viernes 14 de junio de 2013 21:13
>> Asunto: Re: CuratorFramework is not recovering connecion with ZK
>> 
>> Play with the timings a bit. I consistently see RECONNECTED. Also, make sure you're
using the latest Curator. There's a bug fix in 2.0.1 that can affect this.
>> 
>> -JZ
>> 
>> On Jun 14, 2013, at 12:09 PM, Evaristo José Camarero <evaristo.camarero@yahoo.es>
wrote:
>> 
>>> Hi there,
>>> 
>>> Thanks for the fast response.
>>> 
>>> This is the output of the program with your modifications:
>>> 
>>> 2013-06-14 21:05:01 INFO  CuratorFrameworkImpl:221 - Starting
>>> 2013-06-14 21:05:01 INFO  ConnectionStateManager:151 - State change: CONNECTED
>>> NEw state CONNECTED
>>> 2013-06-14 21:05:13 INFO  ConnectionStateManager:151 - State change: SUSPENDED
>>> NEw state SUSPENDED
>>> 2013-06-14 21:05:16 ERROR CuratorFrameworkImpl:530 - Background operation retry
gave up
>>> org.apache.zookeeper.KeeperException$ConnectionLossException: KeeperErrorCode
= ConnectionLoss
>>> 	at org.apache.zookeeper.KeeperException.create(KeeperException.java:99)
>>> 	at org.apache.curator.framework.imps.CuratorFrameworkImpl.processBackgroundOperation(CuratorFrameworkImpl.java:487)
>>> 	at org.apache.curator.framework.imps.BackgroundSyncImpl$1.processResult(BackgroundSyncImpl.java:50)
>>> 	at org.apache.zookeeper.ClientCnxn$EventThread.processEvent(ClientCnxn.java:606)
>>> 	at org.apache.zookeeper.ClientCnxn$EventThread.run(ClientCnxn.java:495)
>>> 2013-06-14 21:05:16 INFO  ConnectionStateManager:151 - State change: LOST
>>> NEw state LOST
>>> 
>>> 
>>> You can notice that:
>>> "NEW state RECONNECTED never appears"
>>> 
>>> what basically means that CuratorFramework listener is not called, and therefore
I assume that Curator does not reconnect.
>>> 
>>> 
>>> 
>>> Something very interesting is the fact that if you remove "zkClient.create().forPath("/pedro",
"juan".getBytes());" from the program (basically you are not doing any TX, Curator is able
to recover.
>>> 
>>> Regards,
>>> 
>>> Evaristo
>>> 
>>> 
>>> 
>>> De: Jordan Zimmerman <jordan@jordanzimmerman.com>
>>> Para: user@curator.incubator.apache.org; Evaristo José Camarero <evaristo.camarero@yahoo.es>

>>> Enviado: Viernes 14 de junio de 2013 19:06
>>> Asunto: Re: CuratorFramework is not recovering connecion with ZK
>>> 
>>> Curator uses the timeouts given when the CuratorFramework object is created to
manage the connection. In this instance, after LOST has been received Curator may wait until
session timeout before trying to re-create the ZooKeeper object. I'm enclosing a modified
version of your test that sets the Curator timeouts to half the time that the test's sleeps
wait and you can see that Curator recovers correctly.
>>> 
>>> -Jordan
>>> 
>>> 
>>> 
>>> On Jun 14, 2013, at 7:22 AM, Evaristo José Camarero <evaristo.camarero@yahoo.es>
wrote:
>>> 
>>>> 
>>>> Hi there,
>>>> 
>>>> I have found a case in which CuratorFramework is not recovering connection
with ZK servers.
>>>> 
>>>> The use case is the following:
>>>> - Start ZK servers
>>>> - Start application with CuratorFramework
>>>> - ZK servers goes down.
>>>> - ZK servers start again.
>>>> - CuratorFramework app it is not notified that connection is reconnected
, but neither notifies that can not recover connection in any way, so application can not
recover. Only option is to restart the app.
>>>> Notice that if CuratorFRamework client is not making any transaction (just
comment create() TX), the client is able to reconnect.
>>>> 
>>>> I attach a program able to reproduce the problem
>>>> 
>>>> Regards,
>>>> 
>>>> Evaristo
>>>> 
>>>> 
>>>> 
>>>> <CuratorFails.java>
>>> 
>>> 
>>> 
>> 
>> 
>> 
> 
> 
> 


Mime
View raw message