curator-dev mailing list archives

From "Eric Tschetter (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-36) Bad session, infinite connection loop from Curator
Date Mon, 23 Sep 2013 16:36:05 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-36?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13774704#comment-13774704 ]

Eric Tschetter commented on CURATOR-36:
---------------------------------------

Ok, I've looked at this and I'm not sure that a fix is actually possible.

Basically, the ConnectionStateManager appears to be the only thing that can see the churn
that ZooKeeper is doing.  It has enough information to be able to decide that it would be
best to kill the current ZooKeeper instance and make a new one, but I don't believe it is
connected up to the right things in order to be able to actually act on that decision.

From chatting with Jordan, it looks like the decision needs to happen on ConnectionState,
but ConnectionState belongs to CuratorZookeeperClient and ConnectionStateManager belongs to
CuratorFrameworkImpl, which also has the instance of CuratorZookeeperClient.

So, I'm thinking that this change will require a lot of surgery on the internals to make it
actually work.  Given that it's just a work-around for a ZooKeeper issue in the first place,
perhaps it is not worth doing?

Jordan, any thoughts on how ConnectionStateManager could communicate with ConnectionState
such that ConnectionState can choose to kill the ZooKeeper instance?  Please assign back to
me with comments if you have an idea for how this can be done.  Otherwise, let's just close
it as Not Fixed and move on.
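
To make the detection half of the idea concrete, here is a minimal sketch of how the
SUSPENDED/RECONNECTED churn could be recognized. Everything in it is an assumption for
illustration: FlapDetector, recordSuspended, and the onFlapping callback are hypothetical
names, not part of Curator, and the sketch says nothing about how the callback would reach
ConnectionState, which is the open question above.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not a Curator class): counts SUSPENDED transitions
 * inside a sliding time window and fires a callback once a threshold is hit.
 */
class FlapDetector {
    private final int maxSuspends;      // SUSPENDED events tolerated per window
    private final long windowMillis;    // length of the sliding window
    private final Runnable onFlapping;  // assumed hook, e.g. "discard the ZooKeeper handle"
    private final Deque<Long> suspendTimes = new ArrayDeque<>();

    FlapDetector(int maxSuspends, long windowMillis, Runnable onFlapping) {
        this.maxSuspends = maxSuspends;
        this.windowMillis = windowMillis;
        this.onFlapping = onFlapping;
    }

    /** Records one SUSPENDED transition seen at nowMillis; returns true if the callback fired. */
    synchronized boolean recordSuspended(long nowMillis) {
        suspendTimes.addLast(nowMillis);
        // Drop events that have aged out of the window.
        while (!suspendTimes.isEmpty() && nowMillis - suspendTimes.peekFirst() > windowMillis) {
            suspendTimes.removeFirst();
        }
        if (suspendTimes.size() >= maxSuspends) {
            suspendTimes.clear();  // start counting afresh after acting
            onFlapping.run();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        FlapDetector detector = new FlapDetector(
                3, 10_000, () -> System.out.println("would reset the ZooKeeper handle here"));
        // Sub-second churn, loosely modelled on the client log in the issue.
        detector.recordSuspended(0);
        detector.recordSuspended(400);
        detector.recordSuspended(800);  // third SUSPENDED within 10 s triggers the callback
    }
}
```

Something like this could be fed from a ConnectionStateListener whenever the new state is
SUSPENDED. The threshold and window values are placeholders, and the hard part remains
wiring the callback through to the object that owns the ZooKeeper instance.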
                
> Bad session, infinite connection loop from Curator
> --------------------------------------------------
>
>                 Key: CURATOR-36
>                 URL: https://issues.apache.org/jira/browse/CURATOR-36
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Framework
>    Affects Versions: 2.0.1-incubating
>            Reporter: Eric Tschetter
>            Assignee: Eric Tschetter
>
> On the ZK clients that I am running Curator on, we sometimes see reconnect loops like
> the following.  These are infinite and happen until the process is restarted.
> 2013-06-18 19:57:28,660 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
> 2013-06-18 19:57:28,660 WARN [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - ConnectionStateManager queue full - dropping events to make room
> 2013-06-18 19:57:28,786 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: SUSPENDED
> 2013-06-18 19:57:28,786 WARN [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - ConnectionStateManager queue full - dropping events to make room
> 2013-06-18 19:57:29,048 INFO [main-SendThread(ip-10:2181)] org.apache.zookeeper.ClientCnxn - Opening socket connection to server ip-10/10.:2181. Will not attempt to authenticate using SASL (Unable to locate a login configuration)
> 2013-06-18 19:57:29,049 INFO [main-SendThread(ip-10:2181)] org.apache.zookeeper.ClientCnxn - Socket connection established to ip-10/10.:2181, initiating session
> 2013-06-18 19:57:29,160 WARN [main-SendThread(ip-10:2181)] org.apache.zookeeper.ClientCnxnSocket - Connected to an old server; r-o mode will be unavailable
> 2013-06-18 19:57:29,160 INFO [main-SendThread(ip-10:2181)] org.apache.zookeeper.ClientCnxn - Session establishment complete on server ip-10/10.:2181, sessionid = 0x63f5865925e0010, negotiated timeout = 30000
> 2013-06-18 19:57:29,177 INFO [main-EventThread] org.apache.curator.framework.state.ConnectionStateManager - State change: RECONNECTED
> Looking on the ZK side, it looks like
> 2013-06-18 20:07:31,215 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1580] - Established session 0x63f5865925e0010 with negotiated timeout 30000 for client /10.:56263
> 2013-06-18 20:07:31,324 - WARN  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@639] - Exception causing close of session 0x63f5865925e0010 due to java.io.IOException: Len error 6736057
> 2013-06-18 20:07:31,325 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxn@1435] - Closed socket connection for client /10.:56263 which had sessionid 0x63f5865925e0010
> So, there appears to be some issue with trying to recover the session.  I don't know
> exactly what is causing that issue recovering the session, but it would be awesome if Curator
> were able to notice that it's failing at getting its session back and just try to make a brand
> new connection.
> It appears like this might be doable in reaction to the ConnectionStateManager queue
> filling up?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
