zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Germán Blanco (JIRA) <j...@apache.org>
Subject [jira] [Commented] (ZOOKEEPER-1382) Zookeeper server holds onto dead/expired session ids in the watch data structures
Date Wed, 25 Sep 2013 15:34:09 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-1382?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13777620#comment-13777620
] 

Germán Blanco commented on ZOOKEEPER-1382:
------------------------------------------

I would like to propose to skip the tests cases for this JIRA.
The modification required is minimal and it seems clearly correct. However, trying to build
a test that works and that doesn't require future maintenance seems to be very complex.
I have done the mock classes as suggested by Flavio above, but the tests don't work any more
(at least in trunk). My guess is that it is because of latest changes in the code. I don't
see an easy fix (at least easy for me), and what is worse, it seems to me that these test
cases will require review very often, since they depend very deeply on how things are implemented.
                
> Zookeeper server holds onto dead/expired session ids in the watch data structures
> ---------------------------------------------------------------------------------
>
>                 Key: ZOOKEEPER-1382
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-1382
>             Project: ZooKeeper
>          Issue Type: Bug
>          Components: server
>    Affects Versions: 3.4.5
>            Reporter: Neha Narkhede
>            Assignee: Neha Narkhede
>            Priority: Critical
>             Fix For: 3.4.6
>
>         Attachments: ZOOKEEPER-1382_3.3.4.patch, ZOOKEEPER-1382-branch-3.4.patch, ZOOKEEPER-1382.patch
>
>
> I've observed that zookeeper server holds onto expired session ids in the watcher data
structures. The result is the wchp command reports session ids that cannot be found through
cons/dump and those expired session ids sit there maybe until the server is restarted. Here
are snippets from the client and the server logs that lead to this state, for one particular
session id 0x134485fd7bcb26f -
> There are 4 servers in the zookeeper cluster - 223, 224, 225 (leader), 226 and I'm using
ZkClient to connect to the cluster
> From the application log -
> application.log.2012-01-26-325.gz:2012/01/26 04:56:36.177 INFO [ClientCnxn] [main-SendThread(223.prod:12913)]
[application Session establishment complete on server 223.prod/172.17.135.38:12913, sessionid
= 0x134485fd7bcb26f, negotiated timeout = 6000
> application.log.2012-01-27.gz:2012/01/27 09:52:37.714 INFO [ClientCnxn] [main-SendThread(223.prod:12913)]
[application] Client session timed out, have not heard from server in 9827ms for sessionid
0x134485fd7bcb26f, closing socket connection and attempting reconnect
> application.log.2012-01-27.gz:2012/01/27 09:52:38.191 INFO [ClientCnxn] [main-SendThread(226.prod:12913)]
[application] Unable to reconnect to ZooKeeper service, session 0x134485fd7bcb26f has expired,
closing socket connection
> On the leader zk, 225 -
> zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  [SessionTracker:ZooKeeperServer@314]
- Expiring session 0x134485fd7bcb26f, timeout of 6000ms exceeded
> zookeeper.log.2012-01-27-leader-225.gz:2012-01-27 09:52:34,010 - INFO  [ProcessThread:-1:PrepRequestProcessor@391]
- Processed session termination for sessionid: 0x134485fd7bcb26f
> On the server, the client was initially connected to, 223 -
> zookeeper.log.2012-01-26-223.gz:2012-01-26 04:56:36,173 - INFO  [CommitProcessor:1:NIOServerCnxn@1580]
- Established session 0x134485fd7bcb26f with negotiated timeout 6000 for client /172.17.136.82:45020
> zookeeper.log.2012-01-27-223.gz:2012-01-27 09:52:34,018 - INFO  [CommitProcessor:1:NIOServerCnxn@1435]
- Closed socket connection for client /172.17.136.82:45020 which had sessionid 0x134485fd7bcb26f
> Here are the log snippets from 226, which is the server, the client reconnected to, before
getting session expired event -
> 2012-01-27 09:52:38,190 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@770]
- Client attempting to renew session 0x134485fd7bcb26f at /172.17.136.82:49367
> 2012-01-27 09:52:38,191 - INFO  [QuorumPeer:/0.0.0.0:12913:NIOServerCnxn@1573] - Invalid
session 0x134485fd7bcb26f for client /172.17.136.82:49367, probably expired
> 2012-01-27 09:52:38,191 - INFO  [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:12913:NIOServerCnxn@1435]
- Closed socket connection for client /172.17.136.82:49367 which had sessionid 0x134485fd7bcb26f
> wchp output from 226, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *226.*wchp* | wc
-l
> 3
> wchp output from 223, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *223.*wchp* | wc
-l
> 0
> cons output from 223 and 226, taken on 01/30 -
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *226.*cons* | wc
-l
> 0
> nnarkhed-ld:zk-cons-wchp-2012013000 nnarkhed$ grep 0x134485fd7bcb26f *223.*cons* | wc
-l
> 0
> So, what seems to have happened is that the client was able to re-register the watches
on the new server (226), after it got disconnected from 223, inspite of having an expired
session id. 
> In NIOServerCnxn, I saw that after suspecting that a session is expired, a server removes
the cnxn and its watches from its internal data structures. But before that it allows more
requests to be processed even if the session is expired -
>             // Now that the session is ready we can start receiving packets
>             synchronized (this.factory) {
>                 sk.selector().wakeup();
>                 enableRecv();
>             }
>         } catch (Exception e) {
>             LOG.warn("Exception while establishing session, closing", e);
>             close();
>         }
> I wonder if the client somehow sneaked in the set watches, right after the server removed
the connection through removeCnxn() API ?

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira

Mime
View raw message