zookeeper-dev mailing list archives

From "Camille Fournier (JIRA)" <j...@apache.org>
Subject [jira] Commented: (ZOOKEEPER-922) enable faster timeout of sessions in case of unexpected socket disconnect
Date Mon, 10 Jan 2011 22:31:45 GMT

    [ https://issues.apache.org/jira/browse/ZOOKEEPER-922?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12979814#action_12979814 ]

Camille Fournier commented on ZOOKEEPER-922:

Ok, here's a recap of what the problem is, what the boundaries of the problem are from my
point of view, and what the current solution proposed above lacks. If the boundaries of the
problem and solution are unacceptable from the POV of the rest of the community, then I guess
we're at an impasse. So please take some time to read:

Problem description: When a client crashes, its ephemeral nodes need to wait until the negotiated
session timeout has passed before they will be removed by the leader. 

We would like for clients that have crashed to have their ephemeral nodes removed quickly.
The duration of visibility of "stale" ephemeral nodes (that is, those that were created by
now-dead clients) is directly correlated to the window of time in which the system is in an
incorrect state, for the purposes of our use case (dynamic discovery).

Without changing the code at all, we could approximate this by lowering the session timeout for
all clients. However, that would introduce a different kind of potential inconsistency. For one,
clients doing long stop-the-world garbage collection would have their sessions time out despite
the fact that they are actually still alive (a very real likelihood in our working environment).
For another, if one of the ensemble members dies and clients have to fail over, a very short
session timeout could result in prematurely killed sessions for otherwise live clients. We would
like to be able to detect likely client crashes and clean up their sessions quickly, while
keeping a longer session timeout for clients we believe to be connected. 

We are willing to tolerate both a small number of false positives (believing clients have crashed
when they are alive) and a small number of false negatives (believing clients are alive, and
waiting for the full session timeout before removing them, when they have in fact crashed). Given
the nature of systems and networks, it is impossible to tell 100% of the time whether a client
is truly alive or dead (a switch could crash, the client could pause for GC, etc.), and the
occasional missed guess is acceptable so long as the system otherwise retains its general
coherence and correctness guarantees.

Any solution to this problem must retain the ability for client sessions to migrate between
ensemble members in the case where the client sees a disconnection from the ZK cluster because
the ensemble member it was connected to crashed. 

Current system fundamentals:

The only way that a server can "see" a client crash is through an error that causes the socket
to close and throw an exception (NIOServerCnxn:doIO). If a client crashes without this socket
closing (say, by having the network cable to that server pulled), the server will not see
a socket close and will have to time out normally. This is an acceptable edge condition from
our point of view.

Additionally, it is possible that a server will "see" a client crash when in fact the socket
was closed unexpectedly on both ends, due to a scenario like a network switch failure. This
would result in a false positive crash detection by the zk server, and possibly result in
the client's session being timed out before the client has a chance to fail over to a different
server. This is also an acceptable edge condition from our point of view.

Session timeouts are controlled by the SessionTracker, which is maintained by the current
leader. That tracker table is updated every time the leader receives a record of pings from
its followers. Each session is associated with an "owner", the ensemble member currently
thought to be maintaining the session; however, the owner is not checked in the case of
a ping. 
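To make the stock expiry mechanism concrete, here is a minimal sketch of the leader's bucketed tracking. The class and method names here are illustrative stand-ins (only the rounding and the "later only" rule mirror the real SessionTrackerImpl): expiry times are rounded up to interval boundaries, and a touch can only push a session's expiry later, never earlier.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified sketch of ZooKeeper's bucketed session-expiry scheme.
// Names and structure are illustrative, not the actual SessionTrackerImpl code.
class ExpiryTrackerSketch {
    final long expirationInterval;                          // bucket width, e.g. tickTime
    final Map<Long, Long> sessionExpiry = new HashMap<>();  // sessionId -> expiry bucket

    ExpiryTrackerSketch(long expirationInterval) {
        this.expirationInterval = expirationInterval;
    }

    // Round (now + timeout) up to the next bucket boundary.
    long roundToInterval(long time) {
        return (time / expirationInterval + 1) * expirationInterval;
    }

    // Stock behaviour: a touch can only move the expiry LATER, never earlier.
    void touchSession(long sessionId, long now, long timeout) {
        long newExpiry = roundToInterval(now + timeout);
        Long current = sessionExpiry.get(sessionId);
        if (current == null || newExpiry > current) {
            sessionExpiry.put(sessionId, newExpiry);
        }
    }

    Long expiryOf(long sessionId) {
        return sessionExpiry.get(sessionId);
    }
}
```

This "later only" guard is exactly why a shortened timeout cannot take effect in the unmodified tracker: a touch carrying a smaller timeout is simply ignored.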


Proposed solution:

When we see an exception on the socket resulting in a socket close, we lower the timeout for
the session associated with that connection. If the client does not reconnect within this
shortened window, the session is timed out and its ephemeral state is removed.

The simplest version of the change can be seen in the first patch submitted to ZOOKEEPER-922.
This change does the following:

In NIOServerCnxn:doIO, when an exception is caught that is not from the client explicitly
calling close, instead of just closing the connection, we "touch" the SessionTracker with
a timeout set by the user (minSessionTimeout), then close the connection.
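In outline, the touch-and-close path might look like the following sketch. The class, field, and helper names here are hypothetical stand-ins for the real NIOServerCnxn code, not the actual patch:

```java
import java.io.IOException;

// Illustrative sketch of the doIO change: on an unexpected socket error,
// touch the session with the short minSessionTimeout before closing, so the
// leader can expire it quickly if the client never reconnects.
class CnxnSketch {
    interface SessionTracker {
        void touchSession(long sessionId, int timeout);
    }

    final SessionTracker tracker;
    final long sessionId;
    final int minSessionTimeout;    // configured fast-fail window
    boolean closedByClient = false; // set when the client sent an explicit close
    boolean closed = false;

    CnxnSketch(SessionTracker tracker, long sessionId, int minSessionTimeout) {
        this.tracker = tracker;
        this.sessionId = sessionId;
        this.minSessionTimeout = minSessionTimeout;
    }

    void doIO() {
        try {
            readFromSocket();
        } catch (IOException e) {
            if (!closedByClient) {
                // "touch and close": shorten the session's timeout, then close.
                tracker.touchSession(sessionId, minSessionTimeout);
            }
            close();
        }
    }

    // Stand-in for the real socket read; simulates a peer crash.
    void readFromSocket() throws IOException {
        throw new IOException("connection reset by peer");
    }

    void close() {
        closed = true;
    }
}
```

The only behavioral difference from the stock code path is the extra touch before the close; an explicit client-initiated close still skips it.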

This results in one of two workflows. A follower will insert the sessionId and the shortened
sessionTimeout into its touchTable, which is sent to the leader on the next ping; the leader
then calls SessionTrackerImpl.touchSession. If the leader itself is the one doing the
touch-and-close, it calls SessionTrackerImpl.touchSession directly. touchSession has been
modified to allow a session's expiration time to be set lower as well as higher.

These changes have been verified to produce a functional (not necessarily bug-free) implementation
of the desired spec.

Possible issues:

1. If a client and a server each see a socket disconnection due to a network switch failure,
the client will have a shorter window in which to fail over to a different server before its
session is timed out. This is fine with me, but since the shorter timeout is configurable,
users for whom this risk is not worth the benefit can set minSessionTimeout equal to the
negotiated session timeout to eliminate it. Therefore I'm not going to attempt to fix this.

2. If a client and a server both see disconnections but the client manages to fail over and
migrate its session before the original server sends its session tracker update with the reduced
session timeout, the client could potentially have its session timed out if it does not heartbeat
in the reduced timeout window, despite having failed over. This reduced timeout window would
only last until the new zk ensemble member re-pinged for that client, but there is a window
of vulnerability. This could be fixed before making this change.

Are we all on the same page so far with this? All I want to do is enable fast failing for
those who want it, if they are willing to accept the possibility that certain network failures
could cause over-aggressive session timeout for clients that are not actually dead. 

> enable faster timeout of sessions in case of unexpected socket disconnect
> -------------------------------------------------------------------------
>                 Key: ZOOKEEPER-922
>                 URL: https://issues.apache.org/jira/browse/ZOOKEEPER-922
>             Project: ZooKeeper
>          Issue Type: Improvement
>          Components: server
>            Reporter: Camille Fournier
>            Assignee: Camille Fournier
>             Fix For: 3.4.0
>         Attachments: ZOOKEEPER-922.patch
> In the case when a client connection is closed due to socket error instead of the client
> calling close explicitly, it would be nice to enable the session associated with that client
> to time out faster than the negotiated session timeout. This would enable a zookeeper ensemble
> that is acting as a dynamic discovery provider to remove ephemeral nodes for crashed clients
> quickly, while allowing for a longer heartbeat-based timeout for java clients that need to
> do long stop-the-world GC. 
> I propose doing this by setting the timeout associated with the crashed session to "minSessionTimeout".

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.
