zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Session closing delay issue
Date Fri, 08 Jun 2012 22:11:16 GMT
On Thu, Jun 7, 2012 at 11:38 AM, Thawan Kooburat <thawan@fb.com> wrote:
> We have a ZooKeeper ensemble that spans multiple data centers (each participant
> is in a different data center). Recently, we ran into an issue when trying to support a low session
> timeout (5 seconds). We set tickTime to 2 seconds and syncLimit to 25.
>

syncLimit of 25 with tickTime of 2000 means that you are allowing the
followers to run up to 50 seconds behind the leader.
http://zookeeper.apache.org/doc/r3.4.3/zookeeperAdmin.html#sc_clusterOptions
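
To make the arithmetic concrete, here is a minimal sketch in Java. The
class and method names are hypothetical and not part of ZooKeeper; the
only assumption taken from the admin guide is that syncLimit is counted
in ticks of tickTime milliseconds.

```java
// Hypothetical helper, not part of ZooKeeper: relates tickTime, syncLimit
// and a requested session timeout.
final class SyncLimitCheck {

    // Upper bound on how far a follower may lag the leader, in milliseconds.
    // syncLimit is expressed in ticks, each tick lasting tickTimeMs.
    static long maxFollowerLagMs(int tickTimeMs, int syncLimit) {
        return (long) tickTimeMs * syncLimit;
    }

    // True when a session can time out on the leader while a lagging
    // follower is still considered part of the quorum.
    static boolean sessionCanExpireWhileFollowerLags(
            int tickTimeMs, int syncLimit, int sessionTimeoutMs) {
        return sessionTimeoutMs < maxFollowerLagMs(tickTimeMs, syncLimit);
    }

    public static void main(String[] args) {
        int tickTimeMs = 2000;       // tickTime from the thread
        int syncLimit = 25;          // syncLimit from the thread
        int sessionTimeoutMs = 5000; // session timeout from the thread

        System.out.println("max follower lag (ms): "
                + maxFollowerLagMs(tickTimeMs, syncLimit));
        System.out.println("session can expire first: "
                + sessionCanExpireWhileFollowerLags(
                        tickTimeMs, syncLimit, sessionTimeoutMs));
    }
}
```

With the values from this thread the lag bound is 50000 ms, ten times
the 5000 ms session timeout.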

> Our use case is a single master: we can have only one master at any given time. The
> active master creates an ephemeral node. The backup master watches for this ephemeral node to
> be deleted before it takes over the master role.
>
> The active master is connected to the follower (F1) in its data center. We believe that
> a network delay between F1 and the leader caused the touchTable to not propagate in a timely
> manner. The leader decided to close the session due to the timeout. The ephemeral node delete event
> reached the other follower (F2) before the close session event reached F1. The backup master, which
> is connected to F2, got the ephemeral delete and assumed the role of the active master.
>
> From our logs, the active master saw the session expire event 14 seconds after the backup
> master received the ephemeral node delete event.
>

This is a consequence of setting the session timeout lower than the
window allowed by syncLimit (syncLimit * tickTime).

a) the leader will expire the session after 5 seconds of not hearing
from the client
b) the follower F1 can run up to 50 seconds behind the leader, i.e. with
no communication between the follower and the leader, including client
heartbeat updates
c) let's say that F2 has perfect communication

in which case the leader might decide that the session is expired and
notify the followers. F2 gets the result quickly, F1 does not.
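
The a) to c) scenario can be sketched as a simple timeline. This is an
illustration only: the class, the method names, and the two propagation
delays are assumptions, chosen so that the gap matches the 14 seconds
reported in the logs.

```java
// Illustrative timeline only; names and delay values are assumptions.
final class ExpiryTimeline {

    // Time (ms after the leader last heard from the client) at which the
    // clients of a follower observe the expiry: the leader expires the
    // session after sessionTimeoutMs, and the result then needs
    // propagationDelayMs to reach that follower.
    static long observedAtMs(long sessionTimeoutMs, long propagationDelayMs) {
        return sessionTimeoutMs + propagationDelayMs;
    }

    // Window in which the backup (on the fast follower) may already act as
    // master while the old master (on the slow follower) has not yet seen
    // its session expire.
    static long overlapMs(long slowDelayMs, long fastDelayMs) {
        return Math.max(0L, slowDelayMs - fastDelayMs);
    }

    public static void main(String[] args) {
        long sessionTimeoutMs = 5_000; // from the thread
        long f2DelayMs = 100;          // assumed: F2 is well connected
        long f1DelayMs = 14_100;       // assumed: F1 lags (up to 50_000 is allowed)

        System.out.println("F2 sees expiry at (ms): "
                + observedAtMs(sessionTimeoutMs, f2DelayMs));
        System.out.println("F1 sees expiry at (ms): "
                + observedAtMs(sessionTimeoutMs, f1DelayMs));
        System.out.println("overlap (ms): " + overlapMs(f1DelayMs, f2DelayMs));
    }
}
```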

Typically what happens is that the follower will fall out of the
quorum before the session has a chance to expire, at which point the
client will get disconnected from the follower immediately (follower
out of quorum closes all client connections until it's able to
rejoin).

> I tried to look at the code, but from my current understanding we don't have logic that
> enforces an upper bound on how far a particular follower can lag behind (in terms of data tree processing).
> This means some part of the system may see that the lock is released before the previous
> owner has released it.
>

See org.apache.zookeeper.server.quorum.LearnerHandler.synced(), called
from org.apache.zookeeper.server.quorum.Leader.lead().

There is no guarantee that all clients see the events at the same
time. Only that they see them in the same order. There's always a
possibility of a race where the client on F2 sees the znode removed
before the client on F1. This effect is magnified in a cross-DC
scenario. Also consider that there is a lag in server/client
communication as well.

Have you looked at ZooKeeper.sync? This ensures that the follower is
up to date with the leader (at the time sync is processed). This may
or may not allow you to resolve the problem for this particular use
case though... (the syncLimit vs timeout being the key issue)

> Another issue that I saw in this case is that the client maintains an internal clock
> for when its session should expire based on its connectivity with the follower. However, the
> leader's internal clock (session tracker) uses information that gets relayed from the follower
> via touchTable. As a result, the two parties may decide differently when the session has expired
> if there are network issues between the follower and the leader.
>

The client only tracks when it should disconnect from the server; this
is not involved with session expiration per se. The leader tracks
session expiration relative to the last time it heard a heartbeat from
the client (max gap being the session timeout).
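
As a rough illustration of the leader-side rule, here is a simplified
stand-in in Java. It is not ZooKeeper's actual session tracker; the
class and its methods are hypothetical, and time is passed in explicitly
to keep the sketch deterministic.

```java
import java.util.HashMap;
import java.util.Map;

// Simplified stand-in for leader-side expiry tracking; NOT the real
// ZooKeeper session tracker. All names here are hypothetical.
final class SimpleSessionTracker {
    private final Map<Long, Long> lastHeardMs = new HashMap<>();
    private final long sessionTimeoutMs;

    SimpleSessionTracker(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    // Record that a heartbeat for this session reached the leader at nowMs.
    // Heartbeats that are stuck between follower and leader never get here.
    void touch(long sessionId, long nowMs) {
        lastHeardMs.put(sessionId, nowMs);
    }

    // A session is expired once the gap since the last heartbeat seen by
    // the leader exceeds the session timeout. Unknown sessions count as expired.
    boolean isExpired(long sessionId, long nowMs) {
        Long last = lastHeardMs.get(sessionId);
        return last == null || nowMs - last > sessionTimeoutMs;
    }

    public static void main(String[] args) {
        SimpleSessionTracker leader = new SimpleSessionTracker(5_000);
        leader.touch(1L, 0);
        // The client keeps pinging its follower, but if those updates are
        // not relayed, the leader still expires the session.
        System.out.println(leader.isExpired(1L, 4_000));
        System.out.println(leader.isExpired(1L, 6_000));
    }
}
```

The point of the sketch is that only heartbeats that actually reach the
leader count; what the client and its follower exchange in the meantime
does not move the leader's deadline.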

Patrick
