zookeeper-user mailing list archives

From Scott Fines <scottfi...@gmail.com>
Subject Re: Unexpected behavior with Session Timeouts in Java Client
Date Thu, 21 Apr 2011 20:45:23 GMT
Ryan,

That is a fair point: I would get consistency of services, in that I could
be fairly sure only one service is running at a time. However, the demands
of my particular application are such that simply stopping and restarting
on every disconnected event is not a good idea.

What I'm writing is a connector between two data centers, where the measured
latency is on the order of seconds, and each time a service connects, it
must transfer (hopefully only a few) megabytes of data, which I've measured
to take on the order of minutes. On the other hand, it is not unusual for us
to receive a disconnected event every now and then, which is generally
resolved on the order of milliseconds. Clearly, I don't want to repeat a
minutes-long process every time we get a milliseconds-long disconnection
that does not cost the service its existing leadership.

So, when the leader receives a disconnected event, it queues up events to
process, but holds on to its connections and continues to receive events
while it waits for a connection to ZK to be re-established. If the
connection to ZK comes back online within the session timeout window, then
it will just turn processing back on as if nothing happened. However, if the
session timeout happens, then the client must cut all of its connections and
kill itself with fire, rather than overwrite what the next leader does. Then
the next leader has to go through the expensive process of starting the
service back up.
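
To make that concrete, here is a rough sketch of the state handling I have
in mind. It is deliberately not wired to ZooKeeper: ConnState, LeaderGuard,
and the method names are made up for illustration, and the work items are
plain strings. In the real Java client the three states would correspond to
Watcher.Event.KeeperState.SyncConnected, Disconnected, and Expired as
delivered to the Watcher, and the shutdown step would also close the
cross-data-center connections.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Illustrative only: a leader that buffers work while disconnected,
// resumes on reconnect, and shuts down for good on session expiry.
final class LeaderGuard {
    // Hypothetical stand-in for the connection states discussed above.
    enum ConnState { CONNECTED, DISCONNECTED, EXPIRED }

    private ConnState state = ConnState.CONNECTED;
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private boolean shutDown = false;

    // Called whenever the connection state changes.
    synchronized void onConnectionEvent(ConnState next) {
        if (shutDown) {
            return; // once expired, this instance never resumes
        }
        state = next;
        switch (next) {
            case CONNECTED:
                // Session survived: drain everything queued while offline.
                while (!pending.isEmpty()) {
                    processed.add(pending.poll());
                }
                break;
            case DISCONNECTED:
                // Keep data connections open; just stop processing.
                break;
            case EXPIRED:
                // Leadership is gone: discard queued work rather than
                // risk overwriting what the next leader does.
                pending.clear();
                shutDown = true;
                break;
        }
    }

    // Called for each incoming unit of work.
    synchronized void onWork(String item) {
        if (shutDown) {
            return;
        }
        if (state == ConnState.CONNECTED) {
            processed.add(item);
        } else {
            pending.add(item);
        }
    }

    synchronized List<String> processed() {
        return new ArrayList<>(processed);
    }

    synchronized int pendingCount() {
        return pending.size();
    }

    synchronized boolean isShutDown() {
        return shutDown;
    }
}
```

The point of the sketch is that DISCONNECTED only pauses processing, while
EXPIRED is terminal for the instance. It relies entirely on the expiration
event actually being delivered to the client.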

Hopefully that will give some color for why I'm concerned about this
situation.

Thanks,

Scott

On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rckenned@gmail.com> wrote:

> Scott:
>
> The right answer in this case is for the leader to watch for the
> "disconnected" event and shut down. If the connection re-establishes,
> the leader should still be the leader (their ephemeral sequential node
> should still be there), in which case it can go back to work. If the
> connection doesn't re-establish, one of two things may happen…
>
> 1) Your leader stays in the disconnected state (because it's unable to
> reconnect), meanwhile the zookeeper server expires the session
> (because it hasn't seen a heartbeat), deletes the ephemeral sequential
> node and a new worker is promoted to leader.
>
> 2) Your leader quickly transitions to the expired state, the ephemeral
> node is lost and a new worker is promoted to leader.
>
> In both cases, your initial leader should see a disconnected event
> first. If it shuts down when it sees that event, you should be
> relatively safe in thinking that you only have one worker going at a
> time.
>
> Once your initial leader sees the expiration event, it can try to
> reconnect to the ensemble, create the new ephemeral sequential node
> and get back into the queue for being a leader.
>
> Ryan
>
