zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Unexpected behavior with Session Timeouts in Java Client
Date Thu, 21 Apr 2011 21:47:18 GMT
Scott,

Having your master enter a suspended state is fine, but it cannot act as
master during that time, because somebody else may have become master in
the meantime.

While suspended, the master cannot commit to any actions as a master.  Any
transactions that it accepts must be considered expendable.  Usually that
means that whoever sent the transactions must retain them until the
suspended master either regains its senses or relinquishes its master
state.
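
To make the suspended mode concrete, here is a rough Java sketch of the state
handling described above. It is only an illustration, not code from anybody's
system. The class, enum, and method names are all made up, and the
SessionEvent enum is a stand-in for the Disconnected / SyncConnected / Expired
states that the ZooKeeper Java client reports to a Watcher. It also assumes
that a reconnect within the session timeout means mastership is intact; a more
careful version might re-check its ephemeral node before replaying.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/** Hypothetical master that suspends on disconnect instead of shutting down. */
class SuspendableMaster {
    /** Stand-in for the client's Disconnected / SyncConnected / Expired states. */
    enum SessionEvent { DISCONNECTED, RECONNECTED, EXPIRED }

    enum Mode { ACTIVE, SUSPENDED, DEAD }

    private Mode mode = Mode.ACTIVE;
    // Accepted while suspended but NOT committed; these are expendable.
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> committed = new ArrayList<>();

    synchronized void onSessionEvent(SessionEvent event) {
        switch (event) {
            case DISCONNECTED:
                // Somebody else may become master from here on: stop committing.
                if (mode == Mode.ACTIVE) {
                    mode = Mode.SUSPENDED;
                }
                break;
            case RECONNECTED:
                // Assumption: same session, so mastership survived; replay backlog.
                if (mode == Mode.SUSPENDED) {
                    mode = Mode.ACTIVE;
                    while (!pending.isEmpty()) {
                        committed.add(pending.poll());
                    }
                }
                break;
            case EXPIRED:
                // Mastership is gone for good: drop the backlog, never commit it.
                mode = Mode.DEAD;
                pending.clear();
                break;
        }
    }

    /**
     * Returns true only if the transaction was committed. On false the sender
     * must keep its own copy, since the master may discard what it queued.
     */
    synchronized boolean submit(String txn) {
        switch (mode) {
            case ACTIVE:
                committed.add(txn);
                return true;
            case SUSPENDED:
                pending.add(txn);
                return false;
            default:
                return false;
        }
    }

    synchronized Mode mode() { return mode; }

    synchronized int pendingCount() { return pending.size(); }

    synchronized List<String> committed() { return new ArrayList<>(committed); }
}
```

The important part is the return value of submit(): a false tells the sender
that nothing has been promised, so it has to hold on to the transaction until
the master either resumes or gives up.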

The other question that comes up from your description is how your ZK
cluster is laid out.  Do you have ZooKeeper split across data centers?

On Thu, Apr 21, 2011 at 1:45 PM, Scott Fines <scottfines@gmail.com> wrote:

> Ryan,
>
> That is a fair point in that I would have consistency of services--that is,
> that I would be pretty sure I'd only have one service running at a time.
> However, my particular application demands are such that just stopping and
> re-starting on disconnected events is not a good idea.
>
> What I'm writing is a connector between two data centers, where the
> measured latency is on the order of seconds, and each time a service
> connects, it must transfer (hopefully only a few) megabytes of data,
> which I've measured to take on the order of minutes. On the other hand,
> it is not unusual for us to receive a disconnected event every now and
> then, which is generally resolved on the order of milliseconds. Clearly,
> I don't want to recreate a minutes-long process every time we get a
> milliseconds-long disconnection which does not remove the service's
> existing leadership.
>
> So, when the leader receives a disconnected event, it queues up events to
> process, but holds on to its connections and continues to receive events
> while it waits for a connection to ZK to be re-established. If the
> connection to ZK comes back online within the session timeout window,
> then it will just turn processing back on as if nothing happened.
> However, if the session timeout happens, then the client must cut all of
> its connections and kill itself with fire, rather than overwrite what the
> next leader does. Then the next leader has to go through the expensive
> process of starting the service back up.
>
> Hopefully that will give some color for why I'm concerned about this
> situation.
>
> Thanks,
>
> Scott
>
> On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rckenned@gmail.com> wrote:
>
> > Scott:
> >
> > The right answer in this case is for the leader to watch for the
> > "disconnected" event and shut down. If the connection re-establishes,
> > the leader should still be the leader (their ephemeral sequential node
> > should still be there), in which case it can go back to work. If the
> > connection doesn't re-establish, one of two things may happen…
> >
> > 1) Your leader stays in the disconnected state (because it's unable to
> > reconnect), meanwhile the zookeeper server expires the session
> > (because it hasn't seen a heartbeat), deletes the ephemeral sequential
> > node and a new worker is promoted to leader.
> >
> > 2) Your leader quickly transitions to the expired state, the ephemeral
> > node is lost and a new worker is promoted to leader.
> >
> > In both cases, your initial leader should see a disconnected event
> > first. If it shuts down when it sees that event, you should be
> > relatively safe in thinking that you only have one worker going at a
> > time.
> >
> > Once your initial leader sees the expiration event, it can try to
> > reconnect to the ensemble, create the new ephemeral sequential node
> > and get back into the queue for being a leader.
> >
> > Ryan
> >
>
