zookeeper-user mailing list archives

From Scott Fines <scottfi...@gmail.com>
Subject Re: Unexpected behavior with Session Timeouts in Java Client
Date Thu, 21 Apr 2011 22:41:16 GMT
Perhaps I am not being clear in my description.

I'm building a system that receives data events from an external source.
This external source is not necessarily under my control, and starting up a
connection to that source is highly expensive, as it entails a high-latency,
low-bandwidth transfer of data. It's more stable than packet radio, but not a
whole lot faster. This system retains the ability to recreate events, and
can do so upon request, but the cost to recreate is extremely high.

On the other end, the system pushes these events on to distributed
processing and (eventually) a long-term storage situation, like Cassandra or
Hadoop. Each event that is received can be idempotently applied, so the
system can safely process duplicate messages if they come in.
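For concreteness, the idempotent application I have in mind is roughly this (a minimal sketch; class and method names are mine, and the real system would key on its own event ids and persist the seen-set):

```java
import java.util.HashSet;
import java.util.Set;

// Minimal sketch of idempotent event application: each event carries a
// stable id, and an event seen twice is applied only once, so duplicate
// deliveries are harmless.
public class IdempotentApplier {
    private final Set<String> applied = new HashSet<>();

    // Returns true if the event was applied, false if it was a duplicate.
    public boolean apply(String eventId, Runnable action) {
        if (!applied.add(eventId)) {
            return false; // already processed; safe to drop
        }
        action.run();
        return true;
    }
}
```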

If the world were perfect and we had Infiniband connections to all of our
external sources, then there would be no reason for a leader-election
protocol in this scenario. I would just boot the system on every node, and
have them do their thing, and why worry? Idempotency is a beautiful thing.
Sadly, the world is not perfect, and trying to deal with an already slow
external connection by asking it to send the same data 10 or 15 times is not
a great idea, performance-wise. In addition to slowing everything down on
the receiving end, it also has an adverse effect on the source's
performance; the source, it must be noted, has other things to do besides
just feeding my system data.

So my solution is to limit the number of external connections to 1, and use
ZooKeeper leader elections to manage which machine is running at which time.
This way, we keep the number of external connections as low as possible,
we can guarantee that messages are received and processed idempotently, and
in the normal situation where there is no trouble at all, life is fine.
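The selection step of that election is the usual ephemeral-sequential recipe: everyone creates a sequential node, and the lowest sequence number leads. Just the ordering rule, sketched (node names like "n_0000000001" are illustrative of the suffix the Java client appends for EPHEMERAL_SEQUENTIAL creates; the watch/retry plumbing is omitted):

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;

// Sketch of the decision step in ZooKeeper-style leader election with
// ephemeral sequential nodes: the client whose node sorts first (lowest
// sequence number) is the leader.
public class Election {
    public static boolean isLeader(String myNode, List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        Collections.sort(sorted);
        return sorted.get(0).equals(myNode);
    }
}
```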

What I am trying to deal with right now is how to manage the corner cases of
when communication with ZooKeeper breaks down.

To answer your question about the ZooKeeper cluster installation: no, it is
not located in multiple data centers. It is, however, co-located with other
processes. For about 90-95% (we have an actual measurement, but I can't
remember it off the top of my head) of the time, the resource utilization is
low enough and ZooKeeper is lightweight enough that it makes sense to
co-locate. Occasionally, however, we do see a spike in an individual
machine's utilization. Even more occasionally, that spike can result in
clients being disconnected from that ZooKeeper node. Since almost all the
remainder of the cluster is reachable and appropriately utilized, clients
typically reconnect to another node, and all is well. Of course, with this
arrangement, there is a small chance that the entire cluster could blow out
in usage, which would result in catastrophic ZooKeeper failure. This hasn't
happened yet, but I'm sure it's only a matter of time. However, if this were
to happen, the co-located services would also be at the tipping point, and
would probably crash soon after. Since that system is mission-critical, we
have lots of monitoring in place, as well as a couple of spare nodes to
throw at the cluster in case the overall usage is high. So, in short, it
isn't terribly surprising that we might occasionally get a connection loss
from ZooKeeper, but in almost all cases, that connection loss can be
resolved very quickly.

What can NOT be resolved quickly is our leader failing. This is not
catastrophic, but it does negatively impact performance. Therefore, I wish
to avoid needlessly creating external connections. If we shut down the
external connection upon every connection loss, we would pay a price in
minutes what could easily be resolved in milliseconds. If, however, we wait,
keeping a connection open, knowing that we are still probabilistically the
leader, we can continue to receive events. Once the leadership has been
passed on, however, we should shut down to avoid excess connections.
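Something like this small state machine is what I have in mind (names are mine, and in the real system these transitions would be driven from the ZooKeeper Watcher callbacks; processing here is just a list append as a stand-in for the idempotent apply):

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

// Sketch of the suspend-and-wait policy: on Disconnected the leader keeps
// its external connection and queues incoming events; on reconnect within
// the session it drains the queue as if nothing happened; on expiry it
// shuts down and leaves re-delivery to the next leader.
public class SuspendedLeader {
    public enum State { LEADING, SUSPENDED, SHUT_DOWN }

    private State state = State.LEADING;
    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>(); // stand-in for real processing

    public void onDisconnected() {
        if (state == State.LEADING) state = State.SUSPENDED;
    }

    public void onReconnected() { // session survived: resume as if nothing happened
        if (state == State.SUSPENDED) {
            state = State.LEADING;
            drain();
        }
    }

    public void onExpired() { // leadership lost: cut the external connection
        state = State.SHUT_DOWN;
        pending.clear();
    }

    public void onEvent(String event) {
        if (state == State.SHUT_DOWN) return; // next leader will re-receive these
        pending.add(event);
        if (state == State.LEADING) drain();
    }

    private void drain() {
        while (!pending.isEmpty()) processed.add(pending.poll());
    }

    public List<String> processed() { return processed; }
    public State state() { return state; }
}
```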

What's puzzling to me here is the client expectation. On the one hand, there
is a SessionExpired Event, which clearly indicates that a session has
expired. However, the session only expires if ZooKeeper fails to hear from
you within a specified time frame. This specified time is agreed upon, and
is known by both ZK and the client. Therefore, one expectation would be that
the SessionExpired event would be fired when that timeframe has been
exhausted before a connection with ZK could be reestablished. On the other
hand, Session Expired events only happen AFTER a reconnection to the
cluster. This is strange to me, because it seems that the client has all the
information it needs to know that its session has been expired by
ZooKeeper, and it doesn't really need to first talk to the server to pass
that information back to the application. It's made particularly strange to
me when I think that, for most cases, a failure to talk to ZooKeeper within
the session timeout probably means that you aren't going to be able to
reconnect with it for a lot longer yet, and the client is probably going to just
keep failing. If you were talking about total machine failure, then who
cares? The application is dead anyway. But if you're talking about a network
partition, where the node could conceivably still do its job successfully,
and is only prevented from doing so by a connection to ZooKeeper, as in my
use case, you want that information in a reasonably timely manner.
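The client-side tracking I'm describing is just deadline arithmetic on information the client already has (a sketch, with names of my own invention; a real wrapper would drive this from Disconnected/SyncConnected watcher events and a timer):

```java
// Sketch of local session-expiry tracking: the session timeout is agreed
// between client and server, so once that much time has passed since a
// disconnect with no reconnect, the server has almost certainly expired
// the session, even though the stock client has not yet delivered an
// Expired event.
public class LocalExpiry {
    private final long sessionTimeoutMs;
    private long disconnectedAtMs = -1; // -1 means currently connected

    public LocalExpiry(long sessionTimeoutMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
    }

    public void onDisconnected(long nowMs) {
        if (disconnectedAtMs < 0) disconnectedAtMs = nowMs;
    }

    public void onReconnected() {
        disconnectedAtMs = -1;
    }

    // True once the agreed timeout has elapsed with no reconnect.
    public boolean presumedExpired(long nowMs) {
        return disconnectedAtMs >= 0 && nowMs - disconnectedAtMs >= sessionTimeoutMs;
    }
}
```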

Judging from my questions, the relative number of questions about
SessionExpiration versus disconnect that have been answered, and the fact
that Twitter and Travis (and presumably others as well) have felt the need
to implement this behavior outside of the system, it seems like I'm not the
only one confused about this. So my question is: Why does the ZooKeeper
client do notification only upon reconnect? Is there a
performance/consistency reason that I'm not seeing?

Thanks, and sorry for the uber-long message,


On Thu, Apr 21, 2011 at 4:47 PM, Ted Dunning <ted.dunning@gmail.com> wrote:

> Scott,
> Having your master enter a suspended state is fine, but it cannot act as
> master during this time (because somebody else may have become master
> during
> this time).
> It is fine to enter a suspended mode, but the suspended master cannot
> commit
> to any actions as a master.  Any transactions that it accepts must be
> considered as expendable.  Usually that means that whoever sent the
> transactions must retain them until the suspended master regains its senses
> or relinquishes its master state.
> The other question that comes up from your description is how your ZK
> cluster works.  Do you have zookeeper split across data centers?
> On Thu, Apr 21, 2011 at 1:45 PM, Scott Fines <scottfines@gmail.com> wrote:
> > Ryan,
> >
> > That is a fair point in that I would have consistency of services--that
> is,
> > that I would be pretty sure I'd only have one service running at a time.
> > However, my particular application demands are such that just stopping
> and
> > re-starting on disconnected events is not a good idea.
> >
> > What I'm writing is a connector between two data centers, where the
> > measured
> > latency is on the order of seconds, and each time a service connects, it
> > must transfer (hopefully only a few) megabytes of data, which I've
> measured
> > to take on the order of minutes. On the other hand, it is not unusual for
> > us
> > to receive a disconnected event every now and then, which is generally
> > resolved on the order of milliseconds. Clearly, I don't want to recreate
> a
> > minutes-long process every time we get a milliseconds-long disconnection
> > which does not remove the service's existing leadership.
> >
> > So, when the leader receives a disconnected event, it queues up events to
> > process, but holds on to its connections and continues to receive events
> > while it waits for a connection to ZK to be re-established. If the
> > connection to ZK comes back online within the session timeout window,
> then
> > it will just turn processing back on as if nothing happened. However, if
> > the
> > session timeout happens, then the client must cut all of its connections
> > and
> > kill itself with fire, rather than overwrite what the next leader does.
> > Then
> > the next leader has to go through the expensive process of starting the
> > service back up.
> >
> > Hopefully that will give some color for why I'm concerned about this
> > situation.
> >
> > Thanks,
> >
> > Scott
> >
> > On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rckenned@gmail.com>
> wrote:
> >
> > > Scott:
> > >
> > >  the right answer in this case is for the leader to watch for the
> > > "disconnected" event and shut down. If the connection re-establishes,
> > > the leader should still be the leader (their ephemeral sequential node
> > > should still be there), in which case it can go back to work. If the
> > > connection doesn't re-establish, one of two things may happen…
> > >
> > > 1) Your leader stays in the disconnected state (because it's unable to
> > > reconnect), meanwhile the zookeeper server expires the session
> > > (because it hasn't seen a heartbeat), deletes the ephemeral sequential
> > > node and a new worker is promoted to leader.
> > >
> > > 2) Your leader quickly transitions to the expired state, the ephemeral
> > > node is lost and a new worker is promoted to leader.
> > >
> > > In both cases, your initial leader should see a disconnected event
> > > first. If it shuts down when it sees that event, you should be
> > > relatively safe in thinking that you only have one worker going at a
> > > time.
> > >
> > > Once your initial leader sees the expiration event, it can try to
> > > reconnect to the ensemble, create the new ephemeral sequential node
> > > and get back into the queue for being a leader.
> > >
> > > Ryan
> > >
> >
