zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Unexpected behavior with Session Timeouts in Java Client
Date Thu, 21 Apr 2011 23:26:06 GMT
Nice summary.  Your situation is definitely not the normal case and I
completely misunderstood your original question.

Since you can survive double connections, I would recommend something a bit
different than before.

Your master can keep downloading during a connection loss and stop
downloading at t_loss + session-expiration + epsilon if connectivity with ZK
is not re-established or whenever an expiration event is received.  For
momentary connection losses, the master would continue downloads with no
interruption and nobody would be tempted to take over.  For longer
connection losses, you may have epsilon time period with the old master
still downloading.
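
The rule above can be sketched as a small state machine. This is a hypothetical sketch, not ZooKeeper API: the client states are modeled as a plain enum (in the real Java client they would be `Watcher.Event.KeeperState` values), and the clock is passed in explicitly so the timing rule stays visible.

```java
public class SuspendingMaster {
    public enum State { SYNC_CONNECTED, DISCONNECTED, EXPIRED }

    private final long sessionTimeoutMs; // negotiated with the ensemble
    private final long epsilonMs;        // safety slack past the timeout
    private volatile long disconnectedAt = -1; // -1 means currently connected
    private volatile boolean expired = false;

    public SuspendingMaster(long sessionTimeoutMs, long epsilonMs) {
        this.sessionTimeoutMs = sessionTimeoutMs;
        this.epsilonMs = epsilonMs;
    }

    // Fed from the ZK watcher thread.
    public void onEvent(State s, long nowMs) {
        switch (s) {
            case DISCONNECTED:   disconnectedAt = nowMs; break; // t_loss
            case SYNC_CONNECTED: disconnectedAt = -1;    break; // momentary loss
            case EXPIRED:        expired = true;         break; // hard stop
        }
    }

    // The download loop polls this before each chunk: keep going until
    // t_loss + session-timeout + epsilon, or until the session expires.
    public boolean mayKeepDownloading(long nowMs) {
        if (expired) return false;
        long t = disconnectedAt;
        return t < 0 || nowMs < t + sessionTimeoutMs + epsilonMs;
    }
}
```

For a momentary loss the deadline never fires and the download is uninterrupted; for a longer one the old master stops itself at most epsilon after the session could have been handed to someone else.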

In the worst case, you may have a network partition followed by ZK restart
which would expire the session early and cause somebody else to take over
the download early.  This is a fairly bizarre scenario with only moderate
cost and pretty low likelihood.  Certain power loss scenarios are the most
common way for this to happen.

It should be pointed out that you can have connection loss in normal
operation due to the standard procedure of doing a rolling restart during an
upgrade.  This will cause a transient connection loss without session
expiration as the server you are connected to restarts.  You should plan for
these events because otherwise you can't upgrade ZK.
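
Concretely, the policy is: ride out Disconnected (the client library will reconnect to another server on its own), and only give up leadership on Expired. A minimal sketch, with the states again modeled as a plain enum rather than the real `KeeperState`:

```java
public class ReconnectPolicy {
    public enum ZkState { SYNC_CONNECTED, DISCONNECTED, EXPIRED }
    public enum Action  { CONTINUE, SUSPEND, RESIGN }

    public static Action react(ZkState s) {
        switch (s) {
            case DISCONNECTED: return Action.SUSPEND; // transient: rolling restart, etc.
            case EXPIRED:      return Action.RESIGN;  // somebody else may now be master
            default:           return Action.CONTINUE;
        }
    }
}
```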

On Thu, Apr 21, 2011 at 3:41 PM, Scott Fines <scottfines@gmail.com> wrote:

> Perhaps I am not being clear in my description.
> I'm building a system, that receives data events from an external source.
> This external source is not necessarily under my control, and starting up a
> connection to that source is highly expensive, as it entails a
> high-latency,
> low-bandwidth transfer of data. It's more stable than packet radio, but not a
> whole lot faster. This system retains the ability to recreate events, and
> can do so upon request, but the cost to recreate is extremely high.
> On the other end, the system pushes these events on to distributed
> processing and (eventually) a long-term storage situation, like Cassandra
> or
> Hadoop. Each event that is received can be idempotently applied, so the
> system can safely process duplicate messages if they come in.
> If the world were perfect and we had Infiniband connections to all of our
> external sources, then there would be no reason for a leader-election
> protocol in this scenario. I would just boot the system on every node, and
> have them do their thing, and why worry? Idempotency is a beautiful thing.
> Sadly, the world is not perfect, and trying to deal with an already slow
> external connection by asking it to send the same data 10 or 15 times is
> not
> a great idea, performance-wise. In addition to slowing everything down on
> the receiving end, it also has an adverse effect on the source's
> performance; the source, it must be noted, has other things to do besides
> just feeding my system data.
> So my solution is to limit the number of external connections to 1, and use
> ZooKeeper leader-elections to manage which machine is running at which
> time.
> This way, we keep the number of external connections down as low as we can,
> we can guarantee that messages are received and processed idempotently, and
> in the normal situation where there is no trouble at all, life is fine.
> What I am trying to deal with right now is how to manage the corner cases
> of
> when communication with ZooKeeper breaks down.
> To answer your question about the ZooKeeper cluster installation: no, it is
> not located in multiple data centers. It is, however, co-located with other
> processes. For about 90-95% (we have an actual measurement, but I can't
> remember it off the top of my head) of the time, the resource utilization
> is
> low enough and ZooKeeper is lightweight enough that it makes sense to
> co-locate. Occasionally, however, we do see a spike in an individual
> machine's utilization. Even more occasionally, that spike can result in
> clients being disconnected from that ZooKeeper node. Since almost all the
> remainder of the cluster is reachable and appropriately utilized, clients
> typically reconnect to another node, and all is well. Of course, with this
> arrangement, there is a small chance that the entire cluster could blow out
> in usage, which would result in catastrophic ZooKeeper failure. This hasn't
> happened yet, but I'm sure it's only a matter of time. However, if this
> were
> to happen, the co-located services would also be at the tipping point, and
> would probably crash soon after. Since that system is mission-critical, we
> have lots of monitoring in place, as well as a couple of spare nodes to
> throw at the cluster in case the overall usage is high. So, in short, it
> isn't terribly surprising that we might occasionally get a connection loss
> from ZooKeeper, but in almost all cases, that connection loss can be
> resolved very quickly.
> What can NOT be resolved quickly is our leader failing. This is not
> catastrophic, but it does negatively impact performance. Therefore, I wish
> to avoid needlessly creating external connections. If we shut down the
> external connection upon every connection loss, we would pay a price in
> minutes what could easily be resolved in milliseconds. If, however, we
> wait,
> keeping a connection open, knowing that we are still probabilistically the
> leader, we can continue to receive events. Once the leadership has been
> passed on, however, we should shut down to avoid excess connections.
> What's puzzling to me here is the client expectation. On the one hand,
> there
> is a SessionExpired Event, which clearly indicates that a session has
> expired. However, the session only expires if ZooKeeper fails to hear from
> you within a specified time frame. This specified time is agreed upon, and
> is known by both ZK and the client. Therefore, one expectation would be
> that
> the SessionExpired event would be fired when that timeframe has been
> exhausted before a connection with ZK could be reestablished. On the other
> hand, Session Expired events only happen AFTER a reconnection to the
> cluster. This is strange to me, because it seems that the client has all
> the
> information it needs to know that its session has been expired by
> ZooKeeper, and it doesn't really need to first talk to the Server to pass
> that information back to the application. It's made particularly strange to
> me when I think that, for most cases, a failure to talk to ZooKeeper within
> the session timeout probably means that you aren't going to be able to
> reconnect with it for a lot longer yet, and the client is probably going to
> just keep failing. If you were talking about total machine failure, then who
> cares? The application is dead anyway. But if you're talking about a
> network
> partition, where the node could conceivably still do its job successfully,
> and is only prevented from doing so by a connection to ZooKeeper, as in my
> use case, you want that information in a reasonably timely manner.
> Judging from my questions, the relative number of questions about
> SessionExpiration versus disconnect that have been answered, and the fact
> that Twitter and Travis (and presumably others as well) have felt the need
> to implement this behavior outside of the system, it seems like I'm not the
> only one confused about this. So my question is: Why does the ZooKeeper
> client do notification only upon reconnect? Is there a
> performance/consistency reason that I'm not seeing?
> Thanks, and sorry for the uber-long message,
> Scott
> On Thu, Apr 21, 2011 at 4:47 PM, Ted Dunning <ted.dunning@gmail.com>
> wrote:
> > Scott,
> >
> > Having your master enter a suspended state is fine, but it cannot act as
> > master during this time (because somebody else may have become master
> > during
> > this time).
> >
> > It is fine to enter a suspended mode, but the suspended master cannot
> > commit
> > to any actions as a master.  Any transactions that it accepts must be
> > considered as expendable.  Usually that means that whoever sent the
> > transactions must retain them until the suspended master regains its
> senses
> > or relinquishes its master state.
> >
> > The other question that comes up from your description is how your ZK
> > cluster works.  Do you have zookeeper split across data centers?
> >
> > On Thu, Apr 21, 2011 at 1:45 PM, Scott Fines <scottfines@gmail.com>
> wrote:
> >
> > > Ryan,
> > >
> > > That is a fair point in that I would have consistency of services--that
> > is,
> > > that I would be pretty sure I'd only have one service running at a
> time.
> > > However, my particular application demands are such that just stopping
> > and
> > > re-starting on disconnected events is not a good idea.
> > >
> > > What I'm writing is a connector between two data centers, where the
> > > measured
> > > latency is on the order of seconds, and each time a service connects,
> it
> > > must transfer (hopefully only a few) megabytes of data, which I've
> > measured
> > > to take on the order of minutes. On the other hand, it is not unusual
> for
> > > us
> > > to receive a disconnected event every now and then, which is generally
> > > resolved on the order of milliseconds. Clearly, I don't want to
> recreate
> > a
> > > minutes-long process every time we get a milliseconds-long
> disconnection
> > > which does not remove the service's existing leadership.
> > >
> > > So, when the leader receives a disconnected event, it queues up events
> to
> > > process, but holds on to its connections and continues to receive
> events
> > > while it waits for a connection to ZK to be re-established. If the
> > > connection to ZK comes back online within the session timeout window,
> > then
> > > it will just turn processing back on as if nothing happened. However,
> if
> > > the
> > > session timeout happens, then the client must cut all of its
> connections
> > > and
> > > kill itself with fire, rather than overwrite what the next leader does.
> > > Then
> > > the next leader has to go through the expensive process of starting the
> > > service back up.
> > >
> > > Hopefully that will give some color for why I'm concerned about this
> > > situation.
> > >
> > > Thanks,
> > >
> > > Scott
> > >
> > > On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy <rckenned@gmail.com>
> > wrote:
> > >
> > > > Scott:
> > > >
> > > >  the right answer in this case is for the leader to watch for the
> > > > "disconnected" event and shut down. If the connection re-establishes,
> > > > the leader should still be the leader (their ephemeral sequential
> node
> > > > should still be there), in which case it can go back to work. If the
> > > > connection doesn't re-establish, one of two things may happen…
> > > >
> > > > 1) Your leader stays in the disconnected state (because it's unable
> to
> > > > reconnect), meanwhile the zookeeper server expires the session
> > > > (because it hasn't seen a heartbeat), deletes the ephemeral
> sequential
> > > > node and a new worker is promoted to leader.
> > > >
> > > > 2) Your leader quickly transitions to the expired state, the
> ephemeral
> > > > node is lost and a new worker is promoted to leader.
> > > >
> > > > In both cases, your initial leader should see a disconnected event
> > > > first. If it shuts down when it sees that event, you should be
> > > > relatively safe in thinking that you only have one worker going at a
> > > > time.
> > > >
> > > > Once your initial leader sees the expiration event, it can try to
> > > > reconnect to the ensemble, create the new ephemeral sequential node
> > > > and get back into the queue for being a leader.
> > > >
> > > > Ryan
> > > >
> > >
> >
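
The election pattern Ryan describes upthread (create an ephemeral sequential node; the lowest one leads) relies on ZooKeeper appending a zero-padded 10-digit counter to sequential znodes, so the lexicographically smallest child of the election parent is the leader. The decision step can be sketched self-contained (node names here are hypothetical):

```java
import java.util.Collections;
import java.util.List;

public class Election {
    // Given the children of the election znode, the smallest name leads.
    public static String leaderOf(List<String> children) {
        return Collections.min(children);
    }

    public static boolean isLeader(String myNode, List<String> children) {
        return myNode.equals(leaderOf(children));
    }
}
```

In the real client the children would come from `ZooKeeper.getChildren()`, and a non-leader would watch the node just below its own rather than the leader itself, to avoid a herd effect when the leader dies.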
