zookeeper-user mailing list archives

From Ted Dunning <ted.dunn...@gmail.com>
Subject Re: Zookeeper on short lived VMs and ZOOKEEPER-107
Date Fri, 16 Mar 2012 15:51:57 GMT
On Fri, Mar 16, 2012 at 9:56 AM, Christian Ziech

> Under normal circumstances the ability to detect failures correctly should
> be given. The scenario I'm aware of includes one zookeeper system would be
> taken down for a reason and then possibly just rebooted or even started
> from scratch elsewhere. In both cases however the new host would have the
> old dns name but most likely a different IP. But of course that only
> applies to us and possibly not to all of the users.

This is a bizarre way to start a post on HA considerations.

Detecting failures is always subject to errors.  You can make the detection
process less broken, but there is a core uncertainty that is inherent in
the problem.  You can bias the detection process toward false positives or
false negatives, but you can't completely get rid of either kind of error
without substantially increasing the total number of errors.

Most people bias strongly toward false negatives (system is marked as up,
but is down) if only because the system impact of false positives can be
quite high and because the cost of pushing toward faster detection of
failures can also be very high (consider what it would mean to have pings
every 100ms... the server under test would have to be re-designed from the
ground up with hard real-time principles in mind).

Given this context, all HA designs have to account for erroneous marking of

This is closely related to the CAP theorem.  There, the whole point is that
you can't really distinguish between being cut off from the system in
question and that system being down.  In practice, your uncertainty is
even worse than that.

So you really have to design around a statement that the failure detection
system will have a (1-epsilon_1) probability of being correct when it marks
systems as down and that it will have a (1-epsilon_2) probability of
detecting failures within t_1 seconds.  Furthermore, the probability of
detecting failures should smoothly transition to (1-epsilon_3) within t_2
seconds.  For heartbeat-based systems where n heartbeats must be lost,
epsilon_1 is pretty small, but distinctly non-zero, epsilon_2 and t_1 are 1
and n-1 ticks respectively, epsilon_3 is on the order of epsilon_1,
and t_2 is somewhere near n ticks.  This implies that you cannot detect
failures in less than a certain amount of time and that you will still miss
some failures.  When I am designing, I try to avoid assuming that epsilon_1
and epsilon_3 are less than about 0.1%.
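
To make the heartbeat case concrete, here is a minimal sketch of a detector
that marks a peer as down after n consecutive missed heartbeats.  The class
and function names are made up for illustration, and the false alarm
estimate assumes heartbeat losses are independent, which real networks do
not guarantee.

```python
class HeartbeatDetector:
    """Marks a peer as down after n consecutive missed heartbeats."""

    def __init__(self, n):
        self.n = n
        self.missed = 0

    def observe(self, heartbeat_received):
        """Record one tick; return True if the peer is now considered down."""
        if heartbeat_received:
            self.missed = 0
        else:
            self.missed += 1
        return self.missed >= self.n


def ticks_until_marked_down(detector, heartbeats):
    """Return the 1-based tick at which the peer is first marked down, or None."""
    for tick, received in enumerate(heartbeats, start=1):
        if detector.observe(received):
            return tick
    return None


def false_alarm_probability(loss_probability, n):
    """Chance that n given consecutive heartbeats from a healthy peer are all
    lost, assuming independent losses (a simplifying assumption)."""
    return loss_probability ** n


# A peer that really crashed: every heartbeat is missing from tick 1 onward.
crashed = ticks_until_marked_down(HeartbeatDetector(3), [False] * 10)

# A healthy peer behind a flaky link: three heartbeats in a row get lost.
flaky = ticks_until_marked_down(
    HeartbeatDetector(3), [True, False, False, True, False, False, False]
)
```

With n = 3 the crashed peer cannot be marked down before tick 3, and the
healthy peer on the flaky link is marked down anyway at tick 7.  Both kinds
of error come from the same rule.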

If you factor a model like this into your design, you inherently no
longer make statements like "failure is impossible".  Instead, you say
failure has probability < p of occurring in t seconds.  If you persist in
the former, you will be very wrong much of the time and will be unable to
optimize the correct function of your system, nor recognize what is
happening when it does fail (as it will).
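
As a rough illustration of why "failure is impossible" does not survive
contact with a model like this, the sketch below computes the chance of at
least one wrong detection decision over many decisions.  It assumes each
decision errs independently with the same probability epsilon, which is a
simplification, and the function name is hypothetical.

```python
def prob_at_least_one_error(epsilon, decisions):
    """Probability that at least one of `decisions` independent detection
    decisions is wrong, given a per-decision error probability epsilon."""
    return 1.0 - (1.0 - epsilon) ** decisions


# With epsilon = 0.1% per decision, errors become likely over enough decisions.
over_a_thousand = prob_at_least_one_error(0.001, 1000)
```

Under these assumptions, a 0.1% error rate per decision gives roughly a 63%
chance of at least one error in a thousand decisions.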

> - Same scenarios as you described - nodes A with host name a, B host name b
> and C with host name c
> - Also same as in your scenario C is due to some human error falsely
> detected as down. Hence C' is brought up and is assigned the same DNS name
> as C
> - Now rolling restarts are performed to bring in C'
> - A resolves c correctly to the new IP and connects to C' but B still
> resolves the host name c to the original address of C and hence does not
> connect (I think some DNS slowness is also required for your approach in
> order for the host name c being resolved to the original IP of C)

This is hardly surprising given DNS timeouts and caching.  Consider what
would happen if B has C cached and is partitioned away from the DNS server.
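
A toy model of the caching effect, as a sketch only: it uses an in-memory
dictionary in place of a real DNS server, and the resolver class, the TTL
value, and the IP addresses are all invented for illustration.

```python
class CachingResolver:
    """Toy resolver with a per-name TTL cache (hypothetical, no real DNS)."""

    def __init__(self, authoritative, ttl):
        self.authoritative = authoritative  # shared dict: name -> IP
        self.ttl = ttl                      # seconds a cached answer stays valid
        self.cache = {}                     # name -> (ip, expires_at)

    def resolve(self, name, now):
        """Return the cached IP if still fresh, else look it up and cache it."""
        cached = self.cache.get(name)
        if cached is not None and now < cached[1]:
            return cached[0]
        ip = self.authoritative[name]
        self.cache[name] = (ip, now + self.ttl)
        return ip


records = {"c": "10.0.0.3"}            # made-up address for the original C
node_a = CachingResolver(records, ttl=300)
node_b = CachingResolver(records, ttl=300)

node_b.resolve("c", now=0)             # B caches the original address of C
records["c"] = "10.0.0.99"             # C' comes up under the same name
a_view = node_a.resolve("c", now=10)   # A has no cached entry, so it sees C'
b_view = node_b.resolve("c", now=10)   # B is still inside its TTL, so it sees C
```

Here A and B disagree about who "c" is until B's cached entry expires, with
nothing broken beyond ordinary caching.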

> - now the rest of your scenario happens: Update U is applied, C' gets
> slow, C recovers and A fails.
> Of course also this approach requires some DNS craziness but if I did not
> make a mistake in my thoughts it should still be possible.

This isn't craziness.  This is reality.  And frankly, you are assuming that
A and B are even being served by the same DNS server.  My experience is
that DNS is messed up at an astonishing percentage of otherwise very
sophisticated installations.  You are assuming that DNS can handle a task
(fast updates) that most systems do not assume that it can do.  Note that
simply making the claim "my DNS is not messed up" is, to me, actually
evidence that your DNS is broken somehow.  Most of the admins of
correctly operating DNS say "we have had broken DNS in the past, tell me
what you need and I will check".

> PS: Wouldn't your scenario not also invalidate the solution of the hbase
> guys using amazons elastic ips to solve the same problem (see
> https://issues.apache.org/jira/browse/HBASE-2327
> )?

Don't think so.  Those guys are replacing the IP address itself so all
traffic inherently moves to the new machine.  There can be a short window
of misdirection, but elastic IPs work very well.  Moreover, you can firmly
take down the original on EC2 and you can release the IP manually, so the
remaining errors are almost entirely cases where the server in question simply
cannot be reached rather than there being uncertainty about which is being
