zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: leap second excitement
Date Mon, 02 Jul 2012 16:36:12 GMT
Thanks for the report, Scott. From what I've seen so far, this seems to
be a Linux bug and not specific to Java/ZK. Here are a couple of the
more informative links I've seen:

Does anyone have specific insight into how this expressed itself in Java?
I've seen some references to futex being the root cause (from the Java
perspective): "It's a critical Linux bug that causes futex to timeout,
and anything that uses it to behave incorrectly."


On Sun, Jul 1, 2012 at 2:58 PM, Scott Fines <Scott.Fines@nisc.coop> wrote:
> Hello all,
> It appears that ZooKeeper is subject to the Linux leap second bug that has caused problems
> with Cassandra and other services. At least, I discovered that after 6 hours of trying to
> figure out why my cluster wasn't giving me a quorum.
> A link to the kernel bug report is at https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
> As for what you might see in your logs, I saw a lost quorum, insanely high load on
> my servers, and when I shut down ZooKeeper to bring it back up, one machine would report a
> read timeout during leader election, then report that the server told it to shut down. After
> that, it would forever be stuck in the LOOKING phase, while another machine might be stuck
> in any other phase of the election.
> The fix is simple, though. Just stop ZooKeeper, execute
> date -s "`date`"
> or restart your NTP daemon, then start ZooKeeper back up.
> You MUST restart ZooKeeper; otherwise, the election state doesn't recover (or, at least,
> it didn't recover for me).
> Hope this helps save someone else the 7 hours of agony I just went through.
> Scott Fines
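
For reference, the workaround Scott describes could be scripted roughly as
below. This is only a sketch under stated assumptions: it assumes ZooKeeper
is controlled with zkServer.sh on the PATH (the thread does not say how the
servers were started), the run helper and the DRY_RUN variable are
hypothetical names added for illustration, and setting the clock requires
root. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the workaround quoted above. Assumptions (not from the thread):
# zkServer.sh is on PATH, and the caller is root when DRY_RUN=0.

# run: print the command when DRY_RUN is 1 (the default), otherwise execute it.
run() {
    if [ "${DRY_RUN:-1}" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

leap_second_workaround() {
    run zkServer.sh stop        # stop ZooKeeper first
    run date -s "$(date)"       # re-set the clock to its current value
    run zkServer.sh start       # start ZooKeeper again so the election restarts
}

leap_second_workaround
```

Running it with DRY_RUN=0 as root performs the three steps for real;
restarting the NTP daemon instead of the date step is the alternative
Scott mentions.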
