zookeeper-user mailing list archives

From Scott Fines <Scott.Fi...@nisc.coop>
Subject leap second excitement
Date Sun, 01 Jul 2012 21:58:17 GMT
Hello all,

It appears that ZooKeeper is subject to the Linux leap second bug that has caused problems
with Cassandra and other services. At least, I discovered that after 6 hours of trying to
figure out why my cluster wasn't giving me a quorum.

A link to the kernel bug report is at https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d

As for what you might see in your logs: I saw a lost quorum and extremely high load on my
servers. When I shut down ZooKeeper to bring it back up, one machine would report a read timeout
during leader election, then report that the server told it to shut down. After that, it would
stay stuck in the LOOKING phase forever, while another machine might be stuck in any other phase
of the election.

The fix is simple, though. Just stop ZooKeeper, execute

date -s "`date`"

or restart your NTP daemon, then start ZooKeeper back up.

You MUST restart ZooKeeper; otherwise the election state doesn't recover (or, at least, it
didn't recover for me).
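
For anyone who wants to script this across several machines, here is a rough sketch of the
stop / reset clock / start sequence described above. It is only an illustration: the
ZK_SERVER path is a guess at where zkServer.sh lives on your install, and the DRY_RUN switch
and function names are made up for this example. By default it only prints what it would do.

```shell
#!/bin/sh
# Sketch of the recovery sequence: stop ZooKeeper, reset the clock, start ZooKeeper.
# ASSUMPTION: ZK_SERVER is a hypothetical install path; point it at your own zkServer.sh.
ZK_SERVER="${ZK_SERVER:-/opt/zookeeper/bin/zkServer.sh}"
# ASSUMPTION: DRY_RUN is an illustrative switch; set DRY_RUN=0 (as root) to really run.
DRY_RUN="${DRY_RUN:-1}"

# Print the command in dry-run mode, otherwise execute it.
run() {
    if [ "$DRY_RUN" = "1" ]; then
        echo "would run: $*"
    else
        "$@"
    fi
}

# Each step runs only if the previous one succeeded.
recover_from_leap_second() {
    run "$ZK_SERVER" stop &&
    run date -s "$(date)" &&
    run "$ZK_SERVER" start
}

recover_from_leap_second
```

Run it once as-is to check the three commands it prints, then rerun with DRY_RUN=0 as root
on each affected server. Setting the date needs root privileges, and date -s here assumes
GNU date as found on Linux.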

Hope this helps save someone else the 7 hours of agony I just went through.

Scott Fines
