zookeeper-user mailing list archives

From Kathleen Ting <kathl...@cloudera.com>
Subject Re: leap second excitement
Date Tue, 24 Jul 2012 20:51:58 GMT
Patrick, agreed. I've seen other threads referencing this one and
thought I would follow up with what I've learned since.

Due to a missed function call in the Linux timekeeping code, the leap
second was not accounted for properly. As a result, after the leap second,
timers expired one second earlier than requested. Many applications use a
recurring timer of 1 second or less; such timers expired immediately,
causing the application to immediately try to set another timer, ad
infinitum. This infinite loop led to CPU load spikes.
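
To make the loop concrete, here is a minimal Python sketch of the behavior described above. It is a toy simulation, not the kernel code or any application's actual timer logic: the function name `wakeups`, its parameters, and the `max_wakeups` safety cap are all made up for illustration. It only assumes, as stated above, that each timer fires one second earlier than requested.

```python
def wakeups(requested_s, skew_s, window_s, max_wakeups=1000):
    """Count timer wakeups in a simulated window of window_s seconds.

    Hypothetical model: each timer fires after max(0, requested_s - skew_s)
    seconds, i.e. skew_s seconds earlier than the application asked for.
    max_wakeups is a safety cap so the simulation itself cannot spin forever.
    """
    now = 0.0
    count = 0
    while now < window_s and count < max_wakeups:
        actual_wait = max(0.0, requested_s - skew_s)
        now += actual_wait  # simulated time advances only by the real wait
        count += 1          # the application wakes up and re-arms the timer
    return count


# A 1-second recurring timer with no skew: one wakeup per second.
healthy = wakeups(requested_s=1.0, skew_s=0.0, window_s=10.0)

# The same timer firing 1 second early: the wait is zero, simulated time
# never advances, and only the safety cap stops the loop.
buggy = wakeups(requested_s=1.0, skew_s=1.0, window_s=10.0)

print(healthy, buggy)
```

In this model, a timer of 1 second or less never lets time advance, which is the busy loop behind the CPU spikes. A longer timer (say 2 seconds) still fires early but does not spin.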

In case of interest, we wrote a blog post detailing it:

Regards, Kathleen

On Mon, Jul 2, 2012 at 9:36 AM, Patrick Hunt <phunt@apache.org> wrote:

> Thanks for the report, Scott. From what I've seen so far this seems to
> be a Linux bug and not specific to Java/ZK. Here are a couple of the
> more informative links I've seen:
> http://hackerne.ws/item?id=4188412
> http://blog.mozilla.org/it/2012/06/30/mysql-and-the-leap-second-high-cpu-and-the-fix
> Does anyone have specific insight into how this expressed itself in Java?
> I've seen some references to futex being the root cause (from the Java
> perspective): "It's a critical Linux bug that causes futex to timeout,
> and anything that uses it to behave incorrectly."
> Patrick
> On Sun, Jul 1, 2012 at 2:58 PM, Scott Fines <Scott.Fines@nisc.coop> wrote:
> > Hello all,
> >
> > It appears that ZooKeeper is subject to the Linux leap second bug that
> > has caused problems with Cassandra and other services. At least, I
> > discovered that after 6 hours of trying to figure out why my cluster wasn't
> > giving me a quorum.
> >
> > A link to the kernel bug report is at
> > https://git.kernel.org/?p=linux/kernel/git/torvalds/linux-2.6.git;a=commit;h=6b43ae8a619d17c4935c3320d2ef9e92bdeed05d
> >
> > As for what you might see in your logs, I saw a lost quorum and insanely
> > high load on my servers. When I shut down ZooKeeper to bring it back
> > up, one machine would report a read timeout during leader election, then
> > report that the server told it to shut down. After that, it would forever
> > be stuck in the LOOKING phase, while another machine might be stuck in any
> > other phase of the election.
> >
> > The fix is simple, though. Just stop ZooKeeper, execute
> >
> > date -s "`date`"
> >
> > or restart your ntp daemon, then start ZooKeeper back up.
> >
> > You MUST restart ZooKeeper; otherwise, the election state doesn't
> > recover (or, at least, it didn't recover for me).
> >
> > Hope this helps save someone else the 7 hours of agony I just went
> > through.
> >
> > Scott Fines
