zookeeper-user mailing list archives

From Chris Nauroth <cnaur...@hortonworks.com>
Subject Re: Interesting elastic/ZK post
Date Mon, 09 May 2016 17:12:18 GMT
I always sympathize with a major outage report, but on the bright side, it
was very satisfying to hear the ZooKeeper cluster had sustained uptime for
3 years.  That agrees with my own user experience.  It's often the most
stable component of a distributed infrastructure (as it needs to be).

As for potential improvements, I was wondering if it would make sense
to introduce something like Hadoop's JvmPauseMonitor [1].  It is a
background thread that attempts to detect GC churn and logs warnings about
it.  It has been very helpful in diagnosing NameNode misconfigurations
that lead to GC churn.
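
To make the idea concrete, here is a minimal sketch of such a pause-detecting
thread.  It is not Hadoop's actual JvmPauseMonitor code and not an existing
ZooKeeper class: the class name PauseMonitorSketch, its methods, and the
threshold values are all hypothetical.  The thread sleeps for a short fixed
interval and measures how much longer than requested the sleep really took;
a large overshoot suggests the whole JVM was stalled, and the GC time
reported by the standard GarbageCollectorMXBean is logged alongside it as a
hint about the cause.

```java
import java.lang.management.GarbageCollectorMXBean;
import java.lang.management.ManagementFactory;
import java.util.concurrent.TimeUnit;
import java.util.logging.Logger;

// Hypothetical sketch of a JVM pause detector; names and thresholds are assumptions.
final class PauseMonitorSketch implements Runnable {
    private static final Logger LOG =
            Logger.getLogger(PauseMonitorSketch.class.getName());

    private final long sleepMs;          // how long each probe sleep should last
    private final long warnThresholdMs;  // overshoot that triggers a warning
    private volatile boolean running = true;

    PauseMonitorSketch(long sleepMs, long warnThresholdMs) {
        this.sleepMs = sleepMs;
        this.warnThresholdMs = warnThresholdMs;
    }

    // Time spent beyond the requested sleep, in milliseconds (never negative).
    static long extraPauseMs(long startNanos, long endNanos, long sleepMs) {
        long elapsedMs = TimeUnit.NANOSECONDS.toMillis(endNanos - startNanos);
        return Math.max(0L, elapsedMs - sleepMs);
    }

    // Cumulative GC time across all collectors, as reported by the JVM.
    static long totalGcTimeMs() {
        long total = 0L;
        for (GarbageCollectorMXBean gc : ManagementFactory.getGarbageCollectorMXBeans()) {
            long t = gc.getCollectionTime(); // -1 when the collector does not report it
            if (t > 0) {
                total += t;
            }
        }
        return total;
    }

    @Override
    public void run() {
        while (running) {
            long gcBefore = totalGcTimeMs();
            long start = System.nanoTime();
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
                return;
            }
            long extra = extraPauseMs(start, System.nanoTime(), sleepMs);
            if (extra > warnThresholdMs) {
                long gcDelta = totalGcTimeMs() - gcBefore;
                LOG.warning(String.format(
                        "Detected pause of approximately %d ms (GC time in window: %d ms)",
                        extra, gcDelta));
            }
        }
    }

    void stop() {
        running = false;
    }
}
```

A server could start this as a daemon thread, for example with
new Thread(new PauseMonitorSketch(500, 1000), "pause-monitor"), so the
warning lands in the same log as everything else.  The overshoot also grows
when the host is swapping or starved for CPU, so a warning points at a pause
in general, not at GC with certainty.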

This wouldn't have prevented the problem for the Elastic Cloud team, but at
least it would have made the root cause more visible.  A warning about GC
churn could have been shown in the main ZooKeeper log instead of having to
be found in a separate GC log or inferred from other sources like JMX.

[1] https://s.apache.org/4sdx

--Chris Nauroth




On 5/8/16, 7:37 PM, "Patrick Hunt" <phunt@apache.org> wrote:

>Interesting root cause and mitigations discussion.
>
>https://www.elastic.co/blog/elastic-cloud-outage-april-2016
>
>Patrick

