zookeeper-user mailing list archives

From Jerry Hebert <jerry.heb...@gmail.com>
Subject One node crashing in 3.4.11 triggered a full ensemble restart
Date Wed, 02 Oct 2019 18:05:06 GMT
Hi all,

My first post here! I'm hoping you all might be able to offer some guidance
or redirect me to an existing ticket. We have a five node ensemble on
3.4.11 that we're currently in the process of upgrading to 3.5.5. We
recently saw some bizarre behavior in our ensemble. I was hoping to find a
pre-existing ticket or discussion about it, but I had difficulty finding
hits for this in Jira.

The behavior that we saw from our metrics is that one of our nodes (not
sure if it was a follower or a leader) started to demonstrate
instability (high CPU, high RAM) and it crashed. Not a big deal, but as
soon as it crashed, the other four nodes all immediately restarted,
resulting in a short outage. Of course, one node crashing should never cause
an ensemble restart, so I assumed that this must be a bug in ZK. The
nodes that restarted had no indication of errors in their logs; they
simply restarted. Does this sound familiar to any of you?

Also, we are using Exhibitor on that ensemble so it's also possible that
the restart was caused by Exhibitor.

My hope is that this issue will be behind us once the 3.5.5 upgrade is
complete, but ideally I'd like to find some concrete evidence of this.

Thanks!
Jerry
