zookeeper-user mailing list archives

From Jerry Hebert <jerry.heb...@gmail.com>
Subject Re: One node crashing in 3.4.11 triggered a full ensemble restart
Date Wed, 02 Oct 2019 19:40:29 GMT
Hi Jörn,

No, this was a very intermittent issue. We've been running this ensemble
for about four years now and have never seen this problem so it seems to be
super heisenbuggy. Our upgrade process will be more involved than what you
described (we're switching networks, instance types, underlying automation
and removing Exhibitor) but I'm glad you asked because I have a question
about that too. :)

Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
wasn't sure if that would work or not. e.g., maybe I could bring up the new
3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?

Thanks,
Jerry

On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <jornfranke@gmail.com> wrote:

> Have you tried stopping the node, deleting the data and log directories,
> upgrading to 3.5.5, starting the node, and waiting until it is synchronized?
>
> > On 02.10.2019 at 20:14, Jerry Hebert <jerry.hebert@gmail.com> wrote:
> >
> > Hi all,
> >
> > My first post here! I'm hoping you all might be able to offer some
> > guidance or redirect me to an existing ticket. We have a five-node
> > ensemble on 3.4.11 that we're currently in the process of upgrading to
> > 3.5.5. We recently saw some bizarre behavior in our ensemble that I was
> > hoping to find some sort of pre-existing ticket or discussion about, but
> > I was having difficulty finding hits for this in Jira.
> >
> > The behavior that we saw from our metrics is that one of our nodes (not
> > sure if it was a follower or a leader) started to demonstrate
> > instability (high CPU, high RAM) and it crashed. Not a big deal, but as
> > soon as it crashed, the other four nodes all immediately restarted,
> > resulting in a short outage. One node crashing should never cause an
> > ensemble restart of course, so I assumed that this must be a bug in ZK.
> > The nodes that restarted had no indication of errors in their logs; they
> > simply restarted. Does this sound familiar to any of you?
> >
> > Also, we are using Exhibitor on that ensemble so it's also possible that
> > the restart was caused by Exhibitor.
> >
> > My hope is that this issue will be behind us once the 3.5.5 upgrade is
> > complete but I'd ideally like to find some concrete evidence of this.
> >
> > Thanks!
> > Jerry
>
