zookeeper-user mailing list archives

From Enrico Olivelli <eolive...@gmail.com>
Subject Re: One node crashing in 3.4.11 triggered a full ensemble restart
Date Thu, 03 Oct 2019 13:58:57 GMT
I think it is possible to perform a rolling upgrade from 3.4. All of my
customers migrated a year ago without any issues reported to my team.

Norbert, where did you find that information?

By the way, I would like to set up tests for backward compatibility, both
server-to-server and client-to-server.

Enrico

On Thu, Oct 3, 2019 at 15:16, Jörn Franke <jornfranke@gmail.com> wrote:

> I only tried from 3.4.14, and there it was possible. I recommend upgrading
> first to the latest 3.4 version and then to 3.5.
>
> > On 02.10.2019 at 21:40, Jerry Hebert <jerry.hebert@gmail.com> wrote:
> >
> > Hi Jörn,
> >
> > No, this was a very intermittent issue. We've been running this ensemble
> > for about four years now and have never seen this problem, so it seems to
> > be super heisenbuggy. Our upgrade process will be more involved than what
> > you described (we're switching networks, instance types, underlying
> > automation and removing Exhibitor) but I'm glad you asked because I have a
> > question about that too. :)
> >
> > Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> > wasn't sure if that would work or not. E.g., maybe I could bring up the new
> > 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> > five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?
> >
> > Thanks,
> > Jerry
> >
> >> On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <jornfranke@gmail.com> wrote:
> >>
> >> Have you tried stopping the node, deleting the data and log directories,
> >> upgrading to 3.5.5, starting the node and waiting until it is synchronized?
> >>
> >>>> On 02.10.2019 at 20:14, Jerry Hebert <jerry.hebert@gmail.com> wrote:
> >>>
> >>> Hi all,
> >>>
> >>> My first post here! I'm hoping you all might be able to offer some
> >>> guidance or redirect me to an existing ticket. We have a five-node
> >>> ensemble on 3.4.11 that we're currently in the process of upgrading to
> >>> 3.5.5. We recently saw some bizarre behavior in our ensemble that I was
> >>> hoping to find some sort of pre-existing ticket or discussion about, but I
> >>> was having difficulty finding hits for this in Jira.
> >>>
> >>> The behavior that we saw from our metrics is that one of our nodes (not
> >>> sure if it was a follower or a leader) started to demonstrate
> >>> instability (high CPU, high RAM) and it crashed. Not a big deal, but as
> >>> soon as it crashed, the other four nodes all immediately restarted,
> >>> resulting in a short outage. One node crashing should never cause an
> >>> ensemble restart of course, so I assumed that this must be a bug in ZK.
> >>> The nodes that restarted had no indication of errors in their logs; they
> >>> just simply restarted. Does this sound familiar to any of you?
> >>>
> >>> Also, we are using Exhibitor on that ensemble so it's also possible that
> >>> the restart was caused by Exhibitor.
> >>>
> >>> My hope is that this issue will be behind us once the 3.5.5 upgrade is
> >>> complete but I'd ideally like to find some concrete evidence of this.
> >>>
> >>> Thanks!
> >>> Jerry
> >>
>
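
The "wait until it is synchronized" step suggested in the thread could be checked with ZooKeeper's four-letter-word command `srvr`, which reports the server's current mode. The following is only a minimal sketch and not something the posters described: the helper names (`four_letter_word`, `parse_mode`, `is_serving`), the host name and the port are assumptions made for illustration. It also assumes `srvr` is permitted by the server's four-letter-word whitelist on 3.5.x. A reported mode of `follower` or `leader` is treated here as a sign that the node has joined the quorum; it does not prove that every client-visible detail is caught up.

```python
import socket


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a four-letter-word command to a ZooKeeper server and return its reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_mode(reply):
    """Return the value of the 'Mode:' line in a srvr reply, or None if absent."""
    for line in reply.splitlines():
        if line.startswith("Mode:"):
            return line.split(":", 1)[1].strip()
    return None


def is_serving(reply):
    """Treat the node as part of the quorum if it reports a quorum role."""
    return parse_mode(reply) in {"leader", "follower", "observer"}
```

For example, `is_serving(four_letter_word("zk-new-1.example.com", 2181))` (hypothetical host, default client port assumed) could be polled after restarting the upgraded node, moving on to the next node only once it returns `True`.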
