zookeeper-user mailing list archives

From Norbert Kalmar <nkal...@cloudera.com.INVALID>
Subject Re: One node crashing in 3.4.11 triggered a full ensemble restart
Date Thu, 03 Oct 2019 08:45:41 GMT
Hi,

Here are the issues we encountered so far upgrading to 3.5.5 from 3.4:
https://cwiki.apache.org/confluence/display/ZOOKEEPER/Upgrade+FAQ

As Enrico mentioned, we have seen nothing similar so far. One issue is that
no snapshot has been taken yet; the other is that 4-letter-word commands need
to be whitelisted.
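
A minimal sketch of a pre-upgrade check for the two issues above, assuming the
usual zoo.cfg keys `4lw.commands.whitelist` and `snapshot.trust.empty`, and
assuming that an unset whitelist allows only `srvr`. The function names and the
default list of commands to check are hypothetical, chosen for illustration:

```python
def parse_zoo_cfg(text):
    """Parse key=value lines from zoo.cfg-style text, skipping comments."""
    settings = {}
    for raw in text.splitlines():
        line = raw.strip()
        if not line or line.startswith("#") or "=" not in line:
            continue
        key, value = line.split("=", 1)
        settings[key.strip()] = value.strip()
    return settings


def upgrade_warnings(settings, needed_4lw=("srvr", "stat", "ruok")):
    """Return human-readable warnings for the two known upgrade issues."""
    warnings = []

    # Assumption: when the key is absent, only "srvr" is whitelisted.
    raw = settings.get("4lw.commands.whitelist", "srvr")
    allowed = {w.strip() for w in raw.split(",") if w.strip()}
    if "*" not in allowed:
        missing = [w for w in needed_4lw if w not in allowed]
        if missing:
            warnings.append(
                "4lw commands not whitelisted: " + ", ".join(missing)
            )

    # Only matters if the data directory has no snapshot file yet.
    if settings.get("snapshot.trust.empty", "false").lower() != "true":
        warnings.append(
            "snapshot.trust.empty is not true; a server without any "
            "snapshot may refuse to start after the upgrade"
        )
    return warnings
```

The check is deliberately read-only: it reports what may need attention and
leaves the decision to the operator. As far as the Upgrade FAQ describes it,
`snapshot.trust.empty` is meant as a temporary setting for the upgrade itself.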

As for running a mixed quorum of 3.5 and 3.4 nodes - I'm afraid it will
not work. From 3.5 we have a check on PROTOCOL_VERSION. 3.4 did not have
this protocol version, so when the nodes try to communicate it will throw
an exception. Plus, it is not a goal to keep the quorum protocol backward
compatible, so chances are it would not work even without the check.
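
To confirm which release line each server runs before and after a cutover,
something like the following could be used. It is only a sketch: it assumes
the `srvr` four-letter word is whitelisted on every server and that its output
contains a line starting with `Zookeeper version:`. The helper names are
hypothetical:

```python
import re
import socket


def four_letter_word(host, port, cmd="srvr", timeout=5.0):
    """Send a four-letter word to a server and return its text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_version(srvr_output):
    """Extract (major, minor, patch) from srvr output, or None if absent."""
    match = re.search(
        r"^Zookeeper version:\s*(\d+)\.(\d+)\.(\d+)",
        srvr_output,
        re.MULTILINE,
    )
    if match is None:
        return None
    return tuple(int(group) for group in match.groups())


def release_lines(versions):
    """Return the set of (major, minor) release lines among the versions."""
    return {version[:2] for version in versions}
```

For example, collecting `parse_version(four_letter_word(host, 2181))` for each
server and passing the results to `release_lines` shows a mixed ensemble as a
set with more than one entry. Port 2181 here is the conventional client port,
not something stated in this thread.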

Regards,
Norbert

On Thu, Oct 3, 2019 at 12:09 AM Enrico Olivelli <eolivelli@gmail.com> wrote:

> On Wed, Oct 2, 2019, 22:52 Jerry Hebert <jerry.hebert@gmail.com> wrote:
>
> > Hi Enrico,
> >
> > The nodes that restarted did not have any errors in their logs, they
> > seemed to simply restart successfully so I think your hunch about the
> > external system is probably correct.
> >
> > Could you comment on my second question above regarding cross-version
> > migration or should I make a new thread?
> >
>
>
> I am not aware of any issue about an upgrade from 3.4 to 3.5 similar to
> your case. It is expected to work.
>
> Enrico
>
>
> > > Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble?
> > > I wasn't sure if that would work or not. e.g., maybe I could bring up
> > > the new 3.5.5 ensemble and temporarily form a 10-node ensemble (five
> > > 3.4.11 nodes, five 3.5.5 nodes), let them sync and then kill off the
> > > old 3.4.11 boxes?
> >
> >
> > Thanks!
> > Jerry
> >
> > On Wed, Oct 2, 2019 at 1:12 PM Enrico Olivelli <eolivelli@gmail.com> wrote:
> >
> > > Any particular error/stacktrace in the logs?
> > > If ZooKeeper is killing itself, it should log that; otherwise some
> > > other external system is responsible. I am sorry, I don't know Exhibitor.
> > >
> > > Hope that helps
> > > Enrico
> > >
> > > On Wed, Oct 2, 2019, 21:40 Jerry Hebert <jerry.hebert@gmail.com> wrote:
> > >
> > > > Hi Jörn,
> > > >
> > > > No, this was a very intermittent issue. We've been running this
> > > > ensemble for about four years now and have never seen this problem so
> > > > it seems to be super heisenbuggy. Our upgrade process will be more
> > > > involved than what you described (we're switching networks, instance
> > > > types, underlying automation and removing Exhibitor) but I'm glad you
> > > > asked because I have a question about that too. :)
> > > >
> > > > Are you saying that a 3.5.5 node can synchronize with a 3.4.11
> > > > ensemble? I wasn't sure if that would work or not. e.g., maybe I could
> > > > bring up the new 3.5.5 ensemble and temporarily form a 10-node ensemble
> > > > (five 3.4.11 nodes, five 3.5.5 nodes), let them sync and then kill off
> > > > the old 3.4.11 boxes?
> > > >
> > > > Thanks,
> > > > Jerry
> > > >
> > > > On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke <jornfranke@gmail.com> wrote:
> > > >
> > > > > Have you tried to stop the node, delete the data and log directory,
> > > > > upgrade to 3.5.5, start the node and wait until it is synchronized?
> > > > >
> > > > > > On 02.10.2019 at 20:14, Jerry Hebert <jerry.hebert@gmail.com> wrote:
> > > > > >
> > > > > > Hi all,
> > > > > >
> > > > > > > My first post here! I'm hoping you all might be able to offer some
> > > > > > > guidance or redirect me to an existing ticket. We have a five node
> > > > > > > ensemble on 3.4.11 that we're currently in the process of upgrading
> > > > > > > to 3.5.5. We recently saw some bizarre behavior in our ensemble that
> > > > > > > I was hoping to find some sort of pre-existing ticket or discussion
> > > > > > > about but I was having difficulty finding hits for this in Jira.
> > > > > >
> > > > > > > The behavior that we saw from our metrics is that one of our nodes
> > > > > > > (not sure if it was a follower or a leader) started to demonstrate
> > > > > > > instability (high CPU, high RAM) and it crashed. Not a big deal, but
> > > > > > > as soon as it crashed, all of the other four nodes all immediately
> > > > > > > restarted, resulting in a short outage. One node crashing should
> > > > > > > never cause an ensemble restart of course, so I assumed that this
> > > > > > > must be a bug in ZK. The nodes that restarted had no indication of
> > > > > > > errors in their logs, they just simply restarted. Does this sound
> > > > > > > familiar to any of you?
> > > > > >
> > > > > > > Also, we are using Exhibitor on that ensemble so it's also possible
> > > > > > > that the restart was caused by Exhibitor.
> > > > > >
> > > > > > > My hope is that this issue will be behind us once the 3.5.5 upgrade
> > > > > > > is complete but I'd ideally like to find some concrete evidence of
> > > > > > > this.
> > > > > >
> > > > > > Thanks!
> > > > > > Jerry
> > > > >
> > > >
> > >
> >
>
