References:
 <4C79EF08-B5F8-4518-B3EE-D93390276CFE@gmail.com>
In-Reply-To: <4C79EF08-B5F8-4518-B3EE-D93390276CFE@gmail.com>
From: Enrico Olivelli
Date: Thu, 3 Oct 2019 15:58:57 +0200
Subject: Re: One node crashing in 3.4.11 triggered a full ensemble restart
To: UserZooKeeper

I think it is possible to perform a rolling upgrade from 3.4; all of my
customers migrated one year ago without any issue (reported to my team).

Norbert, where did you find that information?

By the way, I would like to set up tests for backward compatibility, both
server-to-server and client-to-server.

Enrico

On Thu, Oct 3, 2019 at 15:16, Jörn Franke wrote:

> I tried only from 3.4.14 and there it was possible. I recommend first
> upgrading to the latest 3.4 version and then to 3.5.
>
> > On 02.10.2019 at 21:40, Jerry Hebert wrote:
> >
> > Hi Jörn,
> >
> > No, this was a very intermittent issue. We've been running this ensemble
> > for about four years now and have never seen this problem, so it seems to be
> > super heisenbuggy. Our upgrade process will be more involved than what you
> > described (we're switching networks, instance types, underlying automation
> > and removing Exhibitor) but I'm glad you asked because I have a question
> > about that too. :)
> >
> > Are you saying that a 3.5.5 node can synchronize with a 3.4.11 ensemble? I
> > wasn't sure if that would work or not. e.g., maybe I could bring up the new
> > 3.5.5 ensemble and temporarily form a 10-node ensemble (five 3.4.11 nodes,
> > five 3.5.5 nodes), let them sync and then kill off the old 3.4.11 boxes?
> >
> > Thanks,
> > Jerry
> >
> >> On Wed, Oct 2, 2019 at 12:29 PM Jörn Franke wrote:
> >>
> >> Have you tried to stop the node, delete the data and log directory,
> >> upgrade to 3.5.5, start the node and wait until it is synchronized?
> >>
> >>>> On 02.10.2019 at 20:14, Jerry Hebert wrote:
> >>>
> >>> Hi all,
> >>>
> >>> My first post here! I'm hoping you all might be able to offer some guidance
> >>> or redirect me to an existing ticket. We have a five node ensemble on
> >>> 3.4.11 that we're currently in the process of upgrading to 3.5.5. We
> >>> recently saw some bizarre behavior in our ensemble that I was hoping to
> >>> find some sort of pre-existing ticket or discussion about but I was having
> >>> difficulty finding hits for this in Jira.
> >>>
> >>> The behavior that we saw from our metrics is that one of our nodes (not
> >>> sure if it was a follower or a leader) started to demonstrate
> >>> instability (high CPU, high RAM) and it crashed. Not a big deal, but as
> >>> soon as it crashed, all of the other four nodes immediately restarted,
> >>> resulting in a short outage. One node crashing should never cause an
> >>> ensemble restart of course, so I assumed that this must be a bug in ZK. The
> >>> nodes that restarted had no indication of errors in their logs, they just
> >>> simply restarted. Does this sound familiar to any of you?
> >>>
> >>> Also, we are using Exhibitor on that ensemble so it's also possible that
> >>> the restart was caused by Exhibitor.
> >>>
> >>> My hope is that this issue will be behind us once the 3.5.5 upgrade is
> >>> complete but I'd ideally like to find some concrete evidence of this.
> >>>
> >>> Thanks!
> >>> Jerry
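
Editor's sketch, not part of the original thread: the per-node procedure quoted above ends with "wait until it is synchronized", but the thread does not say how to check that. One possible way, assuming the `srvr` four-letter command is reachable on the client port (it may need to be whitelisted on 3.5.x), is to read the `Mode:` line of its reply. The function names and the host below are hypothetical, chosen only for illustration.

```python
import socket


def four_letter_word(host: str, port: int, cmd: str = "srvr",
                     timeout: float = 5.0) -> str:
    """Send a ZooKeeper four-letter command and return the raw text reply."""
    with socket.create_connection((host, port), timeout=timeout) as sock:
        sock.sendall(cmd.encode("ascii"))
        chunks = []
        while True:
            data = sock.recv(4096)
            if not data:  # the server closes the connection after replying
                break
            chunks.append(data)
    return b"".join(chunks).decode("utf-8", errors="replace")


def parse_srvr(reply: str) -> dict:
    """Collect the 'Key: value' lines of a srvr reply into a dict."""
    fields = {}
    for line in reply.splitlines():
        key, sep, value = line.partition(":")
        if sep:  # skip lines without a colon, e.g. the "not serving" notice
            fields[key.strip()] = value.strip()
    return fields


def is_serving(reply: str) -> bool:
    """Assumption: a node that reports a quorum role has rejoined the ensemble."""
    return parse_srvr(reply).get("Mode") in ("leader", "follower", "observer")
```

A rolling-upgrade script could poll `is_serving(four_letter_word("zk1.example.internal", 2181))` (hypothetical host) after restarting each node and move on to the next node only once it returns `True`. A reported role does not by itself prove that the data is fully caught up, so comparing the `Zxid` field across nodes may be a useful extra check.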