From: Jerry Hebert
Date: Wed, 2 Oct 2019 11:05:06 -0700
Subject: One node crashing in 3.4.11 triggered a full ensemble restart
To: user@zookeeper.apache.org

Hi all,

My first post here! I'm hoping you all might be able to offer some guidance or redirect me to an existing ticket. We have a five-node ensemble on 3.4.11 that we're currently in the process of upgrading to 3.5.5. We recently saw some bizarre behavior in our ensemble. I was hoping to find some sort of pre-existing ticket or discussion about it, but I had difficulty finding hits for it in Jira.

The behavior that we saw from our metrics is that one of our nodes (not sure if it was a follower or a leader) started to demonstrate instability (high CPU, high RAM) and then crashed. Not a big deal, but as soon as it crashed, the other four nodes all immediately restarted, resulting in a short outage. One node crashing should never cause an ensemble restart, of course, so I assumed that this must be a bug in ZK. The nodes that restarted had no indication of errors in their logs; they simply restarted. Does this sound familiar to any of you?

Also, we are using Exhibitor on that ensemble, so it's also possible that the restart was caused by Exhibitor. My hope is that this issue will be behind us once the 3.5.5 upgrade is complete, but I'd ideally like to find some concrete evidence of this.

Thanks!
Jerry