From: Qian Zhang
Date: Thu, 23 May 2019 09:26:57 +0800
Subject: Re: Why does ZooKeeper follower shutdown itself when it can not read from leader
To: user@zookeeper.apache.org,
dev@zookeeper.apache.org

Hi Andor,

I am using ZooKeeper release 3.4.10. I checked the code: if the follower fails to read from the leader (e.g., on a read timeout), it closes the socket; see
https://github.com/apache/zookeeper/blob/release-3.4.10/src/java/main/org/apache/zookeeper/server/quorum/Follower.java#L91:L85
for details. Once the socket is closed, the follower's writes fail as well (I guess the same socket is used here). That failure is treated as a severe unrecoverable error, and the follower is then shut down; see
https://github.com/apache/zookeeper/blob/release-3.4.10/src/java/main/org/apache/zookeeper/server/quorum/FollowerRequestProcessor.java#L90:L95
and
https://github.com/apache/zookeeper/blob/release-3.4.10/src/java/main/org/apache/zookeeper/server/ZooKeeperCriticalThread.java#L48:L51 .

So it seems that shutting down the follower when it cannot read from the leader is the designed behavior? Or, if my understanding is wrong, could you please let me know what the designed behavior is in this case?

Thanks!

Regards,
Qian Zhang

On Wed, May 22, 2019 at 8:52 AM Qian Zhang wrote:

> Does anyone have any ideas?
>
> Regards,
> Qian Zhang
>
>
> On Sun, May 19, 2019 at 6:15 PM Qian Zhang wrote:
>
>> Hi,
>>
>> I have a ZooKeeper cluster which has 5 nodes.
>> Today the leader could not be reached due to a hardware issue, and then I
>> found that the 4 followers just shut down. Here are the logs:
>>
>>> May 18 15:34:28 MD001076 java[29148]: [myid:1] WARN [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@89] - Exception when following the leader
>>> java.net.SocketTimeoutException: Read timed out
>>>     at java.net.SocketInputStream.socketRead0(Native Method)
>>>     at java.net.SocketInputStream.socketRead(SocketInputStream.java:116)
>>>     at java.net.SocketInputStream.read(SocketInputStream.java:171)
>>>     at java.net.SocketInputStream.read(SocketInputStream.java:141)
>>>     at java.io.BufferedInputStream.fill(BufferedInputStream.java:246)
>>>     at java.io.BufferedInputStream.read(BufferedInputStream.java:265)
>>>     at java.io.DataInputStream.readInt(DataInputStream.java:387)
>>>     at org.apache.jute.BinaryInputArchive.readInt(BinaryInputArchive.java:63)
>>>     at org.apache.zookeeper.server.quorum.QuorumPacket.deserialize(QuorumPacket.java:83)
>>>     at org.apache.jute.BinaryInputArchive.readRecord(BinaryInputArchive.java:99)
>>>     at org.apache.zookeeper.server.quorum.Learner.readPacket(Learner.java:153)
>>>     at org.apache.zookeeper.server.quorum.Follower.followLeader(Follower.java:85)
>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:937)
>>> May 18 15:34:28 MD001076 java[29148]: [myid:1] INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:NIOServerCnxnFactory@192] - Accepted socket connection from /10.249.255.10:42306
>>> May 18 15:34:28 MD001076 java[29148]: [myid:1] WARN [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@896] - Connection request from old client /10.249.255.10:42306; will be dropped if server is in r-o mode
>>> May 18 15:34:28 MD001076 java[29148]: [myid:1] INFO [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:2181:ZooKeeperServer@942] - Client attempting to establish new session at /10.249.255.10:42306
>>> May 18 15:34:28 MD001076 java[29148]: [myid:1] ERROR [FollowerRequestProcessor:1:ZooKeeperCriticalThread@49] - Severe unrecoverable error, from thread : FollowerRequestProcessor:1
>>> java.net.SocketException: Socket closed
>>>     at java.net.SocketOutputStream.socketWrite(SocketOutputStream.java:118)
>>>     at java.net.SocketOutputStream.write(SocketOutputStream.java:155)
>>>     at java.io.BufferedOutputStream.flushBuffer(BufferedOutputStream.java:82)
>>>     at java.io.BufferedOutputStream.flush(BufferedOutputStream.java:140)
>>>     at org.apache.zookeeper.server.quorum.Learner.writePacket(Learner.java:139)
>>>     at org.apache.zookeeper.server.quorum.Learner.request(Learner.java:188)
>>>     at org.apache.zookeeper.server.quorum.FollowerRequestProcessor.run(FollowerRequestProcessor.java:90)
>>> May 18 15:34:28 MD001076 java[29148]: [myid:1] INFO [QuorumPeer[myid=1]/0:0:0:0:0:0:0:0:2181:Follower@166] - shutdown called
>>> java.lang.Exception: shutdown Follower
>>>     at org.apache.zookeeper.server.quorum.Follower.shutdown(Follower.java:166)
>>>     at org.apache.zookeeper.server.quorum.QuorumPeer.run(QuorumPeer.java:941)
>>
>>
>> I am confused about why all followers shut down in this case, which makes the
>> whole ZooKeeper cluster unusable for a short period. Shouldn't they elect a
>> new leader instead? Thanks!
>>
>>
>> Regards,
>> Qian Zhang
>>
>
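
For reference, the sequence described at the top of this message (a read that times out, the socket being closed, then a write failing on that same socket) can be reproduced outside ZooKeeper with plain java.net sockets. This is only a minimal sketch and not ZooKeeper code: the class name ReadTimeoutThenWriteDemo, the method reproduce, and the 200 ms timeout are made up for illustration. Both roles run on one thread here, whereas in the logs the read and the write happen on different threads (QuorumPeer and FollowerRequestProcessor).

```java
import java.io.BufferedOutputStream;
import java.io.DataInputStream;
import java.io.IOException;
import java.io.OutputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;

// Hypothetical demo class; it does not use any ZooKeeper API.
class ReadTimeoutThenWriteDemo {

    /**
     * Returns the class names of the two exceptions observed:
     * [0] from the timed-out read, [1] from the write after the socket was closed.
     */
    static String[] reproduce(int readTimeoutMs) {
        String readFailure = "none";
        String writeFailure = "none";

        // silentPeer stands in for an unreachable leader: it accepts the
        // connection but never sends any data.
        try (ServerSocket silentPeer = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket sock = new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort());
             Socket accepted = silentPeer.accept()) {

            sock.setSoTimeout(readTimeoutMs);
            DataInputStream in = new DataInputStream(sock.getInputStream());
            OutputStream out = new BufferedOutputStream(sock.getOutputStream());

            // "Reader" role: blocks until the timeout expires, then closes the socket.
            try {
                in.readInt();
            } catch (IOException e) {
                readFailure = e.getClass().getName();
                sock.close();
            }

            // "Writer" role: still holds the output stream of the now-closed socket.
            // The buffered write succeeds; the flush reaches the socket and fails.
            try {
                out.write(new byte[] {1, 2, 3, 4});
                out.flush();
            } catch (IOException e) {
                writeFailure = e.getClass().getName();
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return new String[] {readFailure, writeFailure};
    }

    public static void main(String[] args) {
        String[] result = reproduce(200);
        System.out.println("read failed with:  " + result[0]);
        System.out.println("write failed with: " + result[1]);
    }
}
```

The sketch only shows that a write through a stream obtained before close() fails once the socket has been closed by the reader. Whether ZooKeeper should treat that failure as fatal is the open question of this thread.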