From notifications-return-2196-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org Wed Oct 9 13:54:59 2019
Return-Path:
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153]) by mx-eu-01.ponee.io (Postfix) with SMTP id 72FB7180645 for ; Wed, 9 Oct 2019 15:54:59 +0200 (CEST)
Received: (qmail 74977 invoked by uid 500); 9 Oct 2019 13:54:58 -0000
Mailing-List: contact notifications-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
List-Help:
List-Unsubscribe:
List-Post:
List-Id:
Reply-To: dev@zookeeper.apache.org
Delivered-To: mailing list notifications@zookeeper.apache.org
Received: (qmail 74964 invoked by uid 99); 9 Oct 2019 13:54:58 -0000
Received: from ec2-52-202-80-70.compute-1.amazonaws.com (HELO gitbox.apache.org) (52.202.80.70) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Oct 2019 13:54:58 +0000
From: GitBox
To: notifications@zookeeper.apache.org
Subject: [GitHub] [zookeeper] anmolnar edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
Message-ID: <157062929876.7081.6980071632254453505.gitbox@gitbox.apache.org>
Date: Wed, 09 Oct 2019 13:54:58 -0000
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

anmolnar edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-540011376

I uploaded the logs of the failing Follower here: https://pastebin.com/LsXYiRKt

It was running on a Mac and the situation was as previously described:

1. two interfaces were running: wifi and cable,
2. cable unplugged,
3. wifi got disabled, cable plugged in

After the 3rd step we had to wait approximately 1 minute for the quorum to get up again.

We believe that it was because at the first exception:

```
2019-10-09 13:49:43,744 [myid:1] - WARN [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Follower@127] - Exception when following the leader
java.net.SocketTimeoutException: Read timed out
```

the Follower shuts down and restarts the leader election, but `QuorumCnxManager` still believes the connections are up. After a minute it finally gets a SocketException here:

```
2019-10-09 13:50:37,709 [myid:1] - WARN [RecvWorker:3:QuorumCnxManager$RecvWorker@1336] - Connection broken for id 3, my id = 1, error =
java.net.SocketException: Operation timed out (Read failed)
```

and shuts down all Send/Recv workers. It takes this long because the read timeout on that socket is infinite, which prevents the leader election port from being shut down when no traffic is transmitted.

By this point the leader election has raised the notification timeout to approx. 1 minute, so we have to wait quite long for notifications to be resent.

If only a single node is failing, the quorum is still up, so I believe it's not a big deal. But if we think about an entire switch failure, which could shut down the entire ensemble at the same time, this could be too long to recover.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the URL above to go to the specific comment.

For queries about this service, please contact Infrastructure at: users@infra.apache.org

With regards,
Apache Git Services
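
As a rough illustration of the read-timeout behaviour described in the comment above, here is a minimal standalone Java sketch. It is not ZooKeeper code: the class `ReadTimeoutDemo` and the method `readTimesOut` are hypothetical names invented for this example, and the 200 ms value is arbitrary. It only shows that a finite `SO_TIMEOUT` makes a blocked read give up with `SocketTimeoutException`, whereas a value of 0 (which `Socket.setSoTimeout` treats as infinite) would leave the read waiting until the operating system itself reports the connection as broken.

```java
import java.io.IOException;
import java.io.InputStream;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

// Hypothetical demo class, not part of ZooKeeper.
public class ReadTimeoutDemo {

    /**
     * Connects to a local peer that never sends anything and tries to read one byte.
     * Returns true if the read gave up with SocketTimeoutException.
     * Do not call this with 0: a zero timeout means "wait forever",
     * so the read would block for as long as the silent peer stays open.
     */
    public static boolean readTimesOut(int soTimeoutMillis) throws IOException {
        InetAddress loopback = InetAddress.getLoopbackAddress();
        try (ServerSocket server = new ServerSocket(0, 1, loopback);
             Socket client = new Socket(loopback, server.getLocalPort());
             Socket accepted = server.accept()) {
            // Finite read timeout on the client side of the connection.
            client.setSoTimeout(soTimeoutMillis);
            InputStream in = client.getInputStream();
            try {
                in.read();      // the peer stays silent, so this blocks
                return false;   // data or end-of-stream arrived before the timeout
            } catch (SocketTimeoutException e) {
                return true;    // the read was abandoned after soTimeoutMillis
            }
        }
    }

    public static void main(String[] args) throws IOException {
        System.out.println("finite timeout fired: " + readTimesOut(200));
    }
}
```

With a finite timeout the caller learns about a silent peer within a bounded time, at the cost of also timing out on connections that are merely idle, which is the trade-off the comment attributes to the leader election socket.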