From notifications-return-2196-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org  Wed Oct  9 13:54:59 2019
Return-Path: <notifications-return-2196-archive-asf-public=cust-asf.ponee.io@zookeeper.apache.org>
X-Original-To: archive-asf-public@cust-asf.ponee.io
Delivered-To: archive-asf-public@cust-asf.ponee.io
Received: from mail.apache.org (hermes.apache.org [207.244.88.153])
	by mx-eu-01.ponee.io (Postfix) with SMTP id 72FB7180645
	for <archive-asf-public@cust-asf.ponee.io>; Wed,  9 Oct 2019 15:54:59 +0200 (CEST)
Received: (qmail 74977 invoked by uid 500); 9 Oct 2019 13:54:58 -0000
Mailing-List: contact notifications-help@zookeeper.apache.org; run by ezmlm
Precedence: bulk
List-Help: <mailto:notifications-help@zookeeper.apache.org>
List-Unsubscribe: <mailto:notifications-unsubscribe@zookeeper.apache.org>
List-Post: <mailto:notifications@zookeeper.apache.org>
List-Id: <notifications.zookeeper.apache.org>
Reply-To: dev@zookeeper.apache.org
Delivered-To: mailing list notifications@zookeeper.apache.org
Received: (qmail 74964 invoked by uid 99); 9 Oct 2019 13:54:58 -0000
Received: from ec2-52-202-80-70.compute-1.amazonaws.com (HELO gitbox.apache.org) (52.202.80.70)
    by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 09 Oct 2019 13:54:58 +0000
From: GitBox <git@apache.org>
To: notifications@zookeeper.apache.org
Subject: [GitHub] [zookeeper] anmolnar edited a comment on issue #1048:
 ZOOKEEPER-3188: Improve resilience to network
Message-ID: <157062929876.7081.6980071632254453505.gitbox@gitbox.apache.org>
Date: Wed, 09 Oct 2019 13:54:58 -0000
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 8bit

anmolnar edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-540011376
 
 
   I uploaded the logs of the failing Follower here: https://pastebin.com/LsXYiRKt
   
   It was running on a Mac and the situation was as previously described:
   1. two interfaces were running: wifi and cable,
   2. the cable was unplugged,
   3. wifi was disabled and the cable was plugged back in.
   
   After the 3rd step we had to wait approximately 1 minute for the quorum to come back up. We believe this happened because, at the first exception:
   ```
   2019-10-09 13:49:43,744 [myid:1] - WARN  [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Follower@127] - Exception when following the leader
   java.net.SocketTimeoutException: Read timed out
   ```
   the Follower shuts down and restarts leader election, but `QuorumCnxManager` still believes the connections are up. After about a minute it finally gets a SocketException here:
   ```
   2019-10-09 13:50:37,709 [myid:1] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@1336] - Connection broken for id 3, my id = 1, error =
   java.net.SocketException: Operation timed out (Read failed)
   ```
   and shuts down all Send/Recv workers. The delay occurs because the read timeout on that socket is infinite, which keeps the leader election port from being shut down when no traffic is transmitted. By this point leader election had raised the notification timeout to approx. 1 minute, so we have to wait quite a long time for notifications to be resent.
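   
   To illustrate the read-timeout behaviour described above, here is a minimal standalone sketch using plain `java.net` sockets on loopback. It is not ZooKeeper code: the class `ReadTimeoutSketch` and the method `readTimesOut` are hypothetical names, and the 200 ms value is an arbitrary choice for the demo. With a finite `SO_TIMEOUT`, a read on a silent connection fails quickly with `SocketTimeoutException`; with a timeout of 0 (infinite) the same read would block until the OS itself reports the connection as broken.
   
   ```java
   import java.io.IOException;
   import java.io.UncheckedIOException;
   import java.net.InetAddress;
   import java.net.ServerSocket;
   import java.net.Socket;
   import java.net.SocketTimeoutException;
   
   public class ReadTimeoutSketch {
   
       /**
        * Opens a loopback connection whose peer never writes and reports
        * whether a blocking read gives up after soTimeoutMs milliseconds.
        * Do not pass 0 here: 0 means "infinite" and the read would never return.
        */
       static boolean readTimesOut(int soTimeoutMs) {
           try (ServerSocket server = new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
                Socket client = new Socket(server.getInetAddress(), server.getLocalPort());
                Socket peer = server.accept()) {
               client.setSoTimeout(soTimeoutMs);
               try {
                   client.getInputStream().read(); // the peer stays silent
                   return false;                   // data or EOF arrived (not expected here)
               } catch (SocketTimeoutException e) {
                   return true;                    // read gave up after soTimeoutMs
               }
           } catch (IOException e) {
               throw new UncheckedIOException(e);
           }
       }
   
       public static void main(String[] args) {
           System.out.println("read timed out: " + readTimesOut(200));
       }
   }
   ```
   
   The trade-off is the one mentioned above: a finite timeout detects a dead peer sooner, but it also fires on a healthy connection that is merely idle.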
   
   If only a single node is failing, the quorum is still up, so I believe it's not a big deal. But if we think about an entire switch failure, which could shut down the entire ensemble at the same time, this could be too long to recover.
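   
   To put the recovery delay in numbers, the following sketch models a notification timeout that doubles after every unanswered wait until it hits a cap. The starting value (200 ms), the cap (60 s) and the doubling rule are assumptions made for illustration, as are the names `NotificationBackoffSketch`, `nextTimeout` and `schedule`; check `FastLeaderElection` in the version you run for the actual values.
   
   ```java
   import java.util.ArrayList;
   import java.util.List;
   
   public class NotificationBackoffSketch {
       // Assumed values, not taken from the logs above.
       static final int INITIAL_TIMEOUT_MS = 200;
       static final int MAX_TIMEOUT_MS = 60_000;
   
       /** Doubles the current timeout, never exceeding the cap. */
       static int nextTimeout(int currentMs) {
           long doubled = 2L * currentMs; // long avoids int overflow
           return (int) Math.min(doubled, MAX_TIMEOUT_MS);
       }
   
       /** Returns the first `rounds` timeouts, starting from the initial value. */
       static List<Integer> schedule(int rounds) {
           List<Integer> result = new ArrayList<>();
           int timeout = INITIAL_TIMEOUT_MS;
           for (int i = 0; i < rounds; i++) {
               result.add(timeout);
               timeout = nextTimeout(timeout);
           }
           return result;
       }
   
       public static void main(String[] args) {
           System.out.println(schedule(12));
       }
   }
   ```
   
   Under these assumptions the cap is reached on the tenth wait, so a peer that has been retrying for a while can sit idle for up to a minute before it resends notifications.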

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services