zookeeper-notifications mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [zookeeper] anmolnar edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
Date Wed, 09 Oct 2019 13:54:58 GMT
anmolnar edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-540011376
 
 
   I uploaded the logs of the failing Follower here: https://pastebin.com/LsXYiRKt
   
   It was running on a Mac and the situation was as previously described:
   1. 2 interfaces were running: wifi and cable,
   2. cable plugged out,
   3. wifi got disabled, cable plugged in
   
   After the 3rd step we had to wait approximately 1 minute for the quorum to come back up.
We believe this was because, at the first exception:
   ```
   2019-10-09 13:49:43,744 [myid:1] - WARN  [QuorumPeer[myid=1](plain=[0:0:0:0:0:0:0:0]:2181)(secure=disabled):Follower@127]
- Exception when following the leader
   java.net.SocketTimeoutException: Read timed out
   ```
   the Follower shuts down and restarts the leader election, but `QuorumCnxManager` still believes
the connections are up. After about a minute it finally gets a SocketException here:
   ```
   2019-10-09 13:50:37,709 [myid:1] - WARN  [RecvWorker:3:QuorumCnxManager$RecvWorker@1336]
- Connection broken for id 3, my id = 1, error =
   java.net.SocketException: Operation timed out (Read failed)
   ```
   and shuts down all Send/Recv workers. The delay occurs because the read timeout on that socket is
infinite, which prevents the leader election connection from being shut down when no traffic is
transmitted. By this point the leader election has raised the notification timeout to approx.
1 minute, so we have to wait quite a long time for notifications to be resent.
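
As a rough illustration of how the notification timeout can grow to about a minute, here is a minimal sketch of a doubling backoff with a cap. The class and method names are hypothetical, and the 200 ms starting value and 60000 ms cap are assumptions made for this example, not values taken from the logs above.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical helper, not part of ZooKeeper: models a wait time that doubles
// after every round in which no notification arrives, up to a fixed cap.
public class NotificationBackoff {
    // Assumed values, for illustration only.
    static final int INITIAL_TIMEOUT_MS = 200;
    static final int MAX_TIMEOUT_MS = 60_000;

    // Doubles the current timeout, never exceeding maxMs.
    static int nextTimeout(int currentMs, int maxMs) {
        return (int) Math.min((long) currentMs * 2, maxMs);
    }

    // Returns the wait used in each of the first `rounds` empty rounds.
    static List<Integer> schedule(int initialMs, int maxMs, int rounds) {
        List<Integer> waits = new ArrayList<>();
        int timeout = initialMs;
        for (int i = 0; i < rounds; i++) {
            waits.add(timeout);
            timeout = nextTimeout(timeout, maxMs);
        }
        return waits;
    }

    public static void main(String[] args) {
        // Prints the successive waits, one per empty round.
        System.out.println(schedule(INITIAL_TIMEOUT_MS, MAX_TIMEOUT_MS, 10));
    }
}
```

Under these assumptions, once a peer has sat through enough empty rounds, each further wait is a full minute, which would be consistent with the recovery delay we observed.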
   
   If only a single node is failing, the quorum is still up, so I believe it's not a big deal.
But if we think about an entire switch failure, which could shut down the entire ensemble at
the same time, this could be too long to recover.

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
