zookeeper-notifications mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [zookeeper] symat commented on issue #1048: ZOOKEEPER-3188: Improve resilience to network
Date Wed, 09 Oct 2019 13:38:33 GMT
symat commented on issue #1048: ZOOKEEPER-3188: Improve resilience to network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-540004845
   In my last commit I uploaded a fix for the BindException issue @anmolnar found (I implemented
his proposal in the Leader's constructor). I also modified a unit test to cover this case
as well.
   We did some manual testing with @anmolnar on the latest version. The patch works: we can
now unplug and reconnect the different cables / wifi and the quorum survives.
However, recovery is rather slow (around 1 minute). Running the same tests with linux in
docker, with virtual networks and interfaces (using the same config), recovery is much
faster (~10-15 seconds). It looks like in the docker/linux test the socket in
`QuorumCnxManager.RecvWorker` dies much sooner, with `SocketException: Socket closed`,
while in the same test with real mac notebooks the same socket dies later, with
`SocketException: Operation timed out (Read failed)`.
   @anmolnar and I think we found a way to detect the failure more quickly in the second
case, but that still needs to be tested. I will work on this later (although I think this
might have a lower priority; we could even close this PR without this optimization).
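
   For illustration only, here is a minimal standalone sketch of one general way a blocked
socket read can be made to fail sooner than the OS-level TCP timeout: a read timeout set
with `Socket.setSoTimeout`. This is not necessarily the approach mentioned above, and it
does not use any ZooKeeper code; the class name `ReadTimeoutSketch`, the method
`readTimesOut` and the 200 ms value are hypothetical and chosen only for the example.

```java
import java.io.IOException;
import java.io.InputStream;
import java.io.UncheckedIOException;
import java.net.InetAddress;
import java.net.ServerSocket;
import java.net.Socket;
import java.net.SocketTimeoutException;

public class ReadTimeoutSketch {

    /**
     * Connects to a local peer that never sends data and reports whether the
     * blocked read was ended by the configured read timeout.
     * (Hypothetical helper, not part of ZooKeeper.)
     */
    static boolean readTimesOut(int timeoutMillis) {
        // The listening socket never calls accept(); the OS still completes the
        // TCP handshake, so the client sees a connected but silent peer.
        try (ServerSocket silentPeer =
                     new ServerSocket(0, 1, InetAddress.getLoopbackAddress());
             Socket socket =
                     new Socket(silentPeer.getInetAddress(), silentPeer.getLocalPort())) {
            // Upper bound, in milliseconds, for any single blocking read().
            socket.setSoTimeout(timeoutMillis);
            InputStream in = socket.getInputStream();
            try {
                in.read(); // blocks because the peer stays silent
                return false;
            } catch (SocketTimeoutException e) {
                // The stalled read is noticed after timeoutMillis.
                return true;
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        System.out.println("read timed out: " + readTimesOut(200));
    }
}
```

   With a read timeout like this, a receiver thread learns about a silent peer after a
bounded delay and can close and re-establish the connection, at the cost of having to
tolerate timeouts on links that are merely idle.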
   I think the upgrade / TLS / kerberos related manual tests are more important at the moment.

This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
For queries about this service, please contact Infrastructure at:

With regards,
Apache Git Services
