zookeeper-notifications mailing list archives

From GitBox <...@apache.org>
Subject [GitHub] [zookeeper] symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
Date Mon, 12 Aug 2019 11:50:12 GMT
symat edited a comment on issue #1048: ZOOKEEPER-3188: Improve resilience to network
URL: https://github.com/apache/zookeeper/pull/1048#issuecomment-520381529
 
 
   I created a simple Docker config with multiple virtual networks and managed to test the
situation when some of the containers lose access to one of the virtual networks. I uploaded
the Docker-related scripts / configs here: https://github.com/symat/zookeeper-docker-test
   
   During these manual tests I found some situations in which the previous patch didn't work:

   - The InitialMessage sent during leader election contained only a single election address.
If this address was not reachable by the recipient of the InitialMessage, the connection
was never successfully initiated. I changed the format of the InitialMessage to send all the
election addresses, and the other side will use only one that is reachable.
   - When an existing TCP connection to an electionAddress is broken, the server will try
to send notification messages re-using the existing SendWorker threads. I would assume that
the SendWorker.send() method should die when it tries to flush the output stream on a socket
whose destination is already unreachable. However, for some reason it doesn't die (this could
be investigated further). Anyway, I added some logic to the connection initiation to verify
whether the existing destination in SendWorker is still reachable. If the destination is
unreachable in the SendWorker thread, we can gracefully finish the thread, and during the next
connection attempt we will choose a destination that is reachable. (I fixed this part in a
second commit.)
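
   The two points above can be illustrated with a minimal Java sketch. It is not the code
from the patch: the class name, the method names and the pipe-separated "host:port|host:port"
encoding are all assumptions made for illustration, and the reachability probe is just a plain
TCP connect with a timeout. It shows a sender advertising every election address and a
receiver picking the first one it can reach.

```java
import java.io.IOException;
import java.net.InetSocketAddress;
import java.net.Socket;
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;
import java.util.function.Predicate;
import java.util.stream.Collectors;

// Hypothetical helper, not part of ZooKeeper.
class ElectionAddressSketch {

    // Sender side: advertise every election address.
    // Assumed format: "host:port|host:port" (illustrative only, no IPv6 literals).
    static String encode(List<InetSocketAddress> addresses) {
        return addresses.stream()
                .map(a -> a.getHostString() + ":" + a.getPort())
                .collect(Collectors.joining("|"));
    }

    // Receiver side: parse the advertised addresses without resolving them yet.
    static List<InetSocketAddress> decode(String encoded) {
        List<InetSocketAddress> result = new ArrayList<>();
        for (String part : encoded.split("\\|")) {
            int colon = part.lastIndexOf(':');
            if (colon <= 0) {
                throw new IllegalArgumentException("Malformed address: " + part);
            }
            String host = part.substring(0, colon);
            int port = Integer.parseInt(part.substring(colon + 1));
            result.add(InetSocketAddress.createUnresolved(host, port));
        }
        return result;
    }

    // Simple probe: the address counts as reachable if a TCP connect
    // succeeds within the timeout. Any I/O error counts as unreachable.
    static boolean canConnect(InetSocketAddress address, int timeoutMs) {
        try (Socket socket = new Socket()) {
            socket.connect(
                    new InetSocketAddress(address.getHostString(), address.getPort()),
                    timeoutMs);
            return true;
        } catch (IOException e) {
            return false;
        }
    }

    // Return the first candidate accepted by the probe, in list order.
    static Optional<InetSocketAddress> firstReachable(
            List<InetSocketAddress> candidates,
            Predicate<InetSocketAddress> probe) {
        return candidates.stream().filter(probe).findFirst();
    }
}
```

   A receiver could call firstReachable(decode(payload), a -> canConnect(a, 500)), and the
same probe could be run against the destination of an existing worker before re-using it.
The probe is passed in as a Predicate so that the selection logic can be tested without a
real network.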
   
   With these modifications I was able to test the following situation successfully:
   
   1. starting a ZooKeeper ensemble in which each server listens on two addresses (on two
separate virtual networks)
   2. waiting for the initial leader election to happen
   3. removing the current leader from the virtual network that is used by the others as destination
   4. it took a few seconds until all the servers recognised the loss of connections, and
in 5-10 seconds the connections were re-established and the new leader election finished
   
   I will think about how to unit test these features (or should we create some Docker-based
automated integration test?).
   
   In the meantime I would appreciate a deep review of these changes, as I am quite new
to the ZooKeeper code...

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services
