zookeeper-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From GitBox <...@apache.org>
Subject [GitHub] [zookeeper] vladimirivic commented on issue #953: ZOOKEEPER-3398 Learner.connectToLeader() may take too long to time-out
Date Tue, 21 May 2019 22:28:07 GMT
vladimirivic commented on issue #953: ZOOKEEPER-3398 Learner.connectToLeader() may take too
long to time-out 
URL: https://github.com/apache/zookeeper/pull/953#issuecomment-494582370
 
 
   > Thanks for you contribution @vladimirivic .
   > 
   > I don't fully understand what's the reasoning behind this change. It looks like you're
replacing `initLimit` with a new config value which doesn't make sense to me. Why not setting
`initLimit` to a lower value?
   
   That's a great question. Let me try to elaborate why this config option. 
   
   In the case of Learner.connectToLeader(), initLimit is used to timeout solely the socket
connect operation. Because this method retries up to 5 times, it may leave an ensemble without
quorum for too long. Let's say I was using initLimit=20. In case of network issues, this value
would leave ensemble without a quorum for almost 2 minutes.
   
   The problem can be remediated with use of smaller initLimit value. And I can set it to
initLimit=10, or even initLimit=5 so when there are network issues and if we wait for all
5 reties we don't spend to much time without having the quorum. 
   
   That would solve the problem for the socket but then it may cause issues inside Leader.lead()
because initLimit is used by these three methods after the leader gets elected and starts
leading:
     - Leader.getEpochToPropose()
     - Leader.waitForEpochAck()
     - Leader.waitForNewLeaderAck()
   
   If we go with initLimit smaller than what we tolerate, then that will increase the number
of the Exceptions when using ensembles that are not in the same region which will cause quorum
issues. Particularly these three from Leader.java: 
     - Timeout while waiting for NEWLEADER to be acked by quorum
     - Timeout while waiting for epoch from quorum 
     - Timeout while waiting for epoch to be acked by quorum
   
   I think it would be good to have a separate config to control socket connection timeout
without affecting the quorum timeout so they don't hinder each other.
   
   Hopefully this helps shed some light with this pull request. Also, if this sounds like
a good idea, maybe we should take into consideration other socket connect calls and use unified
config name to address this. Let me know what you think. 
   
   You can also touch base with @lvfangmin for more specific details about the problems we
have seen in production.
   
   

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on to GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


With regards,
Apache Git Services

Mime
View raw message