zookeeper-user mailing list archives

From Sergei Babovich <sbabov...@demandware.com>
Subject Re: Zookeeper delay to reconnect
Date Fri, 28 Sep 2012 15:15:29 GMT
Thanks, Patrick!
On 09/27/2012 07:55 PM, Patrick Hunt wrote:
> The random sleep was explicitly added to reduce herd effects and
> general "spinning client" problems iirc. Keep in mind that ZK
> generally trades off performance for availability.
That's exactly my concern. It is not about performance: from the
client's point of view, the reconnect delay makes the cluster effectively
unavailable for up to a second. In scenarios with a relatively low number
of sessions (so herding is not a concern), each processing a lot of
requests, such a strategy can cause instability: there is no way to
gracefully handle intermittent errors caused by normal operational
procedures without risking the client's stability.
> It wouldn't be a
> good idea to remove it in general. If anything we should have a more
> aggressive backoff policy in the case where clients are just spinning.
> Perhaps a pluggable approach here? Where the default is something like
> what we already have, but allow users to implement their own policy if
> they like. We could have a few implementations "out of the box"; 1)
> current, 2) no wait, 3) exponential backoff after trying each server
> in the ensemble, etc... This would also allow for experimentation.
Totally agree - a customizable strategy would be the answer for
accommodating different requirements.
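
To make the idea concrete, here is a minimal sketch of what such a pluggable policy might look like. All names here (ReconnectDelayPolicy, RandomDelayPolicy, NoWaitPolicy, EnsembleBackoffPolicy) are hypothetical and are not part of the ZooKeeper client; the 100 ms base and the cap used in the usage note are arbitrary illustration values.

```java
import java.util.Random;

/** Hypothetical hook: decides how long to wait before the next reconnect attempt. */
interface ReconnectDelayPolicy {
    /**
     * @param failedAttempts consecutive failed connect attempts so far (1-based)
     * @param ensembleSize   number of servers in the connect string
     * @return milliseconds to sleep before the next attempt
     */
    long delayMillis(int failedAttempts, int ensembleSize);
}

/** 1) Mirrors the current behavior: a random sleep of up to 1 second. */
final class RandomDelayPolicy implements ReconnectDelayPolicy {
    private final Random random = new Random();

    @Override
    public long delayMillis(int failedAttempts, int ensembleSize) {
        return random.nextInt(1000); // 0..999 ms
    }
}

/** 2) No wait: reconnect immediately, every time. */
final class NoWaitPolicy implements ReconnectDelayPolicy {
    @Override
    public long delayMillis(int failedAttempts, int ensembleSize) {
        return 0L;
    }
}

/** 3) Exponential backoff, applied only after each full pass over the ensemble. */
final class EnsembleBackoffPolicy implements ReconnectDelayPolicy {
    private final long baseMillis;
    private final long maxMillis;

    EnsembleBackoffPolicy(long baseMillis, long maxMillis) {
        this.baseMillis = baseMillis;
        this.maxMillis = maxMillis;
    }

    @Override
    public long delayMillis(int failedAttempts, int ensembleSize) {
        // Within a pass there are still untried servers, so do not wait.
        if (failedAttempts <= 0 || ensembleSize <= 0
                || failedAttempts % ensembleSize != 0) {
            return 0L;
        }
        int completedPasses = failedAttempts / ensembleSize;
        int shift = Math.min(completedPasses - 1, 20); // bound the shift to avoid overflow
        return Math.min(maxMillis, baseMillis << shift);
    }
}
```

With a three-server ensemble, new EnsembleBackoffPolicy(100, 1000) would return 0 for the first two failures, 100 ms after the third, 200 ms after the sixth, and so on up to the 1000 ms cap.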
Just curious: does the randomized delay make a real difference here? Was
it a real issue somebody hit? I'd expect that randomizing the server
address to reconnect to should be enough: the load will be evenly
distributed across the rest of the cluster nodes and should not create a
problem, assuming the ZooKeeper cluster has enough capacity.
> Patrick
> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <michi@cs.stanford.edu> wrote:
>> Hi Sergei,
>> Your suggestion sounds reasonable to me. I think the sleep was added
>> so that the client doesn't spin when the entire zookeeper is down. The
>> client could try to connect to each server without sleep, and sleep
>> for 1 second only after failing to connect to all the servers in the
>> cluster.
>> Thanks!
>> --Michi
>> On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
>> <sbabovich@demandware.com> wrote:
>>> Hi,
>>> Zookeeper implements a delay of up to 1 second before trying to reconnect.
>>> ClientCnxn$SendThread
>>>          @Override
>>>          public void run() {
>>>              ...
>>>              while (state.isAlive()) {
>>>                  try {
>>>                      if (!clientCnxnSocket.isConnected()) {
>>>                          if(!isFirstConnect){
>>>                              try {
>>>                                  Thread.sleep(r.nextInt(1000));
>>>                              } catch (InterruptedException e) {
>>>                                  LOG.warn("Unexpected exception", e);
>>>                              }
>>> This creates "outages" (even with a simple retry on ConnectionLoss) of up
>>> to 1s even with a perfectly healthy cluster, as in the scenario of a
>>> rolling restart. In our scenario this might be a problem under high load,
>>> creating a spike in the number of requests waiting on a zk operation.
>>> Would it be a better strategy to perform the reconnect attempt immediately
>>> at least one time? Or is there more to it?
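
For reference, Michi's suggestion quoted above (try every server without sleeping, and sleep only after a full pass over the cluster has failed) could be sketched roughly as follows. This is an illustration only, not the actual ClientCnxn code: ReconnectLoopSketch, connectToAny, and the tryConnect callback are hypothetical names, and the connect attempt itself is abstracted into a predicate.

```java
import java.util.List;
import java.util.function.Predicate;

/** Hypothetical sketch of "sleep only after every server has been tried". */
final class ReconnectLoopSketch {

    /**
     * @param servers              server addresses to try, in order
     * @param tryConnect           hypothetical callback; returns true if the connect succeeded
     * @param maxPasses            how many full passes over the list to make
     * @param sleepAfterPassMillis pause between passes (1000 in Michi's suggestion)
     * @return the server that accepted the connection, or null if all passes failed
     */
    static String connectToAny(List<String> servers,
                               Predicate<String> tryConnect,
                               int maxPasses,
                               long sleepAfterPassMillis) throws InterruptedException {
        for (int pass = 0; pass < maxPasses; pass++) {
            // Try each server back to back, with no delay in between.
            for (String server : servers) {
                if (tryConnect.test(server)) {
                    return server;
                }
            }
            // The whole ensemble looks down: back off before the next pass
            // so the client does not spin.
            if (pass < maxPasses - 1) {
                Thread.sleep(sleepAfterPassMillis);
            }
        }
        return null;
    }
}
```

In a rolling restart only one server is down at a time, so under this scheme the client would normally reach a healthy server on the first pass without sleeping at all.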

