zookeeper-user mailing list archives

From Patrick Hunt <ph...@apache.org>
Subject Re: Zookeeper delay to reconnect
Date Thu, 27 Sep 2012 23:55:08 GMT
The random sleep was explicitly added to reduce herd effects and
general "spinning client" problems, IIRC. Keep in mind that ZK
generally trades off performance for availability. It wouldn't be a
good idea to remove it in general. If anything we should have a more
aggressive backoff policy for the case where clients are just spinning.

Perhaps a pluggable approach here? The default would be something like
what we already have, but users could implement their own policy if
they like. We could ship a few implementations out of the box: 1) the
current behavior, 2) no wait, 3) exponential backoff after trying each
server in the ensemble, etc. This would also allow for experimentation.
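A sketch of what such a pluggable policy might look like. Note that the interface and class names here (ReconnectPolicy and friends) are hypothetical, invented for illustration; nothing like this exists in the ZooKeeper client today.

```java
import java.util.Random;

// Hypothetical pluggable reconnect-delay policy. "attempt" counts failed
// connection attempts since the last successful connect; "ensembleSize"
// is the number of servers in the connect string.
interface ReconnectPolicy {
    long delayMs(int attempt, int ensembleSize);
}

// 1) Current behavior: a random sleep of up to 1 second before each attempt.
class RandomDelayPolicy implements ReconnectPolicy {
    private final Random r = new Random();
    public long delayMs(int attempt, int ensembleSize) {
        return r.nextInt(1000);
    }
}

// 2) No wait: reconnect immediately every time.
class NoWaitPolicy implements ReconnectPolicy {
    public long delayMs(int attempt, int ensembleSize) {
        return 0;
    }
}

// 3) Exponential backoff, but only after every server in the ensemble has
//    been tried once; capped so the sleep never grows unbounded.
class ExponentialBackoffPolicy implements ReconnectPolicy {
    private static final long MAX_DELAY_MS = 10_000;
    public long delayMs(int attempt, int ensembleSize) {
        if (attempt < ensembleSize) {
            return 0; // first pass through the ensemble: no delay
        }
        int rounds = attempt / ensembleSize;       // completed full passes
        long delay = 100L << Math.min(rounds, 6);  // 200, 400, 800, ... ms
        return Math.min(delay, MAX_DELAY_MS);
    }
}

public class BackoffDemo {
    public static void main(String[] args) {
        ReconnectPolicy p = new ExponentialBackoffPolicy();
        // First pass through a 3-server ensemble: no delay.
        System.out.println(p.delayMs(0, 3)); // 0
        System.out.println(p.delayMs(2, 3)); // 0
        // After the whole ensemble has failed once, start backing off.
        System.out.println(p.delayMs(3, 3)); // 200
        System.out.println(p.delayMs(6, 3)); // 400
    }
}
```

The point of the interface is exactly what's described above: the SendThread would call the policy instead of hard-coding Thread.sleep(r.nextInt(1000)), and users could drop in their own implementation.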


On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <michi@cs.stanford.edu> wrote:
> Hi Sergei,
> Your suggestion sounds reasonable to me. I think the sleep was added
> so that the client doesn't spin when the entire ZooKeeper ensemble is
> down. The client could try to connect to each server without sleeping,
> and sleep for 1 second only after failing to connect to all the
> servers in the cluster.
> Thanks!
> --Michi
> On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
> <sbabovich@demandware.com> wrote:
>> Hi,
>> Zookeeper implements a delay of up to 1 second before trying to reconnect.
>> ClientCnxn$SendThread
>>         @Override
>>         public void run() {
>>             ...
>>             while (state.isAlive()) {
>>                 try {
>>                     if (!clientCnxnSocket.isConnected()) {
>>                         if(!isFirstConnect){
>>                             try {
>>                                 Thread.sleep(r.nextInt(1000));
>>                             } catch (InterruptedException e) {
>>                                 LOG.warn("Unexpected exception", e);
>>                             }
>> This creates "outages" (even with a simple retry on ConnectionLoss) of up to
>> 1 second even with a perfectly healthy cluster, e.g. during a rolling
>> restart. In our scenario this could be a problem under high load, creating a
>> spike in the number of requests waiting on a ZooKeeper operation.
>> Would it be a better strategy to attempt the reconnect immediately at
>> least once? Or is there more to it?
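Michi's quoted suggestion (try every server back-to-back, and only sleep once the whole ensemble has failed) could be sketched roughly as below. The names here (EnsembleConnector, Connector, connectAny) are made up for illustration; the real logic lives in ClientCnxn$SendThread and the host provider.

```java
import java.util.List;

// Sketch: attempt each server with no artificial delay, and pay the
// 1-second sleep only after a full pass over the ensemble has failed,
// so a dead cluster doesn't make the client spin.
class EnsembleConnector {
    // Stand-in for the actual socket connect; returns true on success.
    interface Connector { boolean connect(String host); }

    static String connectAny(List<String> servers, Connector c)
            throws InterruptedException {
        while (true) {
            for (String host : servers) {
                if (c.connect(host)) {
                    return host; // connected on this pass: no delay paid
                }
            }
            // Entire ensemble unreachable: back off before the next pass.
            Thread.sleep(1000);
        }
    }
}

public class ConnectDemo {
    public static void main(String[] args) throws InterruptedException {
        List<String> servers = List.of("zk1:2181", "zk2:2181", "zk3:2181");
        // Simulated connector: only zk3 is up (e.g. mid rolling restart).
        String got = EnsembleConnector.connectAny(servers, h -> h.startsWith("zk3"));
        System.out.println(got); // zk3:2181
    }
}
```

In the rolling-restart scenario described above, this shape avoids the up-to-1-second stall entirely, since some server in the ensemble is always reachable on the first pass.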
