Hi Brian, well, in my proposal the default would be the current
behavior, with the zk operator having the discretion to change it, so it
shouldn't be any worse.
You've piqued my interest - a single client attempting to connect is
responsible for bringing down the entire cluster? Could you provide
more details?
Patrick
On Thu, Sep 27, 2012 at 4:58 PM, Brian Tarbox <briantarbox@gmail.com> wrote:
> I would lobby not to change this...I'm still occasionally dealing with
> clients spinning trying to connect...which brings down the whole cluster
> until that one client is killed.
>
> Brian
>
> On Thu, Sep 27, 2012 at 7:55 PM, Patrick Hunt <phunt@apache.org> wrote:
>
>> The random sleep was explicitly added to reduce herd effects and
>> general "spinning client" problems, iirc. Keep in mind that ZK
>> generally trades off performance for availability. It wouldn't be a
>> good idea to remove it in general. If anything, we should have a more
>> aggressive backoff policy for the case where clients are just spinning.
>>
>> Perhaps a pluggable approach here, where the default is something like
>> what we already have, but users can implement their own policy if
>> they like? We could ship a few implementations "out of the box": 1)
>> current, 2) no wait, 3) exponential backoff after trying each server
>> in the ensemble, etc. This would also allow for experimentation.
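>>
>> As a rough sketch (the interface and class names below are made up for
>> illustration, not an existing ZooKeeper API):
>>
>>     import java.util.Random;
>>
>>     /** Decides how long to sleep before the next connect attempt. */
>>     public interface ReconnectDelayPolicy {
>>         long delayMillis(int failedAttempts, int ensembleSize);
>>     }
>>
>>     // 1) current behavior: random sleep of up to 1s between attempts
>>     class RandomDelayPolicy implements ReconnectDelayPolicy {
>>         private final Random r = new Random();
>>         public long delayMillis(int failedAttempts, int ensembleSize) {
>>             return r.nextInt(1000);
>>         }
>>     }
>>
>>     // 3) exponential backoff, but only after a full pass over the ensemble
>>     class ExponentialBackoffPolicy implements ReconnectDelayPolicy {
>>         public long delayMillis(int failedAttempts, int ensembleSize) {
>>             if (failedAttempts < ensembleSize) {
>>                 return 0; // try every server once before backing off
>>             }
>>             int rounds = failedAttempts / ensembleSize;
>>             return Math.min(1000L << Math.min(rounds, 6), 60000L);
>>         }
>>     }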
>>
>> Patrick
>>
>> On Thu, Sep 27, 2012 at 2:28 PM, Michi Mutsuzaki <michi@cs.stanford.edu>
>> wrote:
>> > Hi Sergei,
>> >
>> > Your suggestion sounds reasonable to me. I think the sleep was added
>> > so that the client doesn't spin when the entire ZooKeeper ensemble is
>> > down. The client could try to connect to each server without sleeping,
>> > and sleep for 1 second only after failing to connect to all the
>> > servers in the cluster.
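>> >
>> > Roughly (tryEachServerOnce and connectTo below are made-up names, just
>> > to illustrate the idea, not existing ZooKeeper code):
>> >
>> >     // Attempt every server once with no delay; only back off for 1s
>> >     // after the whole ensemble has been tried and failed.
>> >     boolean tryEachServerOnce(String[] servers) throws InterruptedException {
>> >         for (String hostPort : servers) {
>> >             if (connectTo(hostPort)) {   // connectTo is a hypothetical helper
>> >                 return true;             // connected without paying any delay
>> >             }
>> >         }
>> >         Thread.sleep(1000);              // all servers failed; sleep 1 second
>> >         return false;
>> >     }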
>> >
>> > Thanks!
>> > --Michi
>> >
>> > On Thu, Sep 27, 2012 at 1:34 PM, Sergei Babovich
>> > <sbabovich@demandware.com> wrote:
>> >> Hi,
>> >> ZooKeeper implements a delay of up to 1 second before trying to
>> >> reconnect:
>> >>
>> >> ClientCnxn$SendThread
>> >>
>> >>     @Override
>> >>     public void run() {
>> >>         ...
>> >>         while (state.isAlive()) {
>> >>             try {
>> >>                 if (!clientCnxnSocket.isConnected()) {
>> >>                     if (!isFirstConnect) {
>> >>                         try {
>> >>                             Thread.sleep(r.nextInt(1000));
>> >>                         } catch (InterruptedException e) {
>> >>                             LOG.warn("Unexpected exception", e);
>> >>                         }
>> >>                     }
>> >>
>> >> This creates "outages" (even with a simple retry on ConnectionLoss) of
>> >> up to 1s even with a perfectly healthy cluster, e.g. during a rolling
>> >> restart. In our scenario this could be a problem under high load,
>> >> creating a spike in the number of requests waiting on a zk operation.
>> >> Would it be a better strategy to perform the reconnect attempt
>> >> immediately at least once? Or is there more to it?
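>> >>
>> >> For illustration, the kind of change I have in mind (firstAttemptAfterLoss
>> >> is a made-up flag, not existing code):
>> >>
>> >>     if (!clientCnxnSocket.isConnected()) {
>> >>         // skip the random sleep on the first attempt after a connection
>> >>         // loss; only subsequent retries pay the up-to-1s delay
>> >>         if (!isFirstConnect && !firstAttemptAfterLoss) {
>> >>             try {
>> >>                 Thread.sleep(r.nextInt(1000));
>> >>             } catch (InterruptedException e) {
>> >>                 LOG.warn("Unexpected exception", e);
>> >>             }
>> >>         }
>> >>         firstAttemptAfterLoss = false; // would be reset to true when the connection drops
>> >>         ...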
>>
>
>
>
> --
> http://about.me/BrianTarbox