From: Rakesh R
To: user@zookeeper.apache.org
CC: German Blanco, michi@cs.stanford.edu
Subject: RE: adding a separate thread to detect network timeouts faster
Date: Thu, 12 Sep 2013 07:05:13 +0000

AFAIK, ping requests do not involve any disk I/O, but they do go through the
RequestProcessor chain and are executed sequentially.

There can be cases where another set of requests is already queued for
committing (say those requests need database/disk operations). If a ping
request now arrives from the client, it is queued at the end of that same
queue. In that case the pending requests delay the ping processing, resulting
in slow ping responses.

Here the server is slow because of I/O response time, and that affects the
client ping responses. In any case, after seeing the ping failure the client
will look for another server.

Earlier I tried stopping ping requests from entering the RequestProcessor
chain and instead sending the response back to the client directly. It has
the disadvantage of violating the request lifecycle. The interesting question
is how to differentiate slow servers from servers that are really down...

-Rakesh
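To make the queueing effect concrete, here is a minimal Java sketch (purely
hypothetical code, not ZooKeeper's RequestProcessor implementation) of a
single-threaded FIFO pipeline: a ping queued behind fsync-bound commit
requests is not answered until those commits finish, even though the ping
itself never touches disk.

    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch only -- not ZooKeeper code. One thread drains a
    // FIFO queue; a ping queued behind slow commits is delayed even though
    // the ping itself needs no disk I/O.
    public class PipelineSketch {

        static class Request {
            final boolean isPing;
            final long enqueuedAt = System.currentTimeMillis();
            Request(boolean isPing) { this.isPing = isPing; }
        }

        public static void main(String[] args) throws InterruptedException {
            final LinkedBlockingQueue<Request> queue =
                    new LinkedBlockingQueue<Request>();

            Thread processor = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            Request r = queue.take();
                            if (!r.isPing) {
                                Thread.sleep(2000); // stand-in for a slow fsync
                            }
                            System.out.println((r.isPing ? "PING" : "COMMIT")
                                    + " served after "
                                    + (System.currentTimeMillis() - r.enqueuedAt)
                                    + " ms");
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            processor.start();

            // Three disk-bound commits first, then a ping: the ping answer
            // arrives only after roughly 3 x 2000 ms of queueing delay.
            for (int i = 0; i < 3; i++) {
                queue.put(new Request(false));
            }
            queue.put(new Request(true));

            Thread.sleep(7000);     // let all four requests drain
            processor.interrupt();  // then stop the processor thread
        }
    }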
-----Original Message-----
From: mutsuzaki@gmail.com [mailto:mutsuzaki@gmail.com] On Behalf Of Michi Mutsuzaki
Sent: 12 September 2013 02:07
To: user@zookeeper.apache.org
Cc: German Blanco
Subject: Re: adding a separate thread to detect network timeouts faster

Slow disk does affect client <-> server ping requests since ping requests go
through the commit processor.

Here is how the current client <-> server ping request works. Say the session
timeout is set to 30 seconds.

1. The client sends a ping request if the session has been inactive for 10
   seconds (1/3 of the session timeout).
2. The client waits for the ping response for another 10 seconds (1/3 of the
   session timeout).
3. If the client doesn't receive the ping response after 10 seconds, it tries
   to connect to another server.

So in this case, it can take up to 20 seconds for the client to detect a
server failure. I think this 1/3 value was picked somewhat arbitrarily. Maybe
you can make it configurable for faster failure detection instead of
introducing another heartbeat mechanism?

--Michi
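To spell out that arithmetic, a small Java sketch, assuming the 1/3 fraction
were exposed as a configurable divisor (the pingDivisor knob is hypothetical,
not an existing ZooKeeper setting):

    // Hypothetical sketch: how worst-case failure detection would follow
    // from a configurable ping divisor. ZooKeeper hard-codes the 1/3
    // behavior described above; the divisor is an assumed knob.
    public class PingTiming {
        public static void main(String[] args) {
            int sessionTimeoutMs = 30000;
            int pingDivisor = 3;      // today's behavior: 1/3 of the timeout
            // int pingDivisor = 6;   // hypothetical faster detection

            int idleBeforePingMs  = sessionTimeoutMs / pingDivisor;    // 10000
            int waitForReplyMs    = sessionTimeoutMs / pingDivisor;    // 10000
            int worstCaseDetectMs = idleBeforePingMs + waitForReplyMs; // 20000

            System.out.println("idle before ping  : " + idleBeforePingMs + " ms");
            System.out.println("wait for reply    : " + waitForReplyMs + " ms");
            System.out.println("worst-case detect : " + worstCaseDetectMs + " ms");
        }
    }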
On Tue, Sep 10, 2013 at 11:32 PM, Jeremy Stribling wrote:
> Hi Germán,
>
> A very quick scan of that JIRA makes me think you're talking about
> server->server heartbeats, and not client->server heartbeats (which is what
> I'm talking about). I have not tested it explicitly or inspected that
> part of the code, but I've hit many cases in testing and production
> where client session expirations coincide with long fsync times as logged
> by the server.
>
> Jeremy
>
> On 09/10/2013 10:40 PM, German Blanco wrote:
>>
>> Hello Jeremy and all,
>>
>> my idea was that the current implementation of ping handling already
>> does not wait on disk I/O.
>> I am also working on a JIRA issue that is related to this:
>> https://issues.apache.org/jira/browse/ZOOKEEPER-87
>> And I have made some tests that seem to confirm that ping handling is
>> done in a different thread than transaction handling.
>> But I don't have any confirmation from anyone on this project. Are you
>> sure that ping handling waits on I/O for anything? Have you tested it?
>>
>> Regards,
>> Germán Blanco.
>>
>> On Tue, Sep 10, 2013 at 11:05 PM, Jeremy Stribling wrote:
>>
>>> Good suggestion, thanks. At the very least, I think what we have in
>>> mind would be off by default, so users could only turn it on if they
>>> know they have relatively few clients and slow disks. An adaptive
>>> scheme would be even better, obviously.
>>>
>>> On 09/10/2013 02:04 PM, Ted Dunning wrote:
>>>
>>>> Perhaps you should be suggesting a design that is adaptive rather
>>>> than configured and guarantees low overhead at the cost of
>>>> notification time in extreme scenarios.
>>>>
>>>> For instance, the server can send no more than 1000 (or whatever
>>>> number) HB's per second and never more than one per second to any
>>>> client. This caps the cost nicely.
>>>>
>>>> On Tue, Sep 10, 2013 at 1:59 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>
>>>> Since you are talking about client connection failure detection,
>>>> no, I don't think that there is a major barrier other than
>>>> actually implementing a reliable check.
>>>>
>>>> Keep in mind the cost. There are ZK installs with 100,000
>>>> clients. If these are heartbeating every 2 seconds, you have
>>>> 50,000 packets per second hitting the quorum, or 10,000 per server
>>>> if all connections are well balanced.
>>>>
>>>> If you only have 10 clients, the network burden is nominal.
>>>>
>>>> On Tue, Sep 10, 2013 at 1:34 PM, Jeremy Stribling wrote:
>>>>
>>>> I mostly agree, but let's assume that a ~5x speedup in
>>>> detecting those types of failures is considered significant
>>>> for some people. Are there technical reasons that would
>>>> prevent this idea from working?
>>>>
>>>> On 09/10/2013 01:31 PM, Ted Dunning wrote:
>>>>
>>>> I don't see the strong value here. A few failures would
>>>> be detected more quickly, but I am not convinced that this
>>>> would actually improve functionality significantly.
>>>>
>>>> On Tue, Sep 10, 2013 at 1:01 PM, Jeremy Stribling wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Let's assume that you wanted to deploy ZK in a virtualized environment,
>>>> despite all of the known drawbacks. Assume we could deploy it such that
>>>> the ZK servers were all using independent CPUs and storage (though not
>>>> dedicated disks). Obviously, the shared disks (shared with other, non-ZK
>>>> VMs on the same hypervisor) will cause ZK to hit the default session
>>>> timeout occasionally, so you would need to raise the existing session
>>>> timeout to something like 30 seconds.
>>>>
>>>> I'm curious whether there would be any technical drawbacks to adding an
>>>> additional heartbeat mechanism between the clients and the servers, which
>>>> would have the goal of detecting network-only failures faster than the
>>>> existing heartbeat mechanism. The idea is that there would be a new thread
>>>> dedicated to processing these heartbeats, which would not get blocked on
>>>> I/O. Then the clients could configure a second, smaller timeout value, and
>>>> it would be assumed that any such timeout indicated a real problem. The
>>>> existing mechanism would still be in place to catch I/O-related errors.
>>>>
>>>> I understand the philosophy that there should be some heartbeat mechanism
>>>> that takes the disk into account, but I'm having trouble coming up with
>>>> technical reasons not to add a second mechanism. Obviously, the advantage
>>>> would be that the clients could detect network failures and system crashes
>>>> more quickly in an environment with slow disks, and fail over to other
>>>> servers more quickly. The only disadvantages I can come up with are:
>>>>
>>>> 1) More code complexity, and slightly more heartbeat traffic on the wire
>>>> 2) I think the servers have to log session expirations to disk, so if
>>>> sessions expire at a faster rate than the disk can handle, it might lead
>>>> to a large backlog.
>>>>
>>>> Are there other drawbacks I am missing? Would a patch that added
>>>> something like this be considered, or is it dead from the start? Thanks,
>>>>
>>>> Jeremy
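As a rough illustration of the proposal above, here is a minimal Java sketch
of a dedicated heartbeat path (entirely hypothetical: the echo thread, the
port, and the secondary timeout are assumptions, not ZooKeeper features). The
server side does nothing but echo small UDP probes, so it can never be held
up by fsync; the client checks replies against a separate, smaller timeout,
while the normal session timeout still covers disk-related slowness.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.net.SocketTimeoutException;

    // Purely illustrative sketch of the proposal -- not ZooKeeper code.
    public class HeartbeatSketch {

        // Server side: a dedicated thread that only echoes probes and never
        // touches disk, so slow fsyncs cannot delay its responses.
        static void startEchoThread(final int port) throws Exception {
            final DatagramSocket socket = new DatagramSocket(port);
            Thread t = new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[8];
                    try {
                        while (true) {
                            DatagramPacket p = new DatagramPacket(buf, buf.length);
                            socket.receive(p);
                            socket.send(p); // echo straight back, no disk I/O
                        }
                    } catch (Exception e) {
                        // socket closed or fatal error: let the thread exit
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }

        // Client side: a missed reply within the smaller, secondary timeout
        // is treated as a network failure or crash.
        static boolean probe(InetAddress server, int port, int timeoutMs)
                throws Exception {
            DatagramSocket socket = new DatagramSocket();
            try {
                socket.setSoTimeout(timeoutMs);
                byte[] buf = new byte[8];
                socket.send(new DatagramPacket(buf, buf.length, server, port));
                socket.receive(new DatagramPacket(buf, buf.length));
                return true;   // reply in time: network path looks healthy
            } catch (SocketTimeoutException e) {
                return false;  // assume network failure or server crash
            } finally {
                socket.close();
            }
        }

        public static void main(String[] args) throws Exception {
            startEchoThread(12181); // hypothetical heartbeat port
            boolean alive = probe(InetAddress.getLoopbackAddress(), 12181, 2000);
            System.out.println("secondary heartbeat ok: " + alive);
        }
    }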