From: Rakesh R
To: user@zookeeper.apache.org
CC: German Blanco, michi@cs.stanford.edu
Subject: RE: adding a separate thread to detect network timeouts faster
Date: Thu, 12 Sep 2013 07:05:13 +0000

AFAIK, ping requests do not involve any disk I/O, but they do go through the
RequestProcessor chain and are executed sequentially.

There can be cases where another set of requests is already queued for
committing (say those requests need database/disk operations). If a ping
request now arrives from the client, it is queued at the end of that same
queue. In that case the pending requests delay the ping processing, resulting
in slow ping responses.

Here the server is slow because of I/O response time, and that affects the
client ping responses. In any case, after seeing the ping failure the client
will look for another server.

Earlier I tried stopping ping requests from entering the RequestProcessor
chain and instead sending the response back to the client directly. It has
the disadvantage of violating the request lifecycle. The interesting question
is how to differentiate slow servers from servers that are really down...

-Rakesh
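To make the queueing effect concrete, here is a minimal Java sketch (purely
hypothetical code, not ZooKeeper's RequestProcessor implementation) of a
single-threaded FIFO pipeline: a ping queued behind fsync-bound commit
requests is not answered until those commits finish, even though the ping
itself never touches disk.

    import java.util.concurrent.LinkedBlockingQueue;

    // Hypothetical sketch only -- not ZooKeeper code. One thread drains a
    // FIFO queue; a ping queued behind slow commits is delayed even though
    // the ping itself needs no disk I/O.
    public class PipelineSketch {

        static class Request {
            final boolean isPing;
            final long enqueuedAt = System.currentTimeMillis();
            Request(boolean isPing) { this.isPing = isPing; }
        }

        public static void main(String[] args) throws InterruptedException {
            final LinkedBlockingQueue<Request> queue =
                    new LinkedBlockingQueue<Request>();

            Thread processor = new Thread(new Runnable() {
                public void run() {
                    try {
                        while (true) {
                            Request r = queue.take();
                            if (!r.isPing) {
                                Thread.sleep(2000); // stand-in for a slow fsync
                            }
                            System.out.println((r.isPing ? "PING" : "COMMIT")
                                    + " served after "
                                    + (System.currentTimeMillis() - r.enqueuedAt)
                                    + " ms");
                        }
                    } catch (InterruptedException e) {
                        Thread.currentThread().interrupt();
                    }
                }
            });
            processor.start();

            // Three disk-bound commits first, then a ping: the ping answer
            // arrives only after roughly 3 x 2000 ms of queueing delay.
            for (int i = 0; i < 3; i++) {
                queue.put(new Request(false));
            }
            queue.put(new Request(true));

            Thread.sleep(7000);     // let all four requests drain
            processor.interrupt();  // then stop the processor thread
        }
    }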
-----Original Message-----
From: mutsuzaki@gmail.com [mailto:mutsuzaki@gmail.com] On Behalf Of Michi Mutsuzaki
Sent: 12 September 2013 02:07
To: user@zookeeper.apache.org
Cc: German Blanco
Subject: Re: adding a separate thread to detect network timeouts faster

Slow disk does affect client <-> server ping requests since ping requests go
through the commit processor.

Here is how the current client <-> server ping request works. Say the session
timeout is set to 30 seconds.

1. The client sends a ping request if the session has been inactive for 10
   seconds (1/3 of the session timeout).
2. The client waits for the ping response for another 10 seconds (1/3 of the
   session timeout).
3. If the client doesn't receive the ping response after 10 seconds, it tries
   to connect to another server.

So in this case, it can take up to 20 seconds for the client to detect a
server failure. I think this 1/3 value was picked somewhat arbitrarily. Maybe
you can make it configurable for faster failure detection instead of
introducing another heartbeat mechanism?

--Michi
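To spell out that arithmetic, a small Java sketch, assuming the 1/3 fraction
were exposed as a configurable divisor (the pingDivisor knob is hypothetical,
not an existing ZooKeeper setting):

    // Hypothetical sketch: how worst-case failure detection would follow
    // from a configurable ping divisor. ZooKeeper hard-codes the 1/3
    // behavior described above; the divisor is an assumed knob.
    public class PingTiming {
        public static void main(String[] args) {
            int sessionTimeoutMs = 30000;
            int pingDivisor = 3;      // today's behavior: 1/3 of the timeout
            // int pingDivisor = 6;   // hypothetical faster detection

            int idleBeforePingMs  = sessionTimeoutMs / pingDivisor;    // 10000
            int waitForReplyMs    = sessionTimeoutMs / pingDivisor;    // 10000
            int worstCaseDetectMs = idleBeforePingMs + waitForReplyMs; // 20000

            System.out.println("idle before ping  : " + idleBeforePingMs + " ms");
            System.out.println("wait for reply    : " + waitForReplyMs + " ms");
            System.out.println("worst-case detect : " + worstCaseDetectMs + " ms");
        }
    }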
On Tue, Sep 10, 2013 at 11:32 PM, Jeremy Stribling wrote:
> Hi Germán,
>
> A very quick scan of that JIRA makes me think you're talking about
> server->server heartbeats, and not client->server heartbeats (which is what
> I'm talking about). I have not tested it explicitly or inspected that
> part of the code, but I've hit many cases in testing and production
> where client session expirations coincide with long fsync times as logged
> by the server.
>
> Jeremy
>
> On 09/10/2013 10:40 PM, German Blanco wrote:
>>
>> Hello Jeremy and all,
>>
>> my idea was that the current implementation of ping handling already
>> does not wait on disk I/O.
>> I am also working on a JIRA issue that is related to this:
>> https://issues.apache.org/jira/browse/ZOOKEEPER-87
>> And I have made some tests that seem to confirm that ping handling is
>> done in a different thread than transaction handling.
>> But I don't have any confirmation from anyone on this project. Are you
>> sure that ping handling waits on I/O for anything? Have you tested it?
>>
>> Regards,
>> Germán Blanco.
>>
>> On Tue, Sep 10, 2013 at 11:05 PM, Jeremy Stribling wrote:
>>
>>> Good suggestion, thanks. At the very least, I think what we have in
>>> mind would be off by default, so users could only turn it on if they
>>> know they have relatively few clients and slow disks. An adaptive
>>> scheme would be even better, obviously.
>>>
>>> On 09/10/2013 02:04 PM, Ted Dunning wrote:
>>>
>>>> Perhaps you should be suggesting a design that is adaptive rather
>>>> than configured and guarantees low overhead at the cost of
>>>> notification time in extreme scenarios.
>>>>
>>>> For instance, the server can send no more than 1000 (or whatever
>>>> number) HB's per second and never more than one per second to any
>>>> client. This caps the cost nicely.
>>>>
>>>> On Tue, Sep 10, 2013 at 1:59 PM, Ted Dunning <ted.dunning@gmail.com> wrote:
>>>>
>>>> Since you are talking about client connection failure detection,
>>>> no, I don't think that there is a major barrier other than
>>>> actually implementing a reliable check.
>>>>
>>>> Keep in mind the cost. There are ZK installs with 100,000
>>>> clients. If these are heartbeating every 2 seconds, you have
>>>> 50,000 packets per second hitting the quorum, or 10,000 per server
>>>> if all connections are well balanced.
>>>>
>>>> If you only have 10 clients, the network burden is nominal.
>>>>
>>>> On Tue, Sep 10, 2013 at 1:34 PM, Jeremy Stribling wrote:
>>>>
>>>> I mostly agree, but let's assume that a ~5x speedup in
>>>> detecting those types of failures is considered significant
>>>> for some people. Are there technical reasons that would
>>>> prevent this idea from working?
>>>>
>>>> On 09/10/2013 01:31 PM, Ted Dunning wrote:
>>>>
>>>> I don't see the strong value here. A few failures would
>>>> be detected more quickly, but I am not convinced that this
>>>> would actually improve functionality significantly.
>>>>
>>>> On Tue, Sep 10, 2013 at 1:01 PM, Jeremy Stribling wrote:
>>>>
>>>> Hi all,
>>>>
>>>> Let's assume that you wanted to deploy ZK in a virtualized environment,
>>>> despite all of the known drawbacks. Assume we could deploy it such that
>>>> the ZK servers were all using independent CPUs and storage (though not
>>>> dedicated disks). Obviously, the shared disks (shared with other, non-ZK
>>>> VMs on the same hypervisor) will cause ZK to hit the default session
>>>> timeout occasionally, so you would need to raise the existing session
>>>> timeout to something like 30 seconds.
>>>>
>>>> I'm curious whether there would be any technical drawbacks to adding an
>>>> additional heartbeat mechanism between the clients and the servers, which
>>>> would have the goal of detecting network-only failures faster than the
>>>> existing heartbeat mechanism. The idea is that there would be a new thread
>>>> dedicated to processing these heartbeats, which would not get blocked on
>>>> I/O. Then the clients could configure a second, smaller timeout value, and
>>>> it would be assumed that any such timeout indicated a real problem. The
>>>> existing mechanism would still be in place to catch I/O-related errors.
>>>>
>>>> I understand the philosophy that there should be some heartbeat mechanism
>>>> that takes the disk into account, but I'm having trouble coming up with
>>>> technical reasons not to add a second mechanism. Obviously, the advantage
>>>> would be that the clients could detect network failures and system crashes
>>>> more quickly in an environment with slow disks, and fail over to other
>>>> servers more quickly. The only disadvantages I can come up with are:
>>>>
>>>> 1) More code complexity, and slightly more heartbeat traffic on the wire
>>>> 2) I think the servers have to log session expirations to disk, so if
>>>> sessions expire at a faster rate than the disk can handle, it might lead
>>>> to a large backlog.
>>>>
>>>> Are there other drawbacks I am missing? Would a patch that added
>>>> something like this be considered, or is it dead from the start? Thanks,
>>>>
>>>> Jeremy
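As a rough illustration of the proposal above, here is a minimal Java sketch
of a dedicated heartbeat path (entirely hypothetical: the echo thread, the
port, and the secondary timeout are assumptions, not ZooKeeper features). The
server side does nothing but echo small UDP probes, so it can never be held
up by fsync; the client checks replies against a separate, smaller timeout,
while the normal session timeout still covers disk-related slowness.

    import java.net.DatagramPacket;
    import java.net.DatagramSocket;
    import java.net.InetAddress;
    import java.net.SocketTimeoutException;

    // Purely illustrative sketch of the proposal -- not ZooKeeper code.
    public class HeartbeatSketch {

        // Server side: a dedicated thread that only echoes probes and never
        // touches disk, so slow fsyncs cannot delay its responses.
        static void startEchoThread(final int port) throws Exception {
            final DatagramSocket socket = new DatagramSocket(port);
            Thread t = new Thread(new Runnable() {
                public void run() {
                    byte[] buf = new byte[8];
                    try {
                        while (true) {
                            DatagramPacket p = new DatagramPacket(buf, buf.length);
                            socket.receive(p);
                            socket.send(p); // echo straight back, no disk I/O
                        }
                    } catch (Exception e) {
                        // socket closed or fatal error: let the thread exit
                    }
                }
            });
            t.setDaemon(true);
            t.start();
        }

        // Client side: a missed reply within the smaller, secondary timeout
        // is treated as a network failure or crash.
        static boolean probe(InetAddress server, int port, int timeoutMs)
                throws Exception {
            DatagramSocket socket = new DatagramSocket();
            try {
                socket.setSoTimeout(timeoutMs);
                byte[] buf = new byte[8];
                socket.send(new DatagramPacket(buf, buf.length, server, port));
                socket.receive(new DatagramPacket(buf, buf.length));
                return true;   // reply in time: network path looks healthy
            } catch (SocketTimeoutException e) {
                return false;  // assume network failure or server crash
            } finally {
                socket.close();
            }
        }

        public static void main(String[] args) throws Exception {
            startEchoThread(12181); // hypothetical heartbeat port
            boolean alive = probe(InetAddress.getLoopbackAddress(), 12181, 2000);
            System.out.println("secondary heartbeat ok: " + alive);
        }
    }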