Return-Path: Delivered-To: apmail-zookeeper-user-archive@www.apache.org Received: (qmail 16979 invoked from network); 11 Feb 2011 23:10:47 -0000 Received: from hermes.apache.org (HELO mail.apache.org) (140.211.11.3) by minotaur.apache.org with SMTP; 11 Feb 2011 23:10:47 -0000 Received: (qmail 86653 invoked by uid 500); 11 Feb 2011 23:10:47 -0000 Delivered-To: apmail-zookeeper-user-archive@zookeeper.apache.org Received: (qmail 86543 invoked by uid 500); 11 Feb 2011 23:10:47 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 86535 invoked by uid 99); 11 Feb 2011 23:10:47 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 11 Feb 2011 23:10:47 +0000 X-ASF-Spam-Status: No, hits=-5.0 required=5.0 tests=RCVD_IN_DNSWL_HI,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of Nicholas.Patania@morganstanley.com designates 199.89.103.68 as permitted sender) Received: from [199.89.103.68] (HELO pimtaint01.ms.com) (199.89.103.68) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 11 Feb 2011 23:10:38 +0000 Received: from pimtaint01.ms.com (localhost.ms.com [127.0.0.1]) by pimtaint01.ms.com (output Postfix) with ESMTP id 11F16304864 for ; Fri, 11 Feb 2011 18:10:18 -0500 (EST) X-Anti-Virus: Kaspersky Anti-Virus for Linux Mail Server 5.6.44/RELEASE, bases: 20110211 #4788812, check: 20110211 clean Received: from ny0031as02 (unknown [170.74.93.53]) by pimtaint01.ms.com (internal Postfix) with ESMTP id 101B2304861 for ; Fri, 11 Feb 2011 18:10:18 -0500 (EST) Received: from ny0031as02 (localhost [127.0.0.1]) by ny0031as02 (msa-out Postfix) with ESMTP id F2C90E98232 for ; Fri, 11 Feb 2011 18:10:17 -0500 (EST) Received: from hqmsaext01.ms.com (hqmsaext01 [205.228.53.70]) by ny0031as02 (mta-in Postfix) with ESMTP id EED2C758037 for ; Fri, 11 Feb 2011 18:10:17 -0500 (EST) Received: from hqmsaext01.ms.com (localhost.ms.com [127.0.0.1]) by hqmsaext01.ms.com (output Postfix) with ESMTP id DFFAE146C06D for ; Fri, 11 Feb 2011 18:10:17 -0500 (EST) X-Anti-Virus: Kaspersky Anti-Virus for Linux Mail Server 5.6.44/RELEASE, bases: 20110211 #4788812, check: 20110211 clean Received: from mail-bw0-f41.google.com (mail-bw0-f41.google.com [209.85.214.41]) (using TLSv1 with cipher RC4-SHA (128/128 bits)) (Client CN "smtp.gmail.com", Issuer "Google Internet Authority" (verified OK)) by hqmsaext01.ms.com (submission Postfix) with ESMTPS id 96ABA146C067 for ; Fri, 11 Feb 2011 18:10:17 -0500 (EST) Received: by bwz16 with SMTP id 16so3672422bwz.0 for ; Fri, 11 Feb 2011 15:10:16 -0800 (PST) Received: by 10.204.124.6 with SMTP id s6mr984165bkr.51.1297465815958; Fri, 11 Feb 2011 15:10:15 -0800 (PST) MIME-Version: 1.0 Received: by 10.204.102.132 with HTTP; Fri, 11 Feb 2011 15:09:55 -0800 (PST) In-Reply-To: <69D3016305F9084FBD2C4A0DF189BD5C16B7F384FD@GSCMAMP02EX.firmwide.corp.gs.com> References: <6642FC1CAF133548AA8FDF497C547F0A23C0C5265B@NYWEXMBX2126.msad.ms.com> <4D548CC9.6090102@yahoo-inc.com> <69D3016305F9084FBD2C4A0DF189BD5C16B7F384FD@GSCMAMP02EX.firmwide.corp.gs.com> From: Nick Patania Date: Fri, 11 Feb 2011 18:09:55 -0500 Message-ID: Subject: Re: tickTime and sessionTimeout To: user@zookeeper.apache.org Content-Type: text/plain; charset=ISO-8859-1 Content-Transfer-Encoding: quoted-printable X-Virus-Checked: Checked by ClamAV on apache.org A single server. I've intentionally made it trivial to demonstrate the behavior. If I increase the timeout, the issue goes away. On Fri, Feb 11, 2011 at 6:00 PM, Fournier, Camille F. [Tech] wrote: > What is your ZooKeeper setup here? And do you continue to see this issue = if you increase your session timeout? > > C > > -----Original Message----- > From: Nick Patania [mailto:Nicholas.Patania@morganstanley.com] > Sent: Friday, February 11, 2011 5:24 PM > To: user@zookeeper.apache.org > Subject: Re: tickTime and sessionTimeout > > Patrick, thanks for your input. > > I have rerun the test several times now while logging GC on the server > and running ping from CLIENT_1: > > =A0- No GC happens on the server during the period of interest (a > couple of young generation runs happen before I kill HOST_0, and they > complete in under 3ms). > =A0- Round trip times for ping from CLIENT_1 are consistently under > 250us throughout. > > Regarding client GC -- I can consistently reproduce this using a C > client. =A0Regarding the theory of swapping on CLIENT_1 -- if that were > the cause, the problem wouldn't be 100% reproducible. =A0I also looked > through the client log for SESSION_1 -- I see "Got ping response ... > after 1ms" repeatedly, followed by "Client session timed out, have not > heard from server in 666ms"... > > On Fri, Feb 11, 2011 at 1:46 PM, Patrick Hunt wrote: >> Those are pretty short timeouts, many sources of delay could be >> causing this. Network jitter/latency, GC/swap (server or client), IO >> write latency, etc... See if any of this might be your issue: >> https://cwiki.apache.org/confluence/display/ZOOKEEPER/Troubleshooting >> >> I can clearly see this sequence in your log for session1: >> >> --- >> 2011-02-11 14:18:52,672 - =A0sessionid:SESSION_1 type:ping >> >> 2011-02-11 14:18:54,502 - Expiring session SESSION_1, timeout of 1000ms = exceeded >> >> 2011-02-11 14:18:55,011 - Processing request:: sessionid:SESSION_1 type:= ping >> --- >> >> from the looks of it session 1 doesn't send a ping to the server for >> ~2.5 seconds, as a result it's expired. >> >> You should also look at your session 1 client log and see what it's >> view of the world is like. (is it gc/swapping?). Use the "stat" 4 >> letter word to see the server's request processing latency, see if >> that's high (higer than the timeout is bad news). Again, checkout the >> troubleshooting guide. >> >> Patrick >> >> ps please use pastebin or attachment, otw the formatting of wrecked >> and it's harder to read the log >> >> On Fri, Feb 11, 2011 at 6:41 AM, Nick Patania >> wrote: >>> This is the portion that seems relevant. =A0For readability, I replaced >>> the host and session for the host that I kill with HOST_0 and >>> SESSION_0 (I expect these to timeout). =A0The client that should be >>> healthy is HOST_1 and SESSION_1. >>> >>> >>> 2011-02-11 14:18:51,901 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@78] - Processing request:: >>> sessionid:SESSION_0 type:ping cxid:0xfffffffffffffffe >>> zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:51,901 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@160] - sessionid:SESSION_0 >>> type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe >>> txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:52,005 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@78] - Processing request:: >>> sessionid:SESSION_1 type:ping cxid:0xfffffffffffffffe >>> zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:52,005 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@160] - sessionid:SESSION_1 >>> type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe >>> txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:52,339 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@78] - Processing request:: >>> sessionid:SESSION_1 type:ping cxid:0xfffffffffffffffe >>> zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:52,339 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@160] - sessionid:SESSION_1 >>> type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe >>> txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:52,672 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@78] - Processing request:: >>> sessionid:SESSION_1 type:ping cxid:0xfffffffffffffffe >>> zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:52,672 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@160] - sessionid:SESSION_1 >>> type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe >>> txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:53,002 - INFO =A0[SessionTracker:ZooKeeperServer@314] = - >>> Expiring session SESSION_0, timeout of 1000ms exceeded >>> 2011-02-11 14:18:53,002 - INFO >>> [ProcessThread:-1:PrepRequestProcessor@387] - Processed session >>> termination for sessionid: SESSION_0 >>> 2011-02-11 14:18:53,010 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@78] - Processing request:: >>> sessionid:SESSION_0 type:closeSession cxid:0x0 zxid:0x103 txntype:-11 >>> reqpath:n/a >>> 2011-02-11 14:18:53,010 - INFO =A0[SyncThread:0:NIOServerCnxn@1435] - >>> Closed socket connection for client /HOST_0:34618 which had sessionid >>> SESSION_0 >>> 2011-02-11 14:18:54,502 - INFO =A0[SessionTracker:ZooKeeperServer@314] = - >>> Expiring session SESSION_1, timeout of 1000ms exceeded >>> 2011-02-11 14:18:54,502 - INFO >>> [ProcessThread:-1:PrepRequestProcessor@387] - Processed session >>> termination for sessionid: SESSION_1 >>> 2011-02-11 14:18:55,011 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@78] - Processing request:: >>> sessionid:SESSION_1 type:ping cxid:0xfffffffffffffffe >>> zxid:0xfffffffffffffffe txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:55,011 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@160] - sessionid:SESSION_1 >>> type:ping cxid:0xfffffffffffffffe zxid:0xfffffffffffffffe >>> txntype:unknown reqpath:n/a >>> 2011-02-11 14:18:55,011 - INFO >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:NIOServerCnxn$Factory@251] >>> - Accepted socket connection from /HOST_1:40556 >>> 2011-02-11 14:18:55,019 - DEBUG >>> [SyncThread:0:FinalRequestProcessor@78] - Processing request:: >>> sessionid:SESSION_1 type:closeSession cxid:0x0 zxid:0x104 txntype:-11 >>> reqpath:n/a >>> 2011-02-11 14:18:55,019 - DEBUG >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:ZooKeeperServer@590] - >>> Dropping request: No session with sessionid SESSION_1 exists, probably >>> expired and removed >>> 2011-02-11 14:18:55,019 - INFO =A0[SyncThread:0:NIOServerCnxn@1435] - >>> Closed socket connection for client /HOST_1:40555 which had sessionid >>> SESSION_1 >>> 2011-02-11 14:18:55,020 - DEBUG [SyncThread:0:NIOServerCnxn@1451] - >>> ignoring exception during output shutdown >>> java.net.SocketException: Transport endpoint is not connected >>> =A0 =A0 =A0 =A0at sun.nio.ch.SocketChannelImpl.shutdown(Native Method) >>> =A0 =A0 =A0 =A0at sun.nio.ch.SocketChannelImpl.shutdownOutput(SocketCha= nnelImpl.java:651) >>> =A0 =A0 =A0 =A0at sun.nio.ch.SocketAdaptor.shutdownOutput(SocketAdaptor= .java:368) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn.closeSock(N= IOServerCnxn.java:1447) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn.close(NIOSe= rverCnxn.java:1412) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn$Factory.clo= seSessionWithoutWakeup(NIOServerCnxn.java:343) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn$Factory.clo= seSession(NIOServerCnxn.java:330) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.FinalRequestProcessor.pro= cessRequest(FinalRequestProcessor.java:133) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.SyncRequestProcessor.flus= h(SyncRequestProcessor.java:161) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.SyncRequestProcessor.run(= SyncRequestProcessor.java:98) >>> 2011-02-11 14:18:55,021 - DEBUG [SyncThread:0:NIOServerCnxn@1459] - >>> ignoring exception during input shutdown >>> java.net.SocketException: Transport endpoint is not connected >>> =A0 =A0 =A0 =A0at sun.nio.ch.SocketChannelImpl.shutdown(Native Method) >>> =A0 =A0 =A0 =A0at sun.nio.ch.SocketChannelImpl.shutdownInput(SocketChan= nelImpl.java:640) >>> =A0 =A0 =A0 =A0at sun.nio.ch.SocketAdaptor.shutdownInput(SocketAdaptor.= java:360) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn.closeSock(N= IOServerCnxn.java:1455) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn.close(NIOSe= rverCnxn.java:1412) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn$Factory.clo= seSessionWithoutWakeup(NIOServerCnxn.java:343) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn$Factory.clo= seSession(NIOServerCnxn.java:330) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.FinalRequestProcessor.pro= cessRequest(FinalRequestProcessor.java:133) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.SyncRequestProcessor.flus= h(SyncRequestProcessor.java:161) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.SyncRequestProcessor.run(= SyncRequestProcessor.java:98) >>> 2011-02-11 14:18:55,022 - WARN >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:NIOServerCnxn$Factory@272] >>> - Ignoring unexpected runtime exception >>> java.nio.channels.CancelledKeyException >>> =A0 =A0 =A0 =A0at sun.nio.ch.SelectionKeyImpl.ensureValid(SelectionKeyI= mpl.java:55) >>> =A0 =A0 =A0 =A0at sun.nio.ch.SelectionKeyImpl.readyOps(SelectionKeyImpl= .java:69) >>> =A0 =A0 =A0 =A0at org.apache.zookeeper.server.NIOServerCnxn$Factory.run= (NIOServerCnxn.java:241) >>> 2011-02-11 14:18:55,023 - DEBUG >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:NIOServerCnxn@735] - >>> Session establishment request from client /HOST_1:40556 client's >>> lastZxid is 0x0 >>> 2011-02-11 14:18:55,023 - INFO >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:NIOServerCnxn@770] - Client >>> attempting to renew session SESSION_1 at /HOST_1:40556 >>> 2011-02-11 14:18:55,024 - INFO >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:NIOServerCnxn@1573] - >>> Invalid session SESSION_1 for client /HOST_1:40556, probably expired >>> 2011-02-11 14:18:55,025 - WARN >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:NIOServerCnxn@634] - >>> EndOfStreamException: Unable to read additional data from client >>> sessionid SESSION_1, likely client has closed socket >>> 2011-02-11 14:18:55,025 - INFO >>> [NIOServerCxn.Factory:0.0.0.0/0.0.0.0:4799:NIOServerCnxn@1435] - >>> Closed socket connection for client /HOST_1:40556 which had sessionid >>> SESSION_1 >>> >>> >>> On Thu, Feb 10, 2011 at 8:11 PM, Benjamin Reed wr= ote: >>>> do you see anything in the server log? >>>> >>>> ben >>>> >>>> On 02/10/2011 03:16 PM, Patania, Nick wrote: >>>>> >>>>> Hello, >>>>> >>>>> I run the following test: >>>>> >>>>> >>>>> * =A0 =A0 =A0 =A0 Two clients connect to a zookeeper server; the tick= Time on the >>>>> server is 500, and the sessionTimeout on the client is 1000. >>>>> >>>>> * =A0 =A0 =A0 =A0 Kill the host running one of the clients. >>>>> >>>>> * =A0 =A0 =A0 =A0 The second client receives a session timeout. >>>>> >>>>> Is there any reason why this might happen? >>>>> >>>>> Thanks >>>>> Nick Patania >>>>> >>>>> >>>>> ---------------------------------------------------------------------= ----- >>>>> NOTICE: Morgan Stanley is not acting as a municipal advisor and the >>>>> opinions or views contained herein are not intended to be, and do not >>>>> constitute, advice within the meaning of Section 975 of the Dodd-Fran= k Wall >>>>> Street Reform and Consumer Protection Act. If you have received this >>>>> communication in error, please destroy all electronic and paper copie= s and >>>>> notify the sender immediately. Mistransmission is not intended to wai= ve >>>>> confidentiality or privilege. Morgan Stanley reserves the right, to t= he >>>>> extent permitted under applicable law, to monitor electronic communic= ations. >>>>> This message is subject to terms available at the following link: >>>>> http://www.morganstanley.com/disclaimers. If you cannot access these = links, >>>>> please notify us by reply message and we will send the contents to yo= u. By >>>>> messaging with Morgan Stanley you consent to the foregoing. >>>> >>>> >>> >> >