Return-Path: X-Original-To: apmail-zookeeper-user-archive@www.apache.org Delivered-To: apmail-zookeeper-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 6D8908A72 for ; Wed, 31 Aug 2011 01:49:04 +0000 (UTC) Received: (qmail 75597 invoked by uid 500); 31 Aug 2011 01:49:04 -0000 Delivered-To: apmail-zookeeper-user-archive@zookeeper.apache.org Received: (qmail 75438 invoked by uid 500); 31 Aug 2011 01:49:03 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 75430 invoked by uid 99); 31 Aug 2011 01:49:02 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 31 Aug 2011 01:49:02 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of junrao@gmail.com designates 74.125.82.170 as permitted sender) Received: from [74.125.82.170] (HELO mail-wy0-f170.google.com) (74.125.82.170) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 31 Aug 2011 01:48:56 +0000 Received: by wyf23 with SMTP id 23so170131wyf.15 for ; Tue, 30 Aug 2011 18:48:36 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:date:message-id:subject:from:to :content-type; bh=Ksjf/5g4SipQzKQb4lkit4vqo0RGvyfx20UfKwkvqMk=; b=gHWMRFi4rZveXcVfU6an4qz/UgGhn9hCrKvxti7D7Si1HG1b/f9SEGOi9FjoPkxKHa 1yp1DVFKC2zfnrxy5eElBfw1V1SQip83Swg3WU2SNmp6W0Yg68jYc/sqqxUcQlHZ9/ut vJL9PdhqHbdlnQAHmbd/qtDMcQpJaNq6AuRWw= MIME-Version: 1.0 Received: by 10.216.229.88 with SMTP id g66mr503121weq.9.1314755315846; Tue, 30 Aug 2011 18:48:35 -0700 (PDT) Received: by 10.216.170.134 with HTTP; Tue, 30 Aug 2011 18:48:35 -0700 (PDT) In-Reply-To: <69D3016305F9084FBD2C4A0DF189BD5C1772539992@GSCMAMP02EX.firmwide.corp.gs.com> References: <69D3016305F9084FBD2C4A0DF189BD5C1772539931@GSCMAMP02EX.firmwide.corp.gs.com> <201108291654.19951.thomas@koch.ro> <69D3016305F9084FBD2C4A0DF189BD5C1772539992@GSCMAMP02EX.firmwide.corp.gs.com> Date: Tue, 30 Aug 2011 18:48:35 -0700 Message-ID: Subject: Re: zk keeps disconnecting and reconnecting From: Jun Rao To: user@zookeeper.apache.org Content-Type: multipart/alternative; boundary=0016e6470f98914ef304abc35391 X-Virus-Checked: Checked by ClamAV on apache.org --0016e6470f98914ef304abc35391 Content-Type: text/plain; charset=ISO-8859-1 I was also wondering why our clients get disconnected in the first place since the ZK servers are all up. The following are the logs when the first disconnect happens. Does anyone know why the client can't seem to connect to most servers? Also, is "Session 0x1320765b0ac002e for server nulll" normal? Thanks, 2011/08/29 07:33:51.824 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002f, likely server h as closed socket, closing socket connection and attempting reconnect2011/08/29 07:33:51.824 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002e, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:51.990 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected)2011/08/29 07:33:52.019 INFO [ZkClient] [main-EventThread] [kafka] zookeeper state changed (Disconnected) 2011/08/29 07:33:52.092 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to server esv4-app29.stg/172.18.98.89:12913 2011/08/29 07:33:52.093 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Socket connection established to esv4-app29.stg/172.18.98.89:12913, initiating session 2011/08/29 07:33:52.094 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002f, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:52.652 INFO [ClientCnxn] [main-SendThread(esv4-app27.stg:12913)] [kafka] Opening socket connection to server esv4-app28.stg/172.18.98.101:12913 2011/08/29 07:33:52.652 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session2011/08/29 07:33:53.075 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002e, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:53.108 INFO [ClientCnxn] [main-SendThread(esv4-app29.stg:12913)] [kafka] Opening socket connection to server esv4-app28.stg/172.18.98.101:12913 2011/08/29 07:33:53.108 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Socket connection established to esv4-app28.stg/172.18.98.101:12913, initiating session 2011/08/29 07:33:53.109 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Unable to read additional data from server sessionid 0x1320765b0ac002f, likely server h as closed socket, closing socket connection and attempting reconnect 2011/08/29 07:33:53.577 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/08/29 07:33:53.578 WARN [ClientCnxn] [main-SendThread(esv4-app30.stg:12913)] [kafka] Session 0x1320765b0ac002e for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119) 2011/08/29 07:33:53.683 INFO [ClientCnxn] [main-SendThread(esv4-app28.stg:12913)] [kafka] Opening socket connection to server esv4-app30.stg/172.18.98.90:12913 2011/08/29 07:33:53.683 WARN [ClientCnxn] [main-SendThread(esv4-app30.stg:12913)] [kafka] Session 0x1320765b0ac002f for server null, unexpected error, closing socket connection and attempting reconnect java.net.ConnectException: Connection refused at sun.nio.ch.SocketChannelImpl.checkConnect(Native Method) at sun.nio.ch.SocketChannelImpl.finishConnect(SocketChannelImpl.java:574) at org.apache.zookeeper.ClientCnxn$SendThread.run(ClientCnxn.java:1119) Jun On Mon, Aug 29, 2011 at 10:50 AM, Fournier, Camille F. < Camille.Fournier@gs.com> wrote: > Well, it causes the problem you are seeing. If you set any watchers with a > chroot and then your client gets disconnected with these watches > outstanding, when you reconnect you will try to reset them and they are > probably on paths that don't exist (if you are creating everything under > path /kafka-tracking). So you get a notification about the watches > immediately after resetting them, which causes the string out of bounds > exception. > > The only fix is to disable auto watch reset, and then have your own client > reset watches when it gets a reconnected event. I suspect it would be easier > for you to take a shot at fixing the bug than to rewrite your client to > handle this. Thomas provided a patch with tests that presumably show the > error, so all you need is a fix to make them pass. > > > C > > -----Original Message----- > From: Jun Rao [mailto:junrao@gmail.com] > Sent: Monday, August 29, 2011 12:39 PM > To: user@zookeeper.apache.org; thomas@koch.ro > Subject: Re: zk keeps disconnecting and reconnecting > > What's the impact of ZOOKEEPER-961? If it shows up, does that mean the > client won't get any watcher events afterwards? If so, this sounds like a > blocker for 3.4 release to me. What's the temporary solution for 3.3.3? > > Also, for the very first time that the ZK client gets disconnected, I saw > the following entry in the log. It seems that the client can't ping the > server for 4 seconds. The ZK server was up at that time and the load was > minimal. What could cause the time out? Client GC pauses? > > 2011/08/26 10:58:22.306 INFO [ClientCnxn] > [main-SendThread(esv4-app27.stg:12913)] [kafka] Client session timed out, > have not heard from server in 4001ms for sessionid 0x131f > ddd84bc0006, closing socket connection and attempting reconnect > > Thanks, > > Jun > > On Mon, Aug 29, 2011 at 7:54 AM, Thomas Koch wrote: > > > Fournier, Camille F.: > > > Did anyone ever check resetting watches at client reconnect on a client > > > with a chroot? Looking at the code, we store the watches associated > with > > > the non-chroot path, but they are set by the original request > prepending > > > chroot to the request. However, it looks like the SetWatches request on > > > reconnect just calls get on the various watch lists from ZooKeeper, > which > > > don't have the prepended chroot. > > > > > > I haven't written a test but I would bet dollars to donuts this is the > > > problem. > > > > > > C > > seems to be this: > > ZOOKEEPER-961, ZOOKEEPER-1091 > > > > Regards, > > > > Thomas Koch, http://www.koch.ro > > > --0016e6470f98914ef304abc35391--