Return-Path: X-Original-To: apmail-zookeeper-user-archive@www.apache.org Delivered-To: apmail-zookeeper-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 02D3785FA for ; Tue, 13 Sep 2011 11:26:31 +0000 (UTC) Received: (qmail 42473 invoked by uid 500); 13 Sep 2011 11:26:30 -0000 Delivered-To: apmail-zookeeper-user-archive@zookeeper.apache.org Received: (qmail 41923 invoked by uid 500); 13 Sep 2011 11:26:23 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 41911 invoked by uid 99); 13 Sep 2011 11:26:18 -0000 Received: from athena.apache.org (HELO athena.apache.org) (140.211.11.136) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Sep 2011 11:26:18 +0000 X-ASF-Spam-Status: No, hits=2.2 required=5.0 tests=HTML_MESSAGE,SPF_PASS X-Spam-Check-By: apache.org Received-SPF: pass (athena.apache.org: domain of martin@attivio.com designates 66.227.116.198 as permitted sender) Received: from [66.227.116.198] (HELO mx01.attivio.com) (66.227.116.198) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Sep 2011 11:26:12 +0000 X-ASG-Debug-ID: 1315913150-cc090db1d78b-i9L50J Received: from bos0ex01.corp.attivio.com (bos0ex01.corp.attivio.com [10.1.1.81]) by mx01.attivio.com with ESMTP id AcAITobGEN77L0Md for ; Tue, 13 Sep 2011 07:25:50 -0400 (EDT) X-Barracuda-Envelope-From: martin@attivio.com Received: from bos0ex01.corp.attivio.com ([10.1.1.81]) by bos0ex01.corp.attivio.com ([10.1.1.81]) with mapi; Tue, 13 Sep 2011 07:25:50 -0400 From: Martin Serrano To: "user@zookeeper.apache.org" Date: Tue, 13 Sep 2011 07:25:49 -0400 Subject: disconnects and auto renewal Thread-Topic: disconnects and auto renewal X-ASG-Orig-Subj: disconnects and auto renewal Thread-Index: AcxxfbRb+VZn8eCaRWu58azgd563Og== Message-ID: <9694A6C3D68A4249BD9E1A875B6BA81E05B0F090D7@bos0ex01.corp.attivio.com> Accept-Language: en-US Content-Language: en-US X-MS-Has-Attach: X-MS-TNEF-Correlator: acceptlanguage: en-US Content-Type: multipart/alternative; boundary="_000_9694A6C3D68A4249BD9E1A875B6BA81E05B0F090D7bos0ex01corpa_" MIME-Version: 1.0 X-Barracuda-Connect: bos0ex01.corp.attivio.com[10.1.1.81] X-Barracuda-Start-Time: 1315913150 X-Barracuda-URL: http://10.1.1.7:8000/cgi-mod/mark.cgi X-Virus-Scanned: by bsmtpd at attivio.com X-Barracuda-Spam-Score: 0.00 X-Barracuda-Spam-Status: No, SCORE=0.00 using global scores of TAG_LEVEL=1000.0 QUARANTINE_LEVEL=1000.0 KILL_LEVEL=9.0 tests=HTML_MESSAGE X-Barracuda-Spam-Report: Code version 3.2, rules version 3.2.2.74398 Rule breakdown below pts rule name description ---- ---------------------- -------------------------------------------------- 0.00 HTML_MESSAGE BODY: HTML included in message --_000_9694A6C3D68A4249BD9E1A875B6BA81E05B0F090D7bos0ex01corpa_ Content-Type: text/plain; charset="us-ascii" Content-Transfer-Encoding: quoted-printable Hi, We have added code to our application to reconnect and re-establish watches= when we receive a Disconnected event. I am running tests on a heavily loa= ded system where the zookeeper server and clients are all impacted. On thi= s test system we regularly experience session timeouts and appropriately re= act to reconnect and set up our watches. There is an uncommon case that I = am having trouble puzzling out. When running one of our tests in a loop ab= out 1% of the time we hit a case where on the client side we think the sess= ion has expired but on the server side it has been renewed. We will then = fail to be able to create an ephemeral node because it already exists and d= oes not ever get cleaned up (since the previous session is still valid). I= 'm trying to figure out if we are misusing the API or if we have encountere= d a bug. I'm happy to provide more details. One thing I am wondering is = if it is inappropriate to create a new session within the event thread of a= nother session which has received the disconnected event. Thanks, Martin Serrano The logs are filtered a little just to keep this email a reasonable size,= but everything relevant to the session in question is here. Associated client application logs (the reconnecting message is ours where = we re-establish watches, etc using a new session): 2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd356630= 00d] - Session level event, session 1325fd35663000d: WatchedEvent state:Dis= connected type:None path:null 2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd356630= 00d] - Reconnecting, current session 1325fd35663000d 2011-09-12 18:49:55,729 INFO [ZooKeeperClientCnxn--EventThread-1325fd356630= 00d] - Connecting to zookeeper on localhost:16015 2011-09-12 18:50:08,823 INFO [main_master_10.2.1.100_16040] - obj[76a9b9c] = Reestablishing agent watches 2011-09-12 18:50:09,120 WARN [ZooKeeperClientCnxn--EventThread-1325fd356630= 00d] - obj[1e5b04ae] Failed to write status node multinode-monitor/master. = Saw previous value null, retrying org.apache.zookeeper.KeeperException$NodeExistsException - KeeperErrorCod= e =3D NodeExists for multinode-monitor/master Associated client zookeeper logs: 2011-09-12 18:48:15,995 INFO ClientCnxn [ZooKeeperClientCnxn--SendThread-1= 325fd35663000d] - Session establishment complete on server localhost/127.0.= 0.1:16015, sessionid =3D 0x1325fd35663000d, negotiated timeout =3D 40000 2011-09-12 18:49:31,245 INFO ClientCnxn [ZooKeeperClientCnxn--SendThread-1= 325fd35663000d] - Client session timed out, have not heard from server in 2= 6672ms for sessionid 0x1325fd35663000d, closing socket connection and attem= pting reconnect 2011-09-12 18:49:55,729 INFO ZooKeeper [ZooKeeperClientCnxn--EventThread-1= 325fd35663000d] - Initiating client connection, connectString=3Dlocalhost:1= 6015 sessionTimeout=3D600000 watcher=3Dcom.attivio.global.AieZooKeeper@4f53= 6eec 2011-09-12 18:50:08,057 INFO ClientCnxn [ZooKeeperClientCnxn--SendThread-1= 325fd35663000d] - Session establishment complete on server localhost/127.0.= 0.1:16015, sessionid =3D 0x1325fd35663000d, negotiated timeout =3D 40000 Associated server zookeeper logs: 2011-09-12 18:48:16,010 INFO NIOServerCnxn [SyncThread:0] - Established se= ssion 0x1325fd35663000d with negotiated timeout 40000 for client /127.0.0.1= :4257 2011-09-12 18:49:54,120 WARN NIOServerCnxn [NIOServerCxn.Factory:localhost= /127.0.0.1:16015] - Exception causing close of session 0x1325fd35663000d du= e to java.io.IOException: An established connection was aborted by the soft= ware in your host machine 2011-09-12 18:49:54,120 INFO NIOServerCnxn [NIOServerCxn.Factory:localhost= /127.0.0.1:16015] - Closed socket connection for client /127.0.0.1:4257 whi= ch had sessionid 0x1325fd35663000d 2011-09-12 18:50:08,057 INFO NIOServerCnxn [NIOServerCxn.Factory:localhost= /127.0.0.1:16015] - Client attempting to renew session 0x1325fd35663000d at= /127.0.0.1:4566 2011-09-12 18:50:08,057 INFO NIOServerCnxn [NIOServerCxn.Factory:localhost= /127.0.0.1:16015] - Established session 0x1325fd35663000d with negotiated t= imeout 40000 for client /127.0.0.1:4566 --_000_9694A6C3D68A4249BD9E1A875B6BA81E05B0F090D7bos0ex01corpa_--