Return-Path: X-Original-To: apmail-zookeeper-user-archive@www.apache.org Delivered-To: apmail-zookeeper-user-archive@www.apache.org Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by minotaur.apache.org (Postfix) with SMTP id 1C89782EB for ; Tue, 13 Sep 2011 11:39:39 +0000 (UTC) Received: (qmail 56060 invoked by uid 500); 13 Sep 2011 11:39:38 -0000 Delivered-To: apmail-zookeeper-user-archive@zookeeper.apache.org Received: (qmail 55528 invoked by uid 500); 13 Sep 2011 11:39:37 -0000 Mailing-List: contact user-help@zookeeper.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: user@zookeeper.apache.org Delivered-To: mailing list user@zookeeper.apache.org Received: (qmail 55520 invoked by uid 99); 13 Sep 2011 11:39:34 -0000 Received: from nike.apache.org (HELO nike.apache.org) (192.87.106.230) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Sep 2011 11:39:34 +0000 X-ASF-Spam-Status: No, hits=1.5 required=5.0 tests=FREEMAIL_FROM,HTML_MESSAGE,NORMAL_HTTP_TO_IP,RCVD_IN_DNSWL_LOW,SPF_PASS,T_TO_NO_BRKTS_FREEMAIL,WEIRD_PORT X-Spam-Check-By: apache.org Received-SPF: pass (nike.apache.org: domain of ted.dunning@gmail.com designates 209.85.216.177 as permitted sender) Received: from [209.85.216.177] (HELO mail-qy0-f177.google.com) (209.85.216.177) by apache.org (qpsmtpd/0.29) with ESMTP; Tue, 13 Sep 2011 11:39:26 +0000 Received: by qyk31 with SMTP id 31so246409qyk.15 for ; Tue, 13 Sep 2011 04:39:05 -0700 (PDT) DKIM-Signature: v=1; a=rsa-sha256; c=relaxed/relaxed; d=gmail.com; s=gamma; h=mime-version:in-reply-to:references:from:date:message-id:subject:to :content-type; bh=06rMlOHtFT/7sH7WTn+vDVzRdxx8dHz8RhgeM3uNsL4=; b=qSajm80uQON1uxw0VMKVVv+0WT0UHXmBIY0EdVyIqgCnRiEnAS+sqv2AUw6EUlgOqK sGBSKEbROwFS6I/1bn//85Mz0ZaDdylRSFolQkHZFDF+YoAGwHwAk/e1jmGeiSLVn+9R tn31bm3kXE8Vi23qLb5IEcsrYe1CqQBd6lJQk= Received: by 10.224.198.65 with SMTP id en1mr482704qab.27.1315913945072; Tue, 13 Sep 2011 04:39:05 -0700 (PDT) MIME-Version: 1.0 Received: by 10.224.89.13 with HTTP; Tue, 13 Sep 2011 04:38:45 -0700 (PDT) In-Reply-To: <9694A6C3D68A4249BD9E1A875B6BA81E05B0F090D7@bos0ex01.corp.attivio.com> References: <9694A6C3D68A4249BD9E1A875B6BA81E05B0F090D7@bos0ex01.corp.attivio.com> From: Ted Dunning Date: Tue, 13 Sep 2011 11:38:45 +0000 Message-ID: Subject: Re: disconnects and auto renewal To: user@zookeeper.apache.org Content-Type: multipart/alternative; boundary=20cf300fb11b40458d04acd11771 X-Virus-Checked: Checked by ClamAV on apache.org --20cf300fb11b40458d04acd11771 Content-Type: text/plain; charset=UTF-8 Martin, >From your email, it sounds like there might be a bit of confusion between disconnection and session expiration. Are you sure you are clear on the difference between these? Also, I have seen cases in my own code where I confused myself by trying to re-create ephemeral files after a client program crashed. I knew that the client had crashed as soon as it happened, but the Zookeeper servers could only determine this after a bit of time. My new program tried to recreate the ephemerals to indicate that it was back but since the old ephemerals were still there, that failed. Then a short time later when the ZK cluster understood that the old client was gone, the ephemerals disappeared even though the new client was humming along nicely. My solution was to delete the ephemerals when creating them. Is it possible you have a similar confusion? On Tue, Sep 13, 2011 at 11:25 AM, Martin Serrano wrote: > Hi, > > We have added code to our application to reconnect and re-establish watches > when we receive a Disconnected event. I am running tests on a heavily > loaded system where the zookeeper server and clients are all impacted. On > this test system we regularly experience session timeouts and appropriately > react to reconnect and set up our watches. There is an uncommon case that I > am having trouble puzzling out. When running one of our tests in a loop > about 1% of the time we hit a case where on the client side we think the > session has expired but on the server side it has been renewed. We will > then fail to be able to create an ephemeral node because it already exists > and does not ever get cleaned up (since the previous session is still > valid). I'm trying to figure out if we are misusing the API or if we have > encountered a bug. I'm happy to provide more details. One thing I am > wondering is if it is inappropriate to create a new session within the event > thread of another session which has received the disconnected event. > > Thanks, > Martin Serrano > > The logs are filtered a little just to keep this email a reasonable size, > but everything relevant to the session in question is here. > > Associated client application logs (the reconnecting message is ours where > we re-establish watches, etc using a new session): > > 2011-09-12 18:49:55,729 INFO > [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Session level event, > session 1325fd35663000d: WatchedEvent state:Disconnected type:None path:null > 2011-09-12 18:49:55,729 INFO > [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Reconnecting, current > session 1325fd35663000d > 2011-09-12 18:49:55,729 INFO > [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Connecting to zookeeper > on localhost:16015 > 2011-09-12 18:50:08,823 INFO [main_master_10.2.1.100_16040] - obj[76a9b9c] > Reestablishing agent watches > 2011-09-12 18:50:09,120 WARN > [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - obj[1e5b04ae] Failed to > write status node multinode-monitor/master. Saw previous value null, > retrying > org.apache.zookeeper.KeeperException$NodeExistsException - KeeperErrorCode > = NodeExists for multinode-monitor/master > > Associated client zookeeper logs: > > 2011-09-12 18:48:15,995 INFO ClientCnxn > [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment > complete on server localhost/127.0.0.1:16015, sessionid = > 0x1325fd35663000d, negotiated timeout = 40000 > 2011-09-12 18:49:31,245 INFO ClientCnxn > [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Client session timed > out, have not heard from server in 26672ms for sessionid 0x1325fd35663000d, > closing socket connection and attempting reconnect > 2011-09-12 18:49:55,729 INFO ZooKeeper > [ZooKeeperClientCnxn--EventThread-1325fd35663000d] - Initiating client > connection, connectString=localhost:16015 sessionTimeout=600000 > watcher=com.attivio.global.AieZooKeeper@4f536eec =com.attivio.global.AieZooKeeper@4f536eec> > 2011-09-12 18:50:08,057 INFO ClientCnxn > [ZooKeeperClientCnxn--SendThread-1325fd35663000d] - Session establishment > complete on server localhost/127.0.0.1:16015, sessionid = > 0x1325fd35663000d, negotiated timeout = 40000 > > Associated server zookeeper logs: > > 2011-09-12 18:48:16,010 INFO NIOServerCnxn [SyncThread:0] - Established > session 0x1325fd35663000d with negotiated timeout 40000 for client / > 127.0.0.1:4257 > 2011-09-12 18:49:54,120 WARN NIOServerCnxn > [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Exception causing close > of session 0x1325fd35663000d due to java.io.IOException: An established > connection was aborted by the software in your host machine > 2011-09-12 18:49:54,120 INFO NIOServerCnxn > [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Closed socket > connection for client /127.0.0.1:4257 which had sessionid > 0x1325fd35663000d > 2011-09-12 18:50:08,057 INFO NIOServerCnxn > [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Client attempting to > renew session 0x1325fd35663000d at /127.0.0.1:4566 > 2011-09-12 18:50:08,057 INFO NIOServerCnxn > [NIOServerCxn.Factory:localhost/127.0.0.1:16015] - Established session > 0x1325fd35663000d with negotiated timeout 40000 for client /127.0.0.1:4566 > --20cf300fb11b40458d04acd11771--