Date: Thu, 21 Apr 2011 15:45:23 -0500
Subject: Re: Unexpected behavior with Session Timeouts in Java Client
From: Scott Fines
To: user@zookeeper.apache.org

Ryan,

That is a fair point: I would get consistency of services, in the sense that I could be fairly sure only one service is running at a time. However, my particular application demands are such that simply stopping and restarting on disconnected events is not a good idea.

What I'm writing is a connector between two data centers, where the measured latency is on the order of seconds. Each time a service connects, it must transfer (hopefully only a few) megabytes of data, which I've measured to take on the order of minutes. On the other hand, it is not unusual for us to receive a disconnected event every now and then, and it is generally resolved within milliseconds. Clearly, I don't want to repeat a minutes-long process every time we get a milliseconds-long disconnection that does not remove the service's existing leadership.

So, when the leader receives a disconnected event, it queues up events to process, but holds on to its connections and continues to receive events while it waits for the connection to ZK to be re-established.
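A minimal sketch of that buffering behavior, assuming a single leader process: while disconnected, the leader keeps receiving events but parks them in a queue, and once the connection is back it drains the queue in arrival order. To keep the example self-contained, `ConnectionState` is a local stand-in for the ZooKeeper client's `Watcher.Event.KeeperState` values of the same names; in a real client you would translate `event.getState()` inside your `Watcher.process(WatchedEvent)` and call `onStateChange`. `BufferingLeader`, `onEvent`, and `process` are hypothetical names, and `process` only records the event in place of the real cross-data-center work.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

/**
 * Hypothetical leader that keeps its expensive data-center connections open
 * across a brief ZooKeeper disconnect, buffering work until it reconnects.
 * ConnectionState mirrors two Watcher.Event.KeeperState names; the expiry
 * path is deliberately not modeled here.
 */
public class BufferingLeader {
    enum ConnectionState { SyncConnected, Disconnected }

    private final Queue<String> pending = new ArrayDeque<>();
    private final List<String> processed = new ArrayList<>();
    private boolean paused = false;

    /** Called from the session watcher when the connection state changes. */
    public synchronized void onStateChange(ConnectionState state) {
        switch (state) {
            case Disconnected:
                // Leadership is not lost yet: stop acting on events, but keep
                // the cross-data-center connections open.
                paused = true;
                break;
            case SyncConnected:
                // Reconnected: drain the backlog in arrival order and carry on.
                paused = false;
                while (!pending.isEmpty()) {
                    process(pending.poll());
                }
                break;
        }
    }

    /** Events keep arriving from the remote side even while paused. */
    public synchronized void onEvent(String event) {
        if (paused) {
            pending.add(event);
        } else {
            process(event);
        }
    }

    private void process(String event) {
        processed.add(event); // stand-in for the real forwarding work
    }

    public synchronized List<String> processed() { return new ArrayList<>(processed); }

    public synchronized int backlog() { return pending.size(); }

    public static void main(String[] args) {
        BufferingLeader leader = new BufferingLeader();
        leader.onEvent("a");
        leader.onStateChange(ConnectionState.Disconnected); // millisecond blip
        leader.onEvent("b");
        leader.onEvent("c");
        System.out.println(leader.processed() + " backlog=" + leader.backlog());
        leader.onStateChange(ConnectionState.SyncConnected); // session is back
        System.out.println(leader.processed() + " backlog=" + leader.backlog());
    }
}
```

The queue is unbounded, which is only reasonable because disconnects are expected to last milliseconds; what happens if the session expires instead is outside this sketch.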
If the connection to ZK comes back online within the session timeout window, the leader simply turns processing back on as if nothing happened. However, if the session times out, the client must cut all of its connections and kill itself with fire, rather than risk overwriting what the next leader does. The next leader then has to go through the expensive process of starting the service back up.

Hopefully that gives some color on why I'm concerned about this situation.

Thanks,
Scott

On Thu, Apr 21, 2011 at 2:53 PM, Ryan Kennedy wrote:

> Scott:
>
> the right answer in this case is for the leader to watch for the
> "disconnected" event and shut down. If the connection re-establishes,
> the leader should still be the leader (their ephemeral sequential node
> should still be there), in which case it can go back to work. If the
> connection doesn't re-establish, one of two things may happen…
>
> 1) Your leader stays in the disconnected state (because it's unable to
> reconnect), meanwhile the zookeeper server expires the session
> (because it hasn't seen a heartbeat), deletes the ephemeral sequential
> node and a new worker is promoted to leader.
>
> 2) Your leader quickly transitions to the expired state, the ephemeral
> node is lost and a new worker is promoted to leader.
>
> In both cases, your initial leader should see a disconnected event
> first. If it shuts down when it sees that event, you should be
> relatively safe in thinking that you only have one worker going at a
> time.
>
> Once your initial leader sees the expiration event, it can try to
> reconnect to the ensemble, create the new ephemeral sequential node
> and get back into the queue for being a leader.
>
> Ryan
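The approach Ryan suggests in the quoted message can be sketched as a small state machine: stop work on Disconnected, resume on SyncConnected only if the leader's ephemeral sequential node is still there, and on Expired stay stopped and get back in line for leadership. `SessionState` mirrors three of the ZooKeeper client's `Watcher.Event.KeeperState` names (`SyncConnected`, `Disconnected`, `Expired`) so the example runs without the ZooKeeper jar. The two callbacks are hypothetical hooks, not ZooKeeper APIs: `ownNodeStillExists` might wrap an `exists()` call on the leader's znode, and `rejoinElection` would create a new `ZooKeeper` handle (an expired one cannot be reused) plus a new ephemeral sequential node.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.function.BooleanSupplier;

/**
 * Hypothetical leader that stops work as soon as ZooKeeper reports Disconnected.
 * SessionState mirrors Watcher.Event.KeeperState names; both hooks are assumptions.
 */
public class StopOnDisconnectLeader {
    enum SessionState { SyncConnected, Disconnected, Expired }

    private final BooleanSupplier ownNodeStillExists; // hypothetical: exists() on our znode
    private final Runnable rejoinElection;            // hypothetical: new handle + new znode
    private final List<String> transitions = new ArrayList<>();
    private boolean working = true; // assume we start out as the elected leader

    public StopOnDisconnectLeader(BooleanSupplier ownNodeStillExists, Runnable rejoinElection) {
        this.ownNodeStillExists = ownNodeStillExists;
        this.rejoinElection = rejoinElection;
    }

    /** Called from the session watcher with the translated connection state. */
    public synchronized void onStateChange(SessionState state) {
        switch (state) {
            case Disconnected:
                // We cannot tell yet whether the session will survive, so stop now.
                setWorking(false);
                break;
            case SyncConnected:
                // Same session is back; our ephemeral node should still be there.
                if (ownNodeStillExists.getAsBoolean()) {
                    setWorking(true);
                }
                break;
            case Expired:
                // Node is gone and another worker may already lead:
                // stay stopped and get back into the queue.
                setWorking(false);
                rejoinElection.run();
                break;
        }
    }

    private void setWorking(boolean nowWorking) {
        if (working != nowWorking) {
            transitions.add(nowWorking ? "start" : "stop");
        }
        working = nowWorking;
    }

    public synchronized boolean isWorking() { return working; }

    public synchronized List<String> transitions() { return new ArrayList<>(transitions); }

    public static void main(String[] args) {
        int[] rejoins = {0};
        StopOnDisconnectLeader leader =
                new StopOnDisconnectLeader(() -> true, () -> rejoins[0]++);
        leader.onStateChange(SessionState.Disconnected);  // brief network blip
        leader.onStateChange(SessionState.SyncConnected); // same session is back
        leader.onStateChange(SessionState.Disconnected);  // longer outage...
        leader.onStateChange(SessionState.Expired);       // ...session expired
        System.out.println(leader.transitions() + " working=" + leader.isWorking()
                + " rejoins=" + rejoins[0]);
    }
}
```

The cost Scott is worried about shows up in the transitions list: every millisecond-long blip produces a stop followed by a start, and in his connector each start means a minutes-long data transfer.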