Date: Mon, 14 Nov 2011 00:40:02 +0100
Subject: Missing session state handling in most Leader Election implementations
From: Jérémie BORDIER
To: user@zookeeper.apache.org

Hi Folks,

We have been playing around with ZooKeeper for a few weeks now, and while reading carefully through the documentation I noticed this statement:

    If you are using watches, you must look for the connected watch event. When a ZooKeeper client disconnects from a server, you will not receive notification of changes until reconnected. If you are watching for a znode to come into existence, you will miss the event if the znode is created and deleted while you are disconnected.

As noted in ZOOKEEPER-1209, this can cause really serious issues. As leader election is one of the most requested features / recipes, I would really like to see the official recipe fixed and fully functional.

I decided to take a look at other implementations of leader election and, surprisingly, none of them seemed to handle the Disconnected / Expired / SyncConnected events in a simple way. Here's my quick analysis of what they do, and I'd love to know whether I'm missing something or if they are really wrong:

Twitter commons library:
The Election recipe is based on their "Group" implementation, with EPHEMERAL|SEQUENTIAL nodes, in the same way as the official LES algorithm. Looking at the Group impl ( https://github.com/twitter/commons/blob/master/src/java/com/twitter/common/zookeeper/Group.java ), they handle the Expired event and retry to join / watch, going through a getClient() call that will recreate the connection if it has expired. This looks fine for the Expired event, but what about the Disconnected / SyncConnected events? Nothing.

Netflix's curator library:
Here the leader acquires an inter-process mutex, also backed by a group with EPHEMERAL|SEQUENTIAL nodes. Netflix's library has a big advantage: it has a built-in API for retrying actions, so leader election will try to acquire the lock, and retry if anything goes wrong in the middle. On any event, the loop waiting for the lock is notified and retries in case of failure, so a Disconnected or Expired event would be handled properly while waiting. On the other hand, it seems that once the leader has been elected, these events are simply ignored. This may lead to the same split-brain issue as in the original LES example (see https://github.com/Netflix/curator/blob/master/curator-recipes/src/main/java/com/netflix/curator/framework/recipes/locks/LockInternals.java for details).

That's all I have come up with so far. If you guys have the time to take a look at these implementations, I would love to know if I missed something.

So, I think this split-brain issue may almost never happen, but as usual, what should never happen hits you hard when you don't expect it. A robust leader election implementation would be really great to have.

Cheers,
Jérémie
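
To make the concern above concrete, here is a minimal sketch of the kind of session-state handling an elected leader would need. It is not taken from any of the libraries discussed and does not call any ZooKeeper API: the class `LeaderGuard`, the `Role` enum and all method names are hypothetical, and only the state names (Disconnected, SyncConnected, Expired) come from the events mentioned above. It assumes the usual ZooKeeper behaviour that an expired session loses its ephemeral nodes, while a session that reconnects before expiring keeps them.

```python
from enum import Enum


class Role(Enum):
    CANDIDATE = "candidate"  # taking part in the election, not leading
    LEADER = "leader"        # elected and allowed to act as leader
    SUSPENDED = "suspended"  # was leader, connection lost, must not act


class LeaderGuard:
    """Hypothetical helper that tracks whether acting as leader is safe."""

    def __init__(self):
        self.role = Role.CANDIDATE
        self.rejoin_count = 0  # how many times the election had to be rejoined

    def on_elected(self):
        # Called by the (not shown) election logic when this node wins.
        self.role = Role.LEADER

    def on_session_event(self, state):
        if state == "Disconnected":
            # While disconnected the client cannot know whether its session
            # will survive, so a leader stops acting until it finds out.
            if self.role is Role.LEADER:
                self.role = Role.SUSPENDED
        elif state == "SyncConnected":
            # Reconnected within the session timeout: the ephemeral node is
            # assumed to still exist, so a suspended leader may resume.
            if self.role is Role.SUSPENDED:
                self.role = Role.LEADER
        elif state == "Expired":
            # Session is gone and so is the ephemeral node: leadership is
            # lost and the election must be rejoined with a new session.
            self.role = Role.CANDIDATE
            self.rejoin_count += 1
        else:
            raise ValueError("unexpected session state: %r" % (state,))

    def may_act_as_leader(self):
        return self.role is Role.LEADER


guard = LeaderGuard()
guard.on_elected()
guard.on_session_event("Disconnected")
print(guard.may_act_as_leader())  # leader is suspended while disconnected
guard.on_session_event("SyncConnected")
print(guard.may_act_as_leader())  # session survived, leadership resumes
```

In a real client, a cautious implementation would probably also re-check that its own znode still exists before resuming, rather than trusting the reconnect alone.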