From: Simon
To: user@zookeeper.apache.org
Date: Wed, 9 Sep 2015 22:29:26 +0200
Subject: Work Sharing with Consistent Hashing

Hi

I have been reading a lot about ZooKeeper lately because we have the requirement to distribute our workflow engine (for performance and reliability reasons). Currently we have a single instance; if it fails, no work is done anymore. One approach mentioned on this mailing list and in online documents is a master-worker setup: the master is made reliable by running multiple instances and electing a leader. This setup might work unless there are too many events for a single master to dispatch. I am a bit concerned about that, because the current work dispatcher does no IO and is blazingly fast. If the work dispatcher (scheduler) now has to communicate over the network, it might become too slow. So I thought about a different solution and would like to hear what you think about it.

Let’s say each workflow engine instance (referred to as a node from now on) registers an ephemeral znode under /nodes. Each node installs a children watch on /nodes and uses this information to populate a hash ring. Each node is only responsible for the workflow instances that map to its part of the hash ring. Event notifications would then be dispatched to the correct node based on the same hash ring. The persistence of workflow instances and events would still be done in a highly available database; only notifications about events would be dispatched. Whenever an event notification is received, the corresponding workflow instance is dispatched to a worker thread within the node.
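To make this more concrete, here is a minimal sketch of the membership and ring-lookup part in Java against the plain ZooKeeper API. It is only an illustration of the idea above: the /nodes layout matches what I described, but the class name, the virtual-node count, the hash function and the assumption that /nodes already exists are all just placeholders, and connection/error handling is left out.

import org.apache.zookeeper.*;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRingMembership implements Watcher {

    private static final String NODES_PATH = "/nodes"; // assumed to exist already
    private static final int VIRTUAL_NODES = 100;      // virtual points per node to smooth the ring

    private final ZooKeeper zk;
    private final String nodeId;
    private volatile SortedMap<Long, String> ring = new TreeMap<>(); // ring position -> node id

    public HashRingMembership(String connectString, String nodeId) throws Exception {
        this.nodeId = nodeId;
        // waiting for the session to actually connect is omitted here
        this.zk = new ZooKeeper(connectString, 30000, this);
        // register this instance; the ephemeral znode disappears when our session expires
        zk.create(NODES_PATH + "/" + nodeId, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        rebuildRing();
    }

    @Override
    public void process(WatchedEvent event) {
        // children of /nodes changed -> a node joined or a session expired
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                rebuildRing();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void rebuildRing() throws KeeperException, InterruptedException {
        // getChildren with "this" re-arms the children watch while reading the membership
        List<String> nodes = zk.getChildren(NODES_PATH, this);
        SortedMap<Long, String> newRing = new TreeMap<>();
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                newRing.put(hash(node + "#" + i), node);
            }
        }
        ring = newRing;
    }

    // which node is responsible for the given workflow instance?
    public String ownerOf(String workflowInstanceId) {
        SortedMap<Long, String> r = ring;
        SortedMap<Long, String> tail = r.tailMap(hash(workflowInstanceId));
        return tail.isEmpty() ? r.get(r.firstKey()) : r.get(tail.firstKey());
    }

    public boolean isMine(String workflowInstanceId) {
        return nodeId.equals(ownerOf(workflowInstanceId));
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {   // first 8 digest bytes as the ring position
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

The idea is that a node only picks up a workflow instance when isMine(instanceId) returns true, and re-evaluates that after every change to /nodes.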
The interesting cases arise when a new node is added to the cluster or when a workflow engine instance fails. Let’s talk about failures first. Since processing a workflow instance takes some time, we cannot simply switch over to a new instance right away. After all, the previous node might still be running (i.e. it did not crash, it just failed to heartbeat). We would have to wait some time (how long?) until we know that the failing node has dropped the work it was doing (it gets notified of the session expiration). The failing node cannot communicate with ZooKeeper, so it has no way of telling us that it is done. We also cannot use the central database to find out what happened: if a node fails (e.g. due to a hardware or network failure) it cannot update the database, so the database looks the same whether the node is still working or has crashed. It must be impossible for two workflow engines to work on the same workflow instance; that would result in duplicate messages being sent out to backend systems (not a good idea).

The same issue arises when adding a node. The new node has to wait until the other nodes have stopped working on the instances it takes over. I could use ZooKeeper in this case to lock workflow instances: a node may only start working on a workflow instance if it can acquire the lock. Whenever a node is added, it will look up pending workflow instances in the database (those that have open events). A rough sketch of such a lock is in the P.S. below.

To summarize: instead of having a central master that coordinates all work, I would like to divide the work “space” into segments using consistent hashing. Each node is an equal peer and can operate freely within its assigned part of the work “space”.

What do you think about that setup? Is it something completely stupid? If yes, let me know why. Has somebody done something similar successfully? What would I have to watch out for? What would I need to add to the basic setup described above?

Regards,
Simon
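P.S. For the per-instance lock I picture something along these lines. Again just a sketch against the plain ZooKeeper API and not the sequential-znode lock recipe from the docs: a single ephemeral znode per workflow instance, with a made-up /locks path that is assumed to exist.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkflowInstanceLock {

    private final ZooKeeper zk;

    public WorkflowInstanceLock(ZooKeeper zk) {
        this.zk = zk;
    }

    // try to claim a workflow instance; returns false if another node holds it
    public boolean tryLock(String workflowInstanceId)
            throws KeeperException, InterruptedException {
        try {
            // ephemeral: the lock goes away automatically when our session expires,
            // which is exactly the "failed node eventually drops its work" behaviour
            zk.create("/locks/" + workflowInstanceId, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false; // someone else is working on this instance
        }
    }

    // release the claim once processing of the instance is finished
    public void unlock(String workflowInstanceId)
            throws KeeperException, InterruptedException {
        zk.delete("/locks/" + workflowInstanceId, -1);
    }
}

A node would only start processing an instance that maps to its ring segment after tryLock() succeeds; at least that is my current thinking for both the failover and the node-addition case.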