From: Simon
To: user@zookeeper.apache.org
Date: Wed, 9 Sep 2015 22:29:26 +0200
Subject: Work Sharing with Consistent Hashing

Hi

I have been reading a lot about ZooKeeper lately because we have the requirement to distribute our workflow engine (for performance and reliability reasons). Currently we have a single instance; if it fails, no work is done anymore. One approach mentioned on this mailing list and in online documents is a master-worker setup: the master is made reliable by running multiple instances and electing a leader. This setup might work unless there are too many events for a single master to dispatch. I am a bit concerned about that, because the current work dispatcher does no IO and is blazingly fast. If the work dispatcher (scheduler) now has to communicate over the network, it might become too slow. So I thought about a different solution and would like to hear what you think about it.

Let’s say each workflow engine instance (referred to as a node from now on) registers an ephemeral znode under /nodes. Each node installs a children watch on /nodes and uses this information to populate a hash ring. Each node is only responsible for the workflow instances that map to its part of the hash ring. Event notifications would then be dispatched to the correct node based on the same hash ring. The persistence of workflow instances and events would still be done in a highly available database; only notifications about events would be dispatched. Whenever an event notification is received, the corresponding workflow instance is dispatched to a worker thread within the node.
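To make this more concrete, here is a minimal sketch of the membership and ring-lookup part in Java against the plain ZooKeeper API. It is only an illustration of the idea above: the /nodes layout matches what I described, but the class name, the virtual-node count, the hash function and the assumption that /nodes already exists are all just placeholders, and connection/error handling is left out.

import org.apache.zookeeper.*;

import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.util.List;
import java.util.SortedMap;
import java.util.TreeMap;

public class HashRingMembership implements Watcher {

    private static final String NODES_PATH = "/nodes"; // assumed to exist already
    private static final int VIRTUAL_NODES = 100;      // virtual points per node to smooth the ring

    private final ZooKeeper zk;
    private final String nodeId;
    private volatile SortedMap<Long, String> ring = new TreeMap<>(); // ring position -> node id

    public HashRingMembership(String connectString, String nodeId) throws Exception {
        this.nodeId = nodeId;
        // waiting for the session to actually connect is omitted here
        this.zk = new ZooKeeper(connectString, 30000, this);
        // register this instance; the ephemeral znode disappears when our session expires
        zk.create(NODES_PATH + "/" + nodeId, new byte[0],
                ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
        rebuildRing();
    }

    @Override
    public void process(WatchedEvent event) {
        // children of /nodes changed -> a node joined or a session expired
        if (event.getType() == Event.EventType.NodeChildrenChanged) {
            try {
                rebuildRing();
            } catch (Exception e) {
                e.printStackTrace();
            }
        }
    }

    private void rebuildRing() throws KeeperException, InterruptedException {
        // getChildren with "this" re-arms the children watch while reading the membership
        List<String> nodes = zk.getChildren(NODES_PATH, this);
        SortedMap<Long, String> newRing = new TreeMap<>();
        for (String node : nodes) {
            for (int i = 0; i < VIRTUAL_NODES; i++) {
                newRing.put(hash(node + "#" + i), node);
            }
        }
        ring = newRing;
    }

    // which node is responsible for the given workflow instance?
    public String ownerOf(String workflowInstanceId) {
        SortedMap<Long, String> r = ring;
        SortedMap<Long, String> tail = r.tailMap(hash(workflowInstanceId));
        return tail.isEmpty() ? r.get(r.firstKey()) : r.get(tail.firstKey());
    }

    public boolean isMine(String workflowInstanceId) {
        return nodeId.equals(ownerOf(workflowInstanceId));
    }

    private static long hash(String key) {
        try {
            byte[] d = MessageDigest.getInstance("MD5")
                    .digest(key.getBytes(StandardCharsets.UTF_8));
            long h = 0;
            for (int i = 0; i < 8; i++) {   // first 8 digest bytes as the ring position
                h = (h << 8) | (d[i] & 0xff);
            }
            return h;
        } catch (Exception e) {
            throw new RuntimeException(e);
        }
    }
}

The idea is that a node only picks up a workflow instance when isMine(instanceId) returns true, and re-evaluates that after every change to /nodes.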
The interesting cases arise when a new node is added to the cluster or when a workflow engine instance fails. Let’s talk about failures first. Since processing a workflow instance takes some time, we cannot simply switch over to a new instance right away. After all, the previous node might still be running (i.e. it did not crash, it just failed to heartbeat). We would have to wait some time (how long?) until we know that the failing node has dropped the work it was doing (it gets notified of the session expiration). The failing node cannot communicate with ZooKeeper, so it has no way of telling us that it is done. We also cannot use the central database to find out what happened: if a node fails (e.g. due to a hardware or network failure) it cannot update the database, so the database looks the same whether the node is still working or has crashed. It must be impossible for two workflow engines to work on the same workflow instance; that would result in duplicate messages being sent out to backend systems (not a good idea).

The same issue arises when adding a node. The new node has to wait until the other nodes have stopped working on the instances it takes over. I could use ZooKeeper in this case to lock workflow instances: a node may only start working on a workflow instance if it can acquire the lock. Whenever a node is added, it will look up pending workflow instances in the database (those that have open events). A rough sketch of such a lock is in the P.S. below.

To summarize: instead of having a central master that coordinates all work, I would like to divide the work “space” into segments using consistent hashing. Each node is an equal peer and can operate freely within its assigned part of the work “space”.

What do you think about that setup? Is it something completely stupid? If yes, let me know why. Has somebody done something similar successfully? What would I have to watch out for? What would I need to add to the basic setup described above?

Regards,
Simon
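P.S. For the per-instance lock I picture something along these lines. Again just a sketch against the plain ZooKeeper API and not the sequential-znode lock recipe from the docs: a single ephemeral znode per workflow instance, with a made-up /locks path that is assumed to exist.

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;
import org.apache.zookeeper.ZooDefs;
import org.apache.zookeeper.ZooKeeper;

public class WorkflowInstanceLock {

    private final ZooKeeper zk;

    public WorkflowInstanceLock(ZooKeeper zk) {
        this.zk = zk;
    }

    // try to claim a workflow instance; returns false if another node holds it
    public boolean tryLock(String workflowInstanceId)
            throws KeeperException, InterruptedException {
        try {
            // ephemeral: the lock goes away automatically when our session expires,
            // which is exactly the "failed node eventually drops its work" behaviour
            zk.create("/locks/" + workflowInstanceId, new byte[0],
                    ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.EPHEMERAL);
            return true;
        } catch (KeeperException.NodeExistsException e) {
            return false; // someone else is working on this instance
        }
    }

    // release the claim once processing of the instance is finished
    public void unlock(String workflowInstanceId)
            throws KeeperException, InterruptedException {
        zk.delete("/locks/" + workflowInstanceId, -1);
    }
}

A node would only start processing an instance that maps to its ring segment after tryLock() succeeds; at least that is my current thinking for both the failover and the node-addition case.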