zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon <co...@gmx.ch>
Subject Work Sharing with Consistent Hashing
Date Wed, 09 Sep 2015 20:29:26 GMT

I have been reading a lot about Zookeeper lately because we have the requirement to distribute
our workflow engine (for performance and reliability reasons). Currently, we have a single
instance. If it fails, no work is done anymore. One approach mentioned in this mailing list
and online documents is a master - worker setup. The master is made reliable by using multiple
instances and electing a leader. This setup might work unless there are too many events to
be dispatched by a single master. I am a bit concerned about that as currently the work dispatcher
does no IO. It is blazing fast. If the work dispatcher (scheduler) now has to communicate
over the network it might get too slow. So I thought about a different solution and would
like to hear what you think about it.

Let’s say each workflow engine instance (referred as node from now on) registers an ephemeral
znode under /nodes. Each node installs a children watch on /nodes. The nodes uses this information
to populate a hashring. Each node is only responsible for workflow instances that map to their
corresponding part of the hashring. Event notifications would then be dispatched to the correct
node based on the same hashring. The persistence of workflow instances and events would still
be done in a highly available database. Only notifications about events would be dispatched.
Whenever an event notification is received a workflow instance has to be dispatched to a given
worker thread within a node. 

The interesting cases happen if a new node is added to the cluster or if a workflow engine
instance fails. Let’s talk about failures first. As processing a workflow instance takes
some time we cannot simply switch over to a new instance right away. After all, the previous
node might still be running (i.e. it did not crash, just failed to heartbeat). We would have
to wait some time (how long?!) until we know that the failing node has dropped the work it
was doing (it gets notified of the session expiration). The failing node cannot communicate
with Zookeeper so it has no way of telling that it was done. We also cannot use the central
database to find out what happened. If a node fails (e.g. due to hardware failure or network
failure) it cannot update the database. The database will look the same regardless of whether
it is working or has crashed. It must be impossible that two workflow engines work on the
same workflow instance. This would result in duplicate messages being sent out to backend
systems (not a good idea). 

The same issue arrises with adding a node. The new node has to wait until the other nodes
have stopped working. Ok, I could use Zookeeper in this case to lock workflow instances. Only
if the lock is available a node can start working on an workflow instance. Whenever a node
is added it will look up pending workflow instances in the database (those that have open

To summarize: instead of having a central master that coordinates all work I would like to
divide the work “space” into segments by using consistent hashing. Each node is an equal
peer and can operate freely within its assigned work “space”. 

What do you think about that setup? Is it something completely stupid?! If yes, let me know
why. Has somebody done something similar successfully? What would I have to watch out for?
What would I need to add to the basic setup mentioned above? 

View raw message