zookeeper-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Simon <co...@gmx.ch>
Subject Re: Work Sharing with Consistent Hashing
Date Wed, 09 Sep 2015 21:22:13 GMT
I did read that thread. However, from a quick glance it does not look like to work the way
I described below. Tasks are submitted to Zookeeper and all the state is kept in Zookeeper
(workflow instances are way bigger than what is sensible to keep in Zookeeper). We will want
to keep all the state in another data store. And, NirmataOSS is not employing consistent hashing
for work distribution if I have not missed that somewhere in the codebase.

I am not looking for a product that solves my problem but rather to start a discussion about
how to implement what I have described below with Zookeeper. 

> On 09 Sep 2015, at 22:49 , John Sirois <john.sirois@gmail.com> wrote:
> You may have missed this thread:
> http://mail-archives.apache.org/mod_mbox/zookeeper-user/201509.mbox/%3CCAF+A=7rhAMcGxewhPR0n+z5-RwVxSLxd1ghTTH7CP3ozMVgdRg@mail.gmail.com%3E
> What you describe is implemented, perhaps in different in some details,
> here: http://nirmataoss.github.io/workflow/
> On Wed, Sep 9, 2015 at 2:29 PM, Simon <cocoa@gmx.ch> wrote:
>> Hi
>> I have been reading a lot about Zookeeper lately because we have the
>> requirement to distribute our workflow engine (for performance and
>> reliability reasons). Currently, we have a single instance. If it fails, no
>> work is done anymore. One approach mentioned in this mailing list and
>> online documents is a master - worker setup. The master is made reliable by
>> using multiple instances and electing a leader. This setup might work
>> unless there are too many events to be dispatched by a single master. I am
>> a bit concerned about that as currently the work dispatcher does no IO. It
>> is blazing fast. If the work dispatcher (scheduler) now has to communicate
>> over the network it might get too slow. So I thought about a different
>> solution and would like to hear what you think about it.
>> Let’s say each workflow engine instance (referred as node from now on)
>> registers an ephemeral znode under /nodes. Each node installs a children
>> watch on /nodes. The nodes uses this information to populate a hashring.
>> Each node is only responsible for workflow instances that map to their
>> corresponding part of the hashring. Event notifications would then be
>> dispatched to the correct node based on the same hashring. The persistence
>> of workflow instances and events would still be done in a highly available
>> database. Only notifications about events would be dispatched. Whenever an
>> event notification is received a workflow instance has to be dispatched to
>> a given worker thread within a node.
>> The interesting cases happen if a new node is added to the cluster or if a
>> workflow engine instance fails. Let’s talk about failures first. As
>> processing a workflow instance takes some time we cannot simply switch over
>> to a new instance right away. After all, the previous node might still be
>> running (i.e. it did not crash, just failed to heartbeat). We would have to
>> wait some time (how long?!) until we know that the failing node has dropped
>> the work it was doing (it gets notified of the session expiration). The
>> failing node cannot communicate with Zookeeper so it has no way of telling
>> that it was done. We also cannot use the central database to find out what
>> happened. If a node fails (e.g. due to hardware failure or network failure)
>> it cannot update the database. The database will look the same regardless
>> of whether it is working or has crashed. It must be impossible that two
>> workflow engines work on the same workflow instance. This would result in
>> duplicate messages being sent out to backend systems (not a good idea).
>> The same issue arrises with adding a node. The new node has to wait until
>> the other nodes have stopped working. Ok, I could use Zookeeper in this
>> case to lock workflow instances. Only if the lock is available a node can
>> start working on an workflow instance. Whenever a node is added it will
>> look up pending workflow instances in the database (those that have open
>> events).
>> To summarize: instead of having a central master that coordinates all work
>> I would like to divide the work “space” into segments by using consistent
>> hashing. Each node is an equal peer and can operate freely within its
>> assigned work “space”.
>> What do you think about that setup? Is it something completely stupid?! If
>> yes, let me know why. Has somebody done something similar successfully?
>> What would I have to watch out for? What would I need to add to the basic
>> setup mentioned above?
>> Regards,
>> Simon

View raw message