From: Jordan Zimmerman
To: "user@zookeeper.apache.org"
Subject: Re: Use cases for ZooKeeper
Date: Fri, 13 Jan 2012 04:01:54 +0000

Sure - give me what you have and I'll port it to Curator.

On 1/12/12 6:18 PM, "Ted Dunning" wrote:

>I think I have a bit of it written already.
>
>It doesn't use Curator and I think you could simplify it substantially if
>you were to use it. Would that help?
>
>On Thu, Jan 12, 2012 at 12:52 PM, Jordan Zimmerman wrote:
>
>> Ted - are you interested in writing this on top of Curator? If not, I'll
>> give it a whack.
>>
>> -JZ
>>
>> On 1/5/12 12:50 AM, "Ted Dunning" wrote:
>>
>> >Jordan, I don't think that leader election does what Josh wants.
>> >
>> >I don't think that consistent hashing is particularly good for that
>> >either, because the loss of one node causes the sequential state for
>> >lots of entities to move even among nodes that did not fail.
>> >
>> >What I would recommend is a variant of micro-sharding. The key space is
>> >divided into many micro-shards. Then nodes that are alive claim the
>> >micro-shards using ephemerals and proceed as Josh described. On loss of
>> >a node, the shards that node was handling should be claimed by the
>> >remaining nodes. When a new node appears or new work appears, it is
>> >helpful to direct nodes to effect a hand-off of traffic.
>> >
>> >In my experience, the best way to implement shard balancing is with an
>> >external master instance, much in the style of hbase or katta. This
>> >external master can be exceedingly simple and only needs to wake up on
>> >various events like loss of a node or change in the set of live shards.
>> >It can also wake up at intervals if desired to backstop the normal
>> >notifications or to allow small changes for certain kinds of balancing.
>> >Typically, this only requires a few hundred lines of code.
>> >
>> >This external master can, of course, be run on multiple nodes, and which
>> >master is in current control can be adjudicated with yet another leader
>> >election.
>> >
>> >You can view this as a package of many leader elections, or as
>> >discretized consistent hashing. The distinctions are a bit subtle but
>> >are very important. These include:
>> >
>> >- there is a clean division of control between the master, which
>> >determines who serves what, and the nodes that do the serving
>> >
>> >- there is no herd effect because the master drives the assignments
>> >
>> >- node loss causes the minimum amount of change of assignments, since no
>> >assignments to surviving nodes are disturbed. This is a major win.
>> >
>> >- balancing is pretty good because there are many shards compared to the
>> >number of nodes.
>> >
>> >- the balancing strategy is highly pluggable.
>> >
>> >This pattern would make a nice addition to Curator, actually. It comes
>> >up repeatedly in different contexts.
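A rough sketch of the claim-with-ephemerals step Ted describes above, assuming
the Netflix-era Curator client (com.netflix.curator packages); the shard count,
base path, and class name are illustrative, and the balancing master that
directs hand-offs is left out. Each live node greedily claims unowned shards by
creating ephemeral znodes, so a dead node's claims disappear with its session:

import java.util.ArrayList;
import java.util.List;

import org.apache.zookeeper.CreateMode;
import org.apache.zookeeper.KeeperException;

import com.netflix.curator.framework.CuratorFramework;
import com.netflix.curator.framework.CuratorFrameworkFactory;
import com.netflix.curator.retry.ExponentialBackoffRetry;

public class ShardClaimer
{
    private static final int NUM_SHARDS = 256;            // many shards vs. few nodes
    private static final String SHARDS_PATH = "/shards";  // hypothetical base path

    private final CuratorFramework client;
    private final String nodeId;

    public ShardClaimer(String connectString, String nodeId) throws Exception
    {
        this.client = CuratorFrameworkFactory.newClient(connectString,
            new ExponentialBackoffRetry(1000, 3));
        this.nodeId = nodeId;
        client.start();
        try
        {
            client.create().forPath(SHARDS_PATH);   // make sure the parent exists
        }
        catch ( KeeperException.NodeExistsException ignore )
        {
            // already created by another node
        }
    }

    // Try to claim every unowned shard by creating an ephemeral znode per shard.
    // If this node's session dies, its claims vanish and the shards become
    // available to the surviving nodes (or to a balancing master).
    public List<Integer> claimShards() throws Exception
    {
        List<Integer> owned = new ArrayList<Integer>();
        for ( int shard = 0; shard < NUM_SHARDS; shard++ )
        {
            try
            {
                client.create()
                      .withMode(CreateMode.EPHEMERAL)
                      .forPath(SHARDS_PATH + "/" + shard, nodeId.getBytes());
                owned.add(shard);
            }
            catch ( KeeperException.NodeExistsException e )
            {
                // another node already owns this shard
            }
        }
        return owned;
    }
}

In the master-driven variant Ted recommends, the master would watch the
children of /shards and tell specific nodes which shards to claim or hand off,
rather than letting every node grab whatever is free.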
>> >
>> >On Thu, Jan 5, 2012 at 12:11 AM, Jordan Zimmerman wrote:
>> >
>> >> OK - so this is two options for doing the same thing. You use a Leader
>> >> Election algorithm to make sure that only one node in the cluster is
>> >> operating on a work unit. Curator has an implementation (it's really
>> >> just a distributed lock with a slightly different API).
>> >>
>> >> -JZ
>> >>
>> >> On 1/5/12 12:04 AM, "Josh Stone" wrote:
>> >>
>> >> >Thanks for the response. Comments below:
>> >> >
>> >> >On Wed, Jan 4, 2012 at 10:46 PM, Jordan Zimmerman wrote:
>> >> >
>> >> >> Hi Josh,
>> >> >>
>> >> >> >Second use case: Distributed locking
>> >> >> This is one of the most common uses of ZooKeeper. There are many
>> >> >> implementations - one included with the ZK distro. Also, there is
>> >> >> Curator: https://github.com/Netflix/curator
>> >> >>
>> >> >> >First use case: Distributing work to a cluster of nodes
>> >> >> This sounds feasible. If you give more details, I and others on this
>> >> >> list can help more.
>> >> >>
>> >> >
>> >> >Sure. I basically want to handle race conditions where two commands
>> >> >that operate on the same data are received by my cluster of nodes,
>> >> >concurrently. One approach is to lock on the data that is affected by
>> >> >the command (distributed lock). Another approach is to make sure that
>> >> >all of the commands that operate on any set of data are routed to the
>> >> >same node, where they can be processed serially using local
>> >> >synchronization. Consistent hashing is an algorithm that can be used
>> >> >to select a node to handle a message (where the inputs are the key to
>> >> >hash and the number of nodes in the cluster).
>> >> >
>> >> >There are various implementations for this floating around. I'm just
>> >> >interested to know how this is working for anyone else.
>> >> >
>> >> >Josh
>> >> >
>> >> >> -JZ
>> >> >>
>> >> >> ________________________________________
>> >> >> From: Josh Stone [pacesysjosh@gmail.com]
>> >> >> Sent: Wednesday, January 04, 2012 8:09 PM
>> >> >> To: user@zookeeper.apache.org
>> >> >> Subject: Use cases for ZooKeeper
>> >> >>
>> >> >> I have a few use cases that I'm wondering if ZooKeeper would be
>> >> >> suitable for and would appreciate some feedback.
>> >> >>
>> >> >> First use case: Distributing work to a cluster of nodes using
>> >> >> consistent hashing to ensure that messages of some type are
>> >> >> consistently handled by the same node. I haven't been able to find
>> >> >> any info about ZooKeeper + consistent hashing. Is anyone using it
>> >> >> for this? A concern here would be how to redistribute work as nodes
>> >> >> come and go from the cluster.
>> >> >>
>> >> >> Second use case: Distributed locking. I noticed that there's a
>> >> >> recipe for this on the ZooKeeper wiki. Is anyone doing this? Any
>> >> >> issues? One concern would be how to handle orphaned locks if a node
>> >> >> that obtained a lock goes down.
>> >> >>
>> >> >> Third use case: Fault tolerance. If we utilized ZooKeeper to
>> >> >> distribute messages to workers, can it be made to handle a node
>> >> >> going down by re-distributing the work to another node (perhaps
>> >> >> messages that are not ack'ed within a timeout are resent)?
>> >> >>
>> >> >> Cheers,
>> >> >> Josh
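For the distributed locking use case, a minimal sketch against Curator's lock
recipe (InterProcessMutex), again assuming the Netflix-era package names; the
connect string and lock path are placeholders. Because the lock is backed by an
ephemeral znode, a holder that crashes or loses its ZooKeeper session releases
the lock automatically, which covers the orphaned-lock concern:

import java.util.concurrent.TimeUnit;

import com.netflix.curator.framework.CuratorFramework;
import com.netflix.curator.framework.CuratorFrameworkFactory;
import com.netflix.curator.framework.recipes.locks.InterProcessMutex;
import com.netflix.curator.retry.ExponentialBackoffRetry;

public class LockExample
{
    public static void main(String[] args) throws Exception
    {
        CuratorFramework client = CuratorFrameworkFactory.newClient(
            "localhost:2181", new ExponentialBackoffRetry(1000, 3));
        client.start();

        // One lock path per piece of data being protected (placeholder path)
        InterProcessMutex lock = new InterProcessMutex(client, "/locks/my-data-key");

        // Bounded wait so a caller is not blocked forever if the lock is busy
        if ( lock.acquire(10, TimeUnit.SECONDS) )
        {
            try
            {
                // operate on the shared data while holding the lock
            }
            finally
            {
                lock.release();
            }
        }
        client.close();
    }
}

An untimed acquire() is also available if blocking indefinitely is acceptable.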