zookeeper-user mailing list archives

From: Jordan Zimmerman <jzimmer...@netflix.com>
Subject: Re: Locks based on ephemeral nodes - Handling network outage correctly
Date: Fri, 14 Oct 2011 16:51:19 GMT
Thanks - just double checking that I didn't make a mistake.

On 10/14/11 9:48 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:

>Correct.  So you can't actually implement correct retry logic for some
>error conditions.
>
>On Fri, Oct 14, 2011 at 9:44 AM, Jordan Zimmerman <jzimmerman@netflix.com>
>wrote:
>
>> True. But, it wouldn't be possible to get KeeperException.Code.NODEEXISTS
>> for sequential files, right?
>>
>> -JZ
>>
>> On 10/14/11 9:41 AM, "Ted Dunning" <ted.dunning@gmail.com> wrote:
>>
>> >Yes.  That works fine with idempotent operations like creating a
>> >non-sequential file.
>> >
>> >Of course, it doesn't work with sequential files since you don't know who
>> >created any other znodes out there.
>> >
>> >On Fri, Oct 14, 2011 at 9:39 AM, Jordan Zimmerman <jzimmerman@netflix.com>
>> >wrote:
>> >
>> >> FYI - Curator checks for KeeperException.Code.NODEEXISTS in its retry
>> >> loop and just ignores it, treating it as a success. I'm not sure if
>> >> other libraries do that. So, this is a case where a disconnection can
>> >> be handled generically.
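>> >>
>> >> [For illustration only, a rough sketch of that pattern against the plain
>> >> ZooKeeper Java API; the path, data, and method name are made up, and the
>> >> zk handle is assumed to be already connected:]
>> >>
>> >>     import org.apache.zookeeper.*;
>> >>
>> >>     void createIgnoringNodeExists(ZooKeeper zk)
>> >>             throws KeeperException, InterruptedException {
>> >>         try {
>> >>             zk.create("/locks/my-lock", new byte[0],
>> >>                       ZooDefs.Ids.OPEN_ACL_UNSAFE, CreateMode.PERSISTENT);
>> >>         } catch (KeeperException.NodeExistsException e) {
>> >>             // A create that appeared to fail with a connection loss may in
>> >>             // fact have been applied; for a non-sequential create, seeing
>> >>             // NODEEXISTS on the retry can be treated as success.
>> >>         }
>> >>     }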
>> >>
>> >> -JZ
>> >>
>> >> On 10/14/11 7:20 AM, "Fournier, Camille F." <Camille.Fournier@gs.com>
>> >> wrote:
>> >>
>> >> >Pretty much all of the Java client wrappers out there in the wild have
>> >> >some sort of retry loop around operations, to make some of this easier
>> >> >to deal with. But they don't, to my knowledge, deal with the situation
>> >> >of knowing whether an operation succeeded in the case of a disconnect
>> >> >(it is possible to push out a request and get a disconnect back before
>> >> >you get a response for that request, so you don't know whether your
>> >> >request succeeded or failed). So you may end up, for example, writing
>> >> >something twice in the case of writing a SEQUENTIAL-type node. For many
>> >> >use cases of sequential, this isn't a big deal.
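>> >> >
>> >> >[One known way to make a SEQUENTIAL create safe to retry is to embed a
>> >> >unique id in the node name and, on a connection loss, scan the children
>> >> >to see whether the earlier attempt actually landed. A rough sketch, with
>> >> >made-up paths and names:]
>> >> >
>> >> >    import org.apache.zookeeper.*;
>> >> >    import java.util.UUID;
>> >> >
>> >> >    String createSequentialSafely(ZooKeeper zk, String parent)
>> >> >            throws KeeperException, InterruptedException {
>> >> >        String id = UUID.randomUUID().toString();
>> >> >        String prefix = parent + "/lock-" + id + "-";
>> >> >        while (true) {
>> >> >            try {
>> >> >                return zk.create(prefix, new byte[0],
>> >> >                                 ZooDefs.Ids.OPEN_ACL_UNSAFE,
>> >> >                                 CreateMode.PERSISTENT_SEQUENTIAL);
>> >> >            } catch (KeeperException.ConnectionLossException e) {
>> >> >                // The create may or may not have been applied: look for
>> >> >                // a child carrying our id before retrying. (This call can
>> >> >                // itself fail while still disconnected, in which case the
>> >> >                // exception simply propagates in this sketch.)
>> >> >                for (String child : zk.getChildren(parent, false)) {
>> >> >                    if (child.contains(id)) {
>> >> >                        return parent + "/" + child;
>> >> >                    }
>> >> >                }
>> >> >            }
>> >> >        }
>> >> >    }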
>> >> >
>> >> >I don't know of anything that handles this in a more subtle way than
>> >> >simply retrying. As Ted has mentioned in earlier emails on the subject,
>> >> >"You can't just assume that you can retry an operation on Zookeeper and
>> >> >get the right result.  The correct handling is considerably more subtle.
>> >> >Hiding that is not a good thing unless you say right up front that you
>> >> >are compromising either expressivity (as does Kept Collections) or
>> >> >correctness (as does zkClient)."
>> >> >
>> >> >It's not clear to me that it is possible to write a generic client to
>> >> >"correctly" handle retries on disconnect, because what correct means
>> >> >varies from use case to use case. One of the challenges, I think, in
>> >> >getting comfortable with using ZK is knowing the correctness bounds for
>> >> >your particular use case and understanding the failure scenarios wrt
>> >> >that use case and ZK.
>> >> >
>> >> >C
>> >> >
>> >> >
>> >> >-----Original Message-----
>> >> >From: Mike Schilli [mailto:m@perlmeister.com]
>> >> >Sent: Thursday, October 13, 2011 9:27 PM
>> >> >To: user@zookeeper.apache.org
>> >> >Subject: Re: Locks based on ephemeral nodes - Handling network outage
>> >> >correctly
>> >> >
>> >> >On Wed, 12 Oct 2011, Ted Dunning wrote:
>> >> >
>> >> >> ZK will tell you when the connection is lost (but not yet expired).
>> >> >> When this happens, the application needs to pay attention and pause
>> >> >> before continuing to assume it still has the lock.
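>> >> >>
>> >> >> [An illustration of tracking that state with a plain Watcher; the
>> >> >> class and flag names are invented for this sketch:]
>> >> >>
>> >> >>     import org.apache.zookeeper.*;
>> >> >>     import java.util.concurrent.atomic.AtomicBoolean;
>> >> >>
>> >> >>     class ConnectionStateWatcher implements Watcher {
>> >> >>         final AtomicBoolean connected = new AtomicBoolean(false);
>> >> >>         final AtomicBoolean sessionExpired = new AtomicBoolean(false);
>> >> >>
>> >> >>         @Override
>> >> >>         public void process(WatchedEvent event) {
>> >> >>             switch (event.getState()) {
>> >> >>                 case SyncConnected:
>> >> >>                     connected.set(true);  // session survived: safe to resume
>> >> >>                     break;
>> >> >>                 case Disconnected:
>> >> >>                     // Unknown whether the session (and any ephemeral lock
>> >> >>                     // node) survives: pause lock-protected work for now.
>> >> >>                     connected.set(false);
>> >> >>                     break;
>> >> >>                 case Expired:
>> >> >>                     connected.set(false);
>> >> >>                     sessionExpired.set(true); // the lock is definitely gone
>> >> >>                     break;
>> >> >>                 default:
>> >> >>                     break;
>> >> >>             }
>> >> >>         }
>> >> >>     }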
>> >> >
>> >> >I think this applies to every write operation in ZooKeeper, which I
>> >> >find is a challenge to deal with.
>> >> >
>> >> >So basically, every time an application writes something to ZooKeeper,
>> >> >it needs to check the result, but what to do if it fails? Check if it's
>> >> >an error indicating the connection was lost, and try a couple of times
>> >> >to reinstate the connection and replay the write? At least, that's what
>> >> >the documentation of the Perl wrapper Net::ZooKeeper suggests.
>> >> >
>> >> >Are there best practices around this, or, better yet, a client API that
>> >> >actually implements this, so the application doesn't have to implement
>> >> >a ZooKeeper wrapper? Something like "retry 3 times with 10 second waits
>> >> >in between and fail otherwise".
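>> >> >
>> >> >[For illustration, a hand-rolled sketch of that kind of fixed retry
>> >> >policy against the plain Java API; the method name and the numbers are
>> >> >invented, and as discussed above a blind retry like this is only safe
>> >> >for idempotent operations:]
>> >> >
>> >> >    import org.apache.zookeeper.*;
>> >> >
>> >> >    static void setDataWithRetry(ZooKeeper zk, String path, byte[] data)
>> >> >            throws KeeperException, InterruptedException {
>> >> >        final int maxAttempts = 3;
>> >> >        final long waitMillis = 10_000;
>> >> >        KeeperException last = null;
>> >> >        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
>> >> >            try {
>> >> >                zk.setData(path, data, -1); // -1 = any version
>> >> >                return;
>> >> >            } catch (KeeperException.ConnectionLossException e) {
>> >> >                last = e;
>> >> >                Thread.sleep(waitMillis); // wait, then try again
>> >> >            }
>> >> >        }
>> >> >        throw last; // give up after the configured number of attempts
>> >> >    }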
>> >> >
>> >> >--
>> >> >Mike Schilli
>> >> >m@perlmeister.com
>> >> >
>> >> >
>> >> >
>> >> >>
>> >> >> 2011/10/12 Frédéric Jolliton <frederic@jolliton.com>
>> >> >>
>> >> >>> Hello all,
>> >> >>>
>> >> >>> There is something that bothers me about ephemeral nodes.
>> >> >>>
>> >> >>> I need to create some locks using Zookeeper. I followed the
>> >> >>> "official" recipe, except that I don't use the EPHEMERAL flag. The
>> >> >>> reason for that is that I don't know how I should proceed if the
>> >> >>> connection to the Zookeeper ensemble is ever lost. But otherwise,
>> >> >>> everything works nicely.
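>> >> >>>
>> >> >>> [A condensed sketch of that recipe as commonly described (sequential
>> >> >>> child under a lock parent, lowest sequence number wins), with the
>> >> >>> EPHEMERAL flag left out as above; paths are made up and the wait on
>> >> >>> the predecessor node is only hinted at:]
>> >> >>>
>> >> >>>     import org.apache.zookeeper.*;
>> >> >>>     import java.util.Collections;
>> >> >>>     import java.util.List;
>> >> >>>
>> >> >>>     String acquireLock(ZooKeeper zk)
>> >> >>>             throws KeeperException, InterruptedException {
>> >> >>>         String me = zk.create("/locks/resource/lock-", new byte[0],
>> >> >>>                               ZooDefs.Ids.OPEN_ACL_UNSAFE,
>> >> >>>                               CreateMode.PERSISTENT_SEQUENTIAL);
>> >> >>>         while (true) {
>> >> >>>             List<String> children = zk.getChildren("/locks/resource", false);
>> >> >>>             Collections.sort(children);
>> >> >>>             if (me.endsWith(children.get(0))) {
>> >> >>>                 return me; // lowest sequence number: lock acquired
>> >> >>>             }
>> >> >>>             // The real recipe sets a watch on the child just before ours
>> >> >>>             // and waits for it to go away; this sketch just polls.
>> >> >>>             Thread.sleep(100);
>> >> >>>         }
>> >> >>>     }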
>> >> >>>
>> >> >>> The EPHEMERAL flag is useful if the owner of the lock disappears
>> >> >>> (exiting abnormally). From the point of view of the Zookeeper
>> >> >>> ensemble, the connection times out (or is closed explicitly) and the
>> >> >>> lock is released. That's great.
>> >> >>>
>> >> >>> However, if I lose the connection temporarily (network outage), the
>> >> >>> Zookeeper ensemble again sees the connection timing out, but the
>> >> >>> owner of the lock is actually still there, doing some work on the
>> >> >>> locked resource. The lock is released by Zookeeper anyway.
>> >> >>>
>> >> >>> How should this case be handled?
>> >> >>>
>> >> >>> All I can see is that the owner can only verify that the lock is no
>> >> >>> longer owned either because releasing the lock will give a Session
>> >> >>> Expired error (assuming we retry reconnecting while we get a
>> >> >>> Connection Loss error), or because of an event sent at some point
>> >> >>> when the connection is also closed automatically on the client side
>> >> >>> by libkeeper (not sure about this last point). Knowing that the
>> >> >>> connection expired necessarily means that the lock was lost, but by
>> >> >>> then it may be too late.
>> >> >>>
>> >> >>> I mean that there is a short time lapse where the process that owns
>> >> >>> the lock has not yet tried to release it and thus doesn't know it has
>> >> >>> lost it, while another process was able to acquire it in the
>> >> >>> meantime. This is a big problem.
>> >> >>>
>> >> >>> That's why I avoid the EPHEMERAL flag for now, and plan to rely on a
>> >> >>> periodic cleaning task to drop locks no longer owned by some process
>> >> >>> (a task which is not trivial either).
>> >> >>>
>> >> >>> I would appreciate any tips to handle such a situation in a better
>> >> >>> way. What is your experience in such cases?
>> >> >>>
>> >> >>> Regards,
>> >> >>>
>> >> >>> --
>> >> >>> Frédéric Jolliton
>> >> >>> Outscale SAS
>> >> >>>
>> >> >>>
>> >> >>
>> >> >
>> >>
>> >>
>>
>>

