zookeeper-user mailing list archives

From Jordan Zimmerman <jzimmer...@netflix.com>
Subject Re: Locks based on ephemeral nodes - Handling network outage correctly
Date Fri, 14 Oct 2011 16:39:00 GMT
FYI - Curator checks for KeeperException.Code.NODEEXISTS in its retry loop
and simply ignores it, treating it as a success. I'm not sure whether other
libraries do that. So this is one case where a disconnection can be handled
generically.
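
A minimal sketch of the idea, not Curator's actual code: the retry loop
below treats "node already exists" as success only when an earlier attempt
in the same loop hit a connection loss, since that earlier attempt may have
been applied on the server. The NodeStore interface, the createWithRetry
method and the two exception classes are hypothetical stand-ins (for a
ZooKeeper client handle and for KeeperException.ConnectionLossException /
KeeperException.NodeExistsException) so the example runs without the
ZooKeeper jar. Propagating the error on the very first attempt is a choice
this sketch makes, not something stated above.

```java
final class RetryCreateSketch {
    // Hypothetical stand-ins for the corresponding KeeperException subclasses.
    public static class ConnectionLossException extends Exception {}
    public static class NodeExistsException extends Exception {}

    // Hypothetical stand-in for the part of a ZooKeeper handle that creates a node.
    public interface NodeStore {
        void create(String path) throws ConnectionLossException, NodeExistsException;
    }

    /**
     * Tries to create a (non-sequential) node, retrying on connection loss.
     * Returns true if the node is known to exist after the call, false if
     * every attempt ended in a connection loss.
     */
    public static boolean createWithRetry(NodeStore store, String path, int maxAttempts)
            throws NodeExistsException {
        // Set once an attempt was sent but its outcome is unknown.
        boolean earlierAttemptMayHaveSucceeded = false;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.create(path);
                return true;
            } catch (ConnectionLossException e) {
                // The request may or may not have reached the server.
                earlierAttemptMayHaveSucceeded = true;
            } catch (NodeExistsException e) {
                if (earlierAttemptMayHaveSucceeded) {
                    // Most likely our own earlier create went through: treat as success.
                    return true;
                }
                // First attempt: the node was already there before we tried.
                throw e;
            }
        }
        return false;
    }
}
```

This only works for nodes with a fixed name. With SEQUENTIAL nodes every
retry creates a node under a new name, so the error never fires and the
double write described in the quoted message below can still happen. The
sketch also cannot tell whether some other client created the node in the
meantime, so a lock implementation would still need to verify ownership.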

-JZ

On 10/14/11 7:20 AM, "Fournier, Camille F." <Camille.Fournier@gs.com>
wrote:

>Pretty much all of the Java client wrappers out there in the wild have
>some sort of a retry loop around operations, to make some of this easier
>to deal with. But they don't to my knowledge deal with the situation of
>knowing whether an operation succeeded in the case of a disconnect (it is
>possible to push out a request, and get a disconnect back before you get
>a response for that request so you don't know if your request succeeded
>or failed). So you may end up, for example, writing something twice in
>the case of writing a SEQUENTIAL-type node. For many use cases of
>sequential, this isn't a big deal.
>
>I don't know of anything that handles this in a more subtle way than
>simply retrying. As Ted has mentioned in earlier emails on the subject,
>"You can't just assume that you can retry an operation on Zookeeper and
>get the right result. The correct handling is considerably more subtle.
>Hiding that is not a good thing unless you say right up front that you
>are compromising either expressivity (as does Kept Collections) or
>correctness (as does zkClient)."
>
>It's not clear to me that it is possible to write a generic client to
>"correctly" handle retries on disconnect because what correct means
>varies from use case to use case. One of the challenges I think for
>getting comfortable with using ZK is knowing the correctness bounds for
>your particular use case and understanding the failure scenarios wrt that
>use case and ZK. 
>
>C
>
>
>-----Original Message-----
>From: Mike Schilli [mailto:m@perlmeister.com]
>Sent: Thursday, October 13, 2011 9:27 PM
>To: user@zookeeper.apache.org
>Subject: Re: Locks based on ephemeral nodes - Handling network outage
>correctly
>
>On Wed, 12 Oct 2011, Ted Dunning wrote:
>
>> ZK will tell you when the connection is lost (but not yet expired).
>> When this happens, the application needs to pay attention and pause
>> before continuing to assume it still has the lock.
>
>I think this applies to every write operation in ZooKeeper, which I find
>is a challenge to deal with.
>
>So basically, every time an application writes something to ZooKeeper,
>it needs to check the result, but what to do if it fails? Check if it's
>an error indicating the connection was lost, and try a couple of times
>to reinstate the connection and replay the write? At least, that's what
>the documentation of the Perl Wrapper in Net::ZooKeeper suggests.
>
>Are there best practices around this, or, better yet, a client API that
>actually implements this, so the application doesn't have to implement
>a ZooKeeper wrapper? Something like "retry 3 times with 10 second waits
>in between and fail otherwise".
>
>-- -- Mike
>
>Mike Schilli
>m@perlmeister.com
>
>
>
>>
>> 2011/10/12 Frédéric Jolliton <frederic@jolliton.com>
>>
>>> Hello all,
>>>
>>> There is something that bothers me about ephemeral nodes.
>>>
>>> I need to create some locks using Zookeeper. I followed the "official"
>>> recipe, except that I don't use the EPHEMERAL flag. The reason for that
>>> is that I don't know how I should proceed if the connection to the
>>> Zookeeper ensemble is ever lost. But otherwise, everything works nicely.
>>>
>>> The EPHEMERAL flag is useful if the owner of the lock disappears
>>> (exiting abnormally). From the point of view of the Zookeeper ensemble,
>>> the connection times out (or is closed explicitly) and the lock is
>>> released. That's great.
>>>
>>> However, if I lose the connection temporarily (network outage), the
>>> Zookeeper ensemble again sees the connection timing out, but the owner
>>> of the lock is actually still there, doing some work on the locked
>>> resource. The lock is released by Zookeeper anyway.
>>>
>>> How should this case be handled?
>>>
>>> All I can see is that the owner can only find out that the lock is no
>>> longer owned because releasing the lock will give a Session Expired
>>> error (assuming we retry reconnecting while we get a Connection Loss
>>> error) or because of an event sent at some point because the connection
>>> was also closed automatically on the client side by libkeeper (not sure
>>> about this last point). Knowing that the connection expired necessarily
>>> means that the lock was lost, but by then it may be too late.
>>>
>>> I mean that there is a short time lapse where the process that owns the
>>> lock has not tried to release it yet and thus doesn't know it lost it,
>>> and another process was able to acquire it too in the meantime. This is
>>> a big problem.
>>>
>>> That's why I avoid the EPHEMERAL flag for now, and plan to rely on a
>>> periodic cleanup task to drop locks no longer owned by any process (a
>>> task which is not trivial either).
>>>
>>> I would appreciate any tips to handle such a situation in a better way.
>>> What is your experience in such cases?
>>>
>>> Regards,
>>>
>>> --
>>> Frédéric Jolliton
>>> Outscale SAS
>>>
>>>
>>
>

