zookeeper-user mailing list archives

From Jordan Zimmerman <jzimmer...@netflix.com>
Subject Re: Locks based on ephemeral nodes - Handling network outage correctly
Date Fri, 14 Oct 2011 16:56:22 GMT
I agree - I think what I'm doing is a mistake and I'm going to rethink it.

-JZ

On 10/14/11 9:54 AM, "Fournier, Camille F." <Camille.Fournier@gs.com>
wrote:

>But how can you know that NODEEXISTS indicates success? I see in your
>code that you only swallow it if you retried the call. But there's no way
>to know if in fact the NODEEXISTS was due to your code trying to create
>the node twice (and swallowing the first creation in a disconnected
>error), or a disconnected error causing a retry that hit a NODEEXISTS due
>to another client creating the same node.
>
>That's what I mean by no solution that works for every use case.
>
>C
>
>-----Original Message-----
>From: Jordan Zimmerman [mailto:jzimmerman@netflix.com]
>Sent: Friday, October 14, 2011 12:39 PM
>To: user@zookeeper.apache.org; 'Mike Schilli'
>Subject: Re: Locks based on ephemeral nodes - Handling network outage
>correctly
>
>FYI - Curator checks for KeeperException.Code.NODEEXISTS in its retry loop
>and just ignores it, treating it as a success. I'm not sure if other
>libraries do that. So, this is a case where a disconnection can be handled
>generically.
>
>-JZ
>
>On 10/14/11 7:20 AM, "Fournier, Camille F." <Camille.Fournier@gs.com>
>wrote:
>
>>Pretty much all of the Java client wrappers out there in the wild have
>>some sort of a retry loop around operations, to make some of this easier
>>to deal with. But they don't to my knowledge deal with the situation of
>>knowing whether an operation succeeded in the case of a disconnect (it is
>>possible to push out a request, and get a disconnect back before you get
>>a response for that request so you don't know if your request succeeded
>>or failed). So you may end up, for example, writing something twice in
>>the case of writing a SEQUENTIAL-type node. For many use cases of
>>sequential, this isn't a big deal.
>>
>>I don't know of anything that handles this in a more subtle way than
>>simply retrying. As Ted has mentioned in earlier emails on the subject,
>>"You can't just assume that you can retry an operation on Zookeeper and
>>get the right result.  The correct handling is considerably more subtle.
>>Hiding that is not a good thing unless you say right up front that you
>>are compromising either expressivity (as does Kept Collections) or
>>correctness (as does zkClient)."
>>
>>It's not clear to me that it is possible to write a generic client to
>>"correctly" handle retries on disconnect because what correct means
>>varies from use case to use case. One of the challenges I think for
>>getting comfortable with using ZK is knowing the correctness bounds for
>>your particular use case and understanding the failure scenarios wrt that
>>use case and ZK. 
>>
>>C
>>
>>
>>-----Original Message-----
>>From: Mike Schilli [mailto:m@perlmeister.com]
>>Sent: Thursday, October 13, 2011 9:27 PM
>>To: user@zookeeper.apache.org
>>Subject: Re: Locks based on ephemeral nodes - Handling network outage
>>correctly
>>
>>On Wed, 12 Oct 2011, Ted Dunning wrote:
>>
>>> ZK will tell you when the connection is lost (but not yet expired).
>>> When this happens, the application needs to pay attention and pause
>>> before continuing to assume it still has the lock.
>>
>>I think this applies to every write operation in ZooKeeper, which I find
>>is a challenge to deal with.
>>
>>So basically, every time an application writes something to ZooKeeper,
>>it needs to check the result, but what to do if it fails? Check if it's
>>an error indicating the connection was lost, and try a couple of times
>>to reinstate the connection and replay the write? At least, that's what
>>the documentation of the Perl Wrapper in Net::ZooKeeper suggests.
>>
>>Are there best practices around this, or, better yet, a client API that
>>actually implements this, so the application doesn't have to implement
>>a ZooKeeper wrapper? Something like "retry 3 times with 10 second waits
>>in between and fail otherwise".
>>
>>-- -- Mike
>>
>>Mike Schilli
>>m@perlmeister.com
>>
>>
>>
>>>
>>> 2011/10/12 Frédéric Jolliton <frederic@jolliton.com>
>>>
>>>> Hello all,
>>>>
>>>> There is something that bothers me about ephemeral nodes.
>>>>
>>>> I need to create some locks using Zookeeper. I followed the "official"
>>>> recipe, except that I don't use the EPHEMERAL flag. The reason for
>>>> that is that I don't know how I should proceed if the connection to
>>>> the Zookeeper ensemble is ever lost. But otherwise, everything works
>>>> nicely.
>>>>
>>>> The EPHEMERAL flag is useful if the owner of the lock disappears
>>>> (exiting abnormally). From the point of view of the Zookeeper
>>>> ensemble, the connection times out (or is closed explicitly) and the
>>>> lock is released. That's great.
>>>>
>>>> However, if I lose the connection temporarily (network outage), the
>>>> Zookeeper ensemble again sees the connection timing out, but actually
>>>> the owner of the lock is still there doing some work on the locked
>>>> resource. But the lock is released by Zookeeper anyway.
>>>>
>>>> How should this case be handled?
>>>>
>>>> All I can see is that the owner can only verify that the lock was no
>>>> longer owned because releasing the lock will give a Session Expired
>>>> error (assuming we retry reconnecting while we get a Connection Loss
>>>> error) or because of an event sent at some point because the
>>>> connection was also closed automatically on the client side by
>>>> libkeeper (not sure about this last point). Knowing that the
>>>> connection expired necessarily means that the lock was lost, but it
>>>> may be too late.
>>>>
>>>> I mean that there is a short time lapse where the process that owns
>>>> the lock has not tried to release it yet and thus doesn't know it lost
>>>> it, and another process was able to acquire it too in the meantime.
>>>> This is a big problem.
>>>>
>>>> That's why I avoid the EPHEMERAL flag for now, and plan to rely on a
>>>> periodic cleaning task to drop locks no longer owned by some process
>>>> (a task which is not trivial either).
>>>>
>>>> I would appreciate any tips to handle such a situation in a better way.
>>>> What is your experience in such cases?
>>>>
>>>> Regards,
>>>>
>>>> --
>>>> Frédéric Jolliton
>>>> Outscale SAS
>>>>
>>>>
>>>
>>
>
>
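
The NODEEXISTS ambiguity raised in this thread can be made concrete with a small sketch. Nothing below is Curator's or ZooKeeper's actual code: `FakeStore`, `ConnectionLoss`, `NodeExists`, and `create_with_owner_check` are hypothetical stand-ins, and the mitigation shown (storing a per-client id in the node and reading it back on NODEEXISTS) is one possible approach under those assumptions, not something proposed in the messages above.

```python
import uuid


class ConnectionLoss(Exception):
    """Hypothetical stand-in for a ZooKeeper connection-loss error."""


class NodeExists(Exception):
    """Hypothetical stand-in for a ZooKeeper NODEEXISTS error."""


class FakeStore:
    """In-memory stand-in for an ensemble; this is not a real client."""

    def __init__(self):
        self.nodes = {}
        self.drop_next_reply = False  # simulate losing the reply to a write

    def create(self, path, data):
        if path in self.nodes:
            raise NodeExists(path)
        self.nodes[path] = data
        if self.drop_next_reply:
            self.drop_next_reply = False
            # The write was applied, but the client never hears about it.
            raise ConnectionLoss(path)

    def get(self, path):
        return self.nodes[path]


def create_with_owner_check(store, path, owner_id, max_attempts=3):
    """Return True if we own the node at `path`, False if someone else does."""
    for _ in range(max_attempts):
        try:
            store.create(path, owner_id)
            return True
        except ConnectionLoss:
            continue  # outcome unknown: retry and let NodeExists resolve it
        except NodeExists:
            # Only the stored owner id tells our own earlier write apart
            # from a node created by another client.
            return store.get(path) == owner_id
    raise ConnectionLoss(path)


# A reply lost after a successful write is still recognised as ours.
store = FakeStore()
my_id = uuid.uuid4().hex
store.drop_next_reply = True
print(create_with_owner_check(store, "/lock", my_id))
```

Simply swallowing NODEEXISTS would return success in both situations; the read-back is what separates them. Against a real ensemble the read itself can fail or race with a delete, and sequential nodes need a different way to find the earlier write, so this remains use-case specific.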


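Ted Dunning's advice in the thread (pause when the connection is lost, before continuing to assume the lock is still held) can be sketched as a small state tracker. The event names mirror the `Disconnected`, `SyncConnected`, and `Expired` connection states of the ZooKeeper Java client, but the `LockHolder` class and the way events are delivered to it are assumptions made for illustration only.

```python
class LockHolder:
    """Tracks whether it is safe to act on a lock held via an ephemeral node."""

    def __init__(self):
        self.state = "HELD"  # one of HELD, SUSPENDED, LOST

    def on_connection_event(self, event):
        if self.state == "LOST":
            return  # a lost lock has to be re-acquired from scratch
        if event == "Disconnected":
            # Session may or may not survive; stop assuming we hold the lock.
            self.state = "SUSPENDED"
        elif event == "SyncConnected":
            # Reconnected within the session timeout: the ephemeral node
            # is still there, so the lock is still ours.
            self.state = "HELD"
        elif event == "Expired":
            # Session expired: the ensemble has deleted the ephemeral node.
            self.state = "LOST"

    def may_touch_resource(self):
        return self.state == "HELD"


holder = LockHolder()
holder.on_connection_event("Disconnected")
print(holder.may_touch_resource())
```

The client only learns about an expired session after it reconnects, so a process that stays partitioned would also need a local timer shorter than the session timeout to give up the lock on its own. That timer is left out of this sketch.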