zookeeper-user mailing list archives

From "Fournier, Camille F." <Camille.Fourn...@gs.com>
Subject RE: Locks based on ephemeral nodes - Handling network outage correctly
Date Fri, 14 Oct 2011 16:54:45 GMT
But how can you know that NODEEXISTS indicates success? I see in your code that you only swallow
it if you retried the call. But there's no way to know if in fact the NODEEXISTS was due to
your code trying to create the node twice (and swallowing the first creation in a disconnected
error), or a disconnected error causing a retry that hit a NODEEXISTS due to another client
creating the same node.

That's what I mean by no solution that works for every use case. 
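
One way some applications narrow this ambiguity is to write a unique owner token into the node's data and read it back when a retried create reports that the node exists. The following is a minimal sketch of that idea, not Curator's or ZooKeeper's actual API: `FakeStore`, `OwnerTokenLock`, and the exception classes are hypothetical in-memory stand-ins, and the read-back is assumed never to fail, which a real client could not assume.

```java
import java.util.Map;
import java.util.UUID;
import java.util.concurrent.ConcurrentHashMap;

/** Hypothetical in-memory stand-in for an ensemble; NOT the real ZooKeeper client API. */
class FakeStore {
    static class ConnectionLoss extends Exception {}
    static class NodeExists extends Exception {}

    private final Map<String, String> nodes = new ConcurrentHashMap<>();
    private boolean dropNextResponse = false;

    /** Make the next successful create() apply on the "server" but lose its response. */
    void dropNextResponse() { dropNextResponse = true; }

    void create(String path, String data) throws ConnectionLoss, NodeExists {
        if (nodes.putIfAbsent(path, data) != null) {
            throw new NodeExists();
        }
        if (dropNextResponse) {
            dropNextResponse = false;
            throw new ConnectionLoss(); // the write was applied, but the caller never hears back
        }
    }

    String getData(String path) { return nodes.get(path); }
}

class OwnerTokenLock {
    /** Returns true only if the node at path ends up carrying our own token. */
    static boolean tryAcquire(FakeStore store, String path, String token, int maxAttempts) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                store.create(path, token);
                return true;
            } catch (FakeStore.NodeExists e) {
                // Ambiguous on its own: compare the stored token with ours
                // instead of guessing who created the node.
                return token.equals(store.getData(path));
            } catch (FakeStore.ConnectionLoss e) {
                // Outcome unknown: the create may or may not have been applied. Retry.
            }
        }
        return false;
    }
}

class OwnerTokenDemo {
    public static void main(String[] args) {
        FakeStore store = new FakeStore();
        store.dropNextResponse();
        String mine = UUID.randomUUID().toString();
        // The first create is applied but its response is lost; the retry sees our own token.
        System.out.println(OwnerTokenLock.tryAcquire(store, "/lock", mine, 3));  // true
        String other = UUID.randomUUID().toString();
        // A different client hits the existing node and must not treat it as success.
        System.out.println(OwnerTokenLock.tryAcquire(store, "/lock", other, 3)); // false
    }
}
```

This only helps when the node can carry an owner token; it does nothing for SEQUENTIAL nodes, where a retry creates a second node rather than colliding with the first.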


-----Original Message-----
From: Jordan Zimmerman [mailto:jzimmerman@netflix.com] 
Sent: Friday, October 14, 2011 12:39 PM
To: user@zookeeper.apache.org; 'Mike Schilli'
Subject: Re: Locks based on ephemeral nodes - Handling network outage correctly

FYI - Curator checks for KeeperException.Code.NODEEXISTS in its retry loop
and just ignores it, treating it as a success. I'm not sure if other
libraries do that. So, this is a case that a disconnection can be handled
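
For readers who have not seen such a loop, here is a rough sketch of the general pattern described above: retry on connection loss, and treat "node exists" as success only if at least one retry has already happened. It is an illustration under assumptions, not Curator's code; `RetryLoop`, `CreateOp`, and the two exception classes are hypothetical names that stand in for the real client calls and for KeeperException codes.

```java
/** Hypothetical retry helper; the exceptions stand in for ZooKeeper error codes. */
class RetryLoop {
    static class ConnectionLoss extends Exception {}
    static class NodeExists extends Exception {}

    /** A create-style operation that may fail in either of the two ways above. */
    interface CreateOp {
        void run() throws ConnectionLoss, NodeExists;
    }

    /**
     * Runs op up to maxAttempts times, sleeping waitMillis between attempts.
     * Returns true if the node is believed to exist because of this caller.
     */
    static boolean createWithRetry(CreateOp op, int maxAttempts, long waitMillis) {
        boolean retried = false;
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                op.run();
                return true;
            } catch (NodeExists e) {
                // Swallowed only after a retry. This is a guess: another client
                // may have created the node while we were disconnected.
                return retried;
            } catch (ConnectionLoss e) {
                retried = true;
                if (attempt < maxAttempts) {
                    try {
                        Thread.sleep(waitMillis);
                    } catch (InterruptedException ie) {
                        Thread.currentThread().interrupt(); // preserve the interrupt and give up
                        return false;
                    }
                }
            }
        }
        return false; // every attempt ended in a connection loss
    }
}

class RetryLoopDemo {
    public static void main(String[] args) {
        int[] calls = {0};
        // First call loses the connection, second call reports that the node exists.
        boolean ok = RetryLoop.createWithRetry(() -> {
            if (calls[0]++ == 0) throw new RetryLoop.ConnectionLoss();
            throw new RetryLoop.NodeExists();
        }, 3, 10L);
        System.out.println(ok); // true: NodeExists after a retry is treated as success
    }
}
```

The `maxAttempts` and `waitMillis` parameters correspond to a policy such as "retry 3 times with 10 second waits in between and fail otherwise".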


On 10/14/11 7:20 AM, "Fournier, Camille F." <Camille.Fournier@gs.com> wrote:

>Pretty much all of the Java client wrappers out there in the wild have
>some sort of a retry loop around operations, to make some of this easier
>to deal with. But they don't to my knowledge deal with the situation of
>knowing whether an operation succeeded in the case of a disconnect (it is
>possible to push out a request, and get a disconnect back before you get
>a response for that request so you don't know if your request succeeded
>or failed). So you may end up, for example, writing something twice in
>the case of writing a SEQUENTIAL-type node. For many use cases of
>sequential, this isn't a big deal.
>I don't know of anything that handles this in a more subtle way than
>simply retrying. As Ted has mentioned in earlier emails on the subject,
>"You can't just assume that you can retry an operation on Zookeeper and
>get the right result.  The correct handling is considerably more subtle.
>Hiding that is not a good thing unless you say right up front that you
>are compromising either expressivity (as does Kept Collections) or
>correctness (as does zkClient)."
>It's not clear to me that it is possible to write a generic client to
>"correctly" handle retries on disconnect because what correct means
>varies from use case to use case. One of the challenges I think for
>getting comfortable with using ZK is knowing the correctness bounds for
>your particular use case and understanding the failure scenarios wrt that
>use case and ZK. 
>
>-----Original Message-----
>From: Mike Schilli [mailto:m@perlmeister.com]
>Sent: Thursday, October 13, 2011 9:27 PM
>To: user@zookeeper.apache.org
>Subject: Re: Locks based on ephemeral nodes - Handling network outage
>
>On Wed, 12 Oct 2011, Ted Dunning wrote:
>> ZK will tell you when the connection is lost (but not yet expired). When
>> this happens, the application needs to pay attention and pause before
>> continuing to assume it still has the lock.
>I think this applies to every write operation in ZooKeeper, which I find
>is a challenge to deal with.
>So basically, every time an application writes something to ZooKeeper,
>it needs to check the result, but what to do if it fails? Check if it's
>an error indicating the connection was lost, and try a couple of times
>to reinstate the connection and replay the write? At least, that's what
>the documentation of the Perl Wrapper in Net::ZooKeeper suggests.
>Are there best practices around this, or, better yet, a client API that
>actually implements this, so the application doesn't have to implement
>a ZooKeeper wrapper? Something like "retry 3 times with 10 second waits
>in between and fail otherwise".
>-- Mike
>Mike Schilli
>> 2011/10/12 Frédéric Jolliton <frederic@jolliton.com>
>>> Hello all,
>>> There is something that bothers me about ephemeral nodes.
>>> I need to create some locks using Zookeeper. I followed the "official"
>>> recipe, except that I don't use the EPHEMERAL flag. The reason for that
>>> is that I don't know how I should proceed if the connection to the
>>> ensemble is ever lost. But otherwise, everything works nicely.
>>> The EPHEMERAL flag is useful if the owner of the lock disappears
>>> abnormally. From the point of view of the Zookeeper ensemble, when the
>>> connection times out (or is closed explicitly), the lock is released.
>>> That's great.
>>> However, if I lose the connection temporarily (network outage), the
>>> Zookeeper ensemble again sees the connection timing out, but actually
>>> the owner of the lock is still there doing some work on the locked
>>> resource. But the lock is released by Zookeeper anyway.
>>> How should this case be handled?
>>> All I can see is that the owner can only verify that the lock was no
>>> longer owned because releasing the lock will give a Session Expired
>>> error (assuming we retry reconnecting while we get a Connection Loss
>>> error) or because of an event sent at some point because the connection
>>> was also closed automatically on the client side by libkeeper (not sure
>>> about this last point). Knowing that the connection expired necessarily
>>> means that the lock was lost, but it may be too late.
>>> I mean that there is a short time lapse where the process that owns the
>>> lock has not tried to release it yet and thus doesn't know it lost it,
>>> and another process was able to acquire it too in the meantime. This is
>>> a big problem.
>>> That's why I avoid the EPHEMERAL flag for now, and plan to rely on a
>>> periodic cleaning task to drop locks no longer owned by some process (a
>>> task which is not trivial either.)
>>> I would appreciate any tips to handle such a situation in a better way.
>>> What is your experience in such cases?
>>> Regards,
>>> --
>>> Frédéric Jolliton
>>> Outscale SAS
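
Ted Dunning's advice in the thread (pause when the connection is lost, before continuing to assume the lock is still held) can be sketched as a small state machine on the lock holder's side. This is only an illustration under assumptions: `LockGuard` and its two enums are hypothetical names, with `ConnState` loosely mirroring the Disconnected, SyncConnected, and Expired session states that the ZooKeeper client reports to watchers.

```java
/** Hypothetical client-side guard; it consults no server and only tracks reported events. */
class LockGuard {
    /** Assumed to mirror the client's Disconnected / SyncConnected / Expired states. */
    enum ConnState { CONNECTED, DISCONNECTED, EXPIRED }

    enum LockState { HELD, SUSPENDED, LOST }

    private LockState state = LockState.HELD;

    /** Feed every connection event here, e.g. from the session's default watcher. */
    synchronized void onStateChange(ConnState s) {
        if (state == LockState.LOST) {
            return; // an expired session never gets its ephemeral lock node back
        }
        switch (s) {
            case DISCONNECTED:
                state = LockState.SUSPENDED; // session may still be alive: pause, don't give up
                break;
            case CONNECTED:
                state = LockState.HELD;      // reconnected within the session timeout
                break;
            case EXPIRED:
                state = LockState.LOST;      // the ensemble has deleted the ephemeral node
                break;
        }
    }

    /** Check this before each unit of work on the locked resource. */
    synchronized boolean mayProceed() {
        return state == LockState.HELD;
    }

    synchronized LockState state() {
        return state;
    }
}

class LockGuardDemo {
    public static void main(String[] args) {
        LockGuard guard = new LockGuard();
        guard.onStateChange(LockGuard.ConnState.DISCONNECTED);
        System.out.println(guard.mayProceed()); // false: paused while the outcome is unknown
        guard.onStateChange(LockGuard.ConnState.CONNECTED);
        System.out.println(guard.mayProceed()); // true: same session, lock still held
        guard.onStateChange(LockGuard.ConnState.EXPIRED);
        System.out.println(guard.mayProceed()); // false: lock lost for good
    }
}
```

A guard like this shrinks the window Frédéric describes but does not close it: work already in flight when the disconnect event arrives is not protected.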
