zookeeper-user mailing list archives

From "Fournier, Camille F." <Camille.Fourn...@gs.com>
Subject RE: Locks based on ephemeral nodes - Handling network outage correctly
Date Fri, 14 Oct 2011 14:20:46 GMT
Pretty much all of the Java client wrappers out there in the wild have some sort of a retry
loop around operations, to make some of this easier to deal with. But to my knowledge they don't
deal with the situation of not knowing whether an operation succeeded when a disconnect occurs
(it is possible to send a request and get a disconnect back before you get the response
for that request, so you don't know whether the request succeeded or failed). So you may end up,
for example, creating a node twice when writing a SEQUENTIAL-type node. For many
use cases of sequential nodes, this isn't a big deal.
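
To make the duplicate-write case concrete, here is a minimal sketch in Python. It does not use a real ZooKeeper client: `FakeEnsemble`, `ConnectionLoss`, `naive_create` and `guarded_create` are hypothetical names invented for this illustration. The fake ensemble applies the write and then loses the reply, which is the situation described above. The guarded variant puts a unique token in the node name and looks for it before retrying, similar in spirit to the GUID approach mentioned in the ZooKeeper recipes documentation.

```python
import uuid


class ConnectionLoss(Exception):
    """Hypothetical error: the reply was lost, so the outcome is unknown."""


class FakeEnsemble:
    """In-memory stand-in for a ZooKeeper ensemble (illustration only)."""

    def __init__(self, drop_replies=0):
        self.children = []                # created node names, in order
        self.counter = 0                  # sequence number for the next node
        self.drop_replies = drop_replies  # how many replies to lose

    def create_sequential(self, prefix):
        # The write is applied on the "server" first...
        name = f"{prefix}{self.counter:010d}"
        self.counter += 1
        self.children.append(name)
        # ...and only then is the reply lost on the way back.
        if self.drop_replies > 0:
            self.drop_replies -= 1
            raise ConnectionLoss(prefix)
        return name

    def get_children(self):
        return list(self.children)


def naive_create(zk, prefix, attempts=3):
    """Blind retry: can leave more than one node behind."""
    for _ in range(attempts):
        try:
            return zk.create_sequential(prefix)
        except ConnectionLoss:
            continue
    raise ConnectionLoss(prefix)


def guarded_create(zk, prefix, attempts=3):
    """Retry only if no node carrying our unique token exists yet."""
    tagged = f"{prefix}{uuid.uuid4().hex}-"
    for _ in range(attempts):
        try:
            return zk.create_sequential(tagged)
        except ConnectionLoss:
            mine = [c for c in zk.get_children() if c.startswith(tagged)]
            if mine:
                return mine[0]  # the earlier attempt did succeed
    raise ConnectionLoss(prefix)
```

With one lost reply, `naive_create` leaves two nodes in the fake ensemble while `guarded_create` leaves one. Whether the extra node matters still depends on the use case, and in a real client the `get_children` check can itself fail with a connection loss, which this sketch ignores.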

I don't know of anything that handles this in a more subtle way than simply retrying. As Ted
has mentioned in earlier emails on the subject,
"You can't just assume that you can retry an operation on Zookeeper and get the right result.
The correct handling is considerably more subtle.  Hiding that is not a good thing unless
you say right up front that you are compromising either expressivity (as does Kept Collections)
or correctness (as does zkClient)."

It's not clear to me that it is possible to write a generic client to "correctly" handle retries
on disconnect, because what correct means varies from use case to use case. I think one of the
challenges in getting comfortable with using ZK is knowing the correctness bounds for your particular
use case and understanding the failure scenarios with respect to that use case and ZK.

C


-----Original Message-----
From: Mike Schilli [mailto:m@perlmeister.com] 
Sent: Thursday, October 13, 2011 9:27 PM
To: user@zookeeper.apache.org
Subject: Re: Locks based on ephemeral nodes - Handling network outage correctly

On Wed, 12 Oct 2011, Ted Dunning wrote:

> ZK will tell you when the connection is lost (but not yet expired).  When
> this happens, the application needs to pay attention and pause before
> continuing to assume it still has the lock.
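
The advice quoted above can be sketched as a small state machine. This is only an illustration under stated assumptions: `ConnState` and `LockHolder` are hypothetical names, not part of any ZooKeeper client, and the three states stand in for the connected, disconnected and session-expired notifications a real client delivers.

```python
from enum import Enum


class ConnState(Enum):
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"  # connection lost, session may still be alive
    EXPIRED = "expired"            # session gone, ephemeral nodes are deleted


class LockHolder:
    """Tracks whether it is safe to keep acting as the lock owner."""

    def __init__(self):
        self.state = ConnState.CONNECTED
        self.holds_lock = False

    def acquired(self):
        # Called once the lock node has been created and confirmed.
        self.holds_lock = True

    def on_state_change(self, new_state):
        self.state = new_state
        if new_state is ConnState.EXPIRED:
            # The ephemeral lock node went away with the session.
            self.holds_lock = False

    def may_work(self):
        # While disconnected we cannot know whether the session has
        # expired on the server, so pause instead of assuming ownership.
        return self.holds_lock and self.state is ConnState.CONNECTED
```

The application would check `may_work()` before each step on the locked resource. After a plain disconnect followed by a reconnect, work resumes; after an expiry, the lock has to be acquired again.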

I think this applies to every write operation in ZooKeeper, which I find
is a challenge to deal with.

So basically, every time an application writes something to ZooKeeper,
it needs to check the result, but what to do if it fails? Check if it's
an error indicating the connection was lost, and try a couple of times
to reinstate the connection and replay the write? At least, that's what
the documentation of the Perl Wrapper in Net::ZooKeeper suggests.

Are there best practices around this, or, better yet, a client API that
actually implements this, so the application doesn't have to implement
a ZooKeeper wrapper? Something like "retry 3 times with 10 second waits
in between and fail otherwise".
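
A wrapper of the kind asked about here might look like the following sketch. The names `with_retries` and `ConnectionLoss` are hypothetical and not taken from Net::ZooKeeper or any other client; the defaults mirror the "3 times with 10 second waits" wording. Note that it simply replays the operation, so it is only appropriate when replaying is harmless for the use case.

```python
import time


class ConnectionLoss(Exception):
    """Hypothetical error standing in for a client's connection-loss code."""


def with_retries(op, attempts=3, wait_seconds=10.0, sleep=time.sleep):
    """Call op(); on ConnectionLoss wait and try again, up to `attempts` times.

    Any other exception propagates immediately. After the last failed
    attempt the ConnectionLoss is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    last_error = None
    for attempt in range(attempts):
        try:
            return op()
        except ConnectionLoss as err:
            last_error = err
            if attempt < attempts - 1:
                sleep(wait_seconds)  # no wait after the final failure
    raise last_error
```

Usage would be along the lines of `with_retries(lambda: client.set(path, data))`, where `client.set` is a placeholder for whatever write call the application makes.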

-- Mike

Mike Schilli
m@perlmeister.com



>
> 2011/10/12 Frédéric Jolliton <frederic@jolliton.com>
>
>> Hello all,
>>
>> There is something that bothers me about ephemeral nodes.
>>
>> I need to create some locks using Zookeeper. I followed the "official"
>> recipe, except that I don't use the EPHEMERAL flag. The reason for that
>> is that I don't know how I should proceed if the connection to Zookeeper
>> ensemble is ever lost. But otherwise, everything works nicely.
>>
>> The EPHEMERAL flag is useful if the owner of the lock disappears (exiting
>> abnormally). From the point of view of the Zookeeper ensemble, the
>> connection times out (or is closed explicitly) and the lock is released.
>> That's great.
>>
>> However, if I lose the connection temporarily (network outage), the
>> Zookeeper ensemble again sees the connection timing out, but actually
>> the owner of the lock is still there doing some work on the locked
>> resource. But the lock is released by Zookeeper anyway.
>>
>> How should this case be handled?
>>
>> All I can see is that the owner can only discover that the lock is no
>> longer owned because releasing the lock will give a Session Expired
>> error (assuming we retry reconnecting while we get a Connection Loss
>> error) or because of an event sent at some point because the connection
>> was also closed automatically on the client side by libkeeper (not sure
>> about this last point). Knowing that the connection expired necessarily
>> means that the lock was lost, but it may be too late.
>>
>> I mean that there is a short time lapse where the process that owns the
>> lock has not tried to release it yet and thus doesn't know it lost it,
>> and another process was able to acquire it too in the meantime. This is
>> a big problem.
>>
>> That's why I avoid the EPHEMERAL flag for now, and plan to rely on a
>> periodic cleaning task to drop locks no longer owned by any process (a
>> task which is not trivial either).
>>
>> I would appreciate any tips to handle such a situation in a better way.
>> What is your experience in such cases?
>>
>> Regards,
>>
>> --
>> Frédéric Jolliton
>> Outscale SAS
>>
>>
>
