zookeeper-user mailing list archives

From Mike Schilli...@perlmeister.com>
Subject Re: Locks based on ephemeral nodes - Handling network outage correctly
Date Fri, 14 Oct 2011 01:27:01 GMT
On Wed, 12 Oct 2011, Ted Dunning wrote:

> ZK will tell you when the connection is lost (but not yet expired).  When
> this happens, the application needs to pay attention and pause before
> continuing to assume it still has the lock.
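
The advice quoted above could be sketched roughly as follows. This is a minimal illustration, not any client library's API: the `LockGuard` class and the event names `"connected"`, `"disconnected"`, and `"expired"` are assumptions made for the example, and a real client reports its own session-state constants.

```python
import threading


class LockGuard:
    """Tracks whether a lock holder may keep working, based on session events."""

    # Hypothetical event names; a real client exposes its own state constants.
    CONNECTED = "connected"
    DISCONNECTED = "disconnected"
    EXPIRED = "expired"

    def __init__(self):
        self._mutex = threading.Lock()
        self._state = self.CONNECTED

    def on_session_event(self, event):
        """Record a session event delivered by the client library."""
        if event not in (self.CONNECTED, self.DISCONNECTED, self.EXPIRED):
            raise ValueError("unknown session event: %r" % (event,))
        with self._mutex:
            # Expiry is final: the ephemeral node is gone, so a later
            # "connected" event must not revive the belief in the lock.
            if self._state != self.EXPIRED:
                self._state = event

    def may_proceed(self):
        """True only while the session is known to be connected."""
        with self._mutex:
            return self._state == self.CONNECTED

    def lock_lost(self):
        """True once the session has expired and the lock must be re-acquired."""
        with self._mutex:
            return self._state == self.EXPIRED
```

The worker would call `may_proceed()` before each step on the locked resource and pause while it returns false. This narrows the window in which two holders overlap, but it probably cannot close it entirely, since the client may notice the disconnect late.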

I think this applies to every write operation in ZooKeeper, which I find
challenging to deal with.

So basically, every time an application writes something to ZooKeeper,
it needs to check the result, but what to do if it fails? Check if it's
an error indicating the connection was lost, and try a couple of times
to reinstate the connection and replay the write? At least, that's what
the documentation of the Perl Wrapper in Net::ZooKeeper suggests.

Are there best practices around this, or, better yet, a client API that
actually implements this, so the application doesn't have to implement
a ZooKeeper wrapper? Something like "retry 3 times with 10 second waits
in between and fail otherwise".
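
A policy like "retry 3 times with 10 second waits in between and fail otherwise" could look roughly like the sketch below. It is deliberately independent of any ZooKeeper client: `ConnectionLossError` and `retry_on_connection_loss` are hypothetical names standing in for whatever error and wrapper a real client would use.

```python
import time


class ConnectionLossError(Exception):
    """Hypothetical stand-in for a client's 'connection lost' error."""


def retry_on_connection_loss(operation, attempts=3, wait_seconds=10.0,
                             sleep=time.sleep):
    """Call operation(); on ConnectionLossError, wait and try again.

    At most `attempts` calls are made in total. If the last one also
    fails, the error is re-raised to the caller.
    """
    if attempts < 1:
        raise ValueError("attempts must be at least 1")
    for attempt in range(1, attempts + 1):
        try:
            return operation()
        except ConnectionLossError:
            if attempt == attempts:
                raise  # out of attempts: let the caller see the failure
            sleep(wait_seconds)
```

One caveat: after a connection loss the client generally cannot tell whether the write reached the server, so blindly replaying it is only safe for operations that are idempotent or that check the current state first.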

-- Mike

Mike Schilli

> 2011/10/12 Frédéric Jolliton <frederic@jolliton.com>
>> Hello all,
>> There is something that bothers me about ephemeral nodes.
>> I need to create some locks using Zookeeper. I followed the "official"
>> recipe, except that I don't use the EPHEMERAL flag. The reason for that
>> is that I don't know how I should proceed if the connection to Zookeeper
>> ensemble is ever lost. But otherwise, everything works nicely.
>> The EPHEMERAL flag is useful if the owner of the lock disappears (exiting
>> abnormally). From the point of view of the Zookeeper ensemble, when the
>> connection times out (or is closed explicitly), the lock is released.
>> That's great.
>> However, if I lose the connection temporarily (network outage), the
>> Zookeeper ensemble again sees the connection timing out... but actually
>> the owner of the lock is still there doing some work on the locked
>> resource. But the lock is released by Zookeeper anyway.
>> How should this case be handled?
>> All I can see is that the owner can only verify that the lock was no
>> longer owned because releasing the lock will give a Session Expired
>> error (assuming we retry reconnecting while we get a Connection Loss
>> error) or because of an event sent at some point because the connection
>> was also closed automatically on the client side by libkeeper (not sure
>> about this last point). Knowing that the connection expired necessarily
>> means that the lock was lost, but by then it may be too late.
>> I mean that there is a short time lapse where the process that owns the
>> lock has not tried to release it yet and thus doesn't know it lost it,
>> and another process was able to acquire it too in the meantime. This is
>> a big problem.
>> That's why I avoid the EPHEMERAL flag for now, and plan to rely on a
>> periodic cleanup task to drop locks no longer owned by any process (a
>> task which is not trivial either).
>> I would appreciate any tips to handle such a situation in a better way.
>> What is your experience in such cases?
>> Regards,
>> --
>> Frédéric Jolliton
>> Outscale SAS