curator-user mailing list archives

From John Vines <vi...@apache.org>
Subject Re: InterProcessMutex doesn't detect deletion of lock file
Date Tue, 20 Jan 2015 15:46:36 GMT
Sounds similar to https://issues.apache.org/jira/browse/CURATOR-171

On Tue, Jan 20, 2015 at 10:23 AM, Michael Peterson <quux00@gmail.com> wrote:

> Hi,
>
> I am fairly new to Curator and ZK, so apologies if this has been asked
> before.  I haven't found anything yet that addresses it.
>
> My ZK use case is very simple - HA failover.  Two processes get launched -
> one does the work and the other waits to take over in case the other dies
> or otherwise stops working.
>
> The Curator InterProcessMutex fits the bill.  However, without too much
> effort I've found a scenario where Process A and Process B both think
> they are the owner at the same time and start doing the work, causing data
> corruption.
>
> The scenario is simply to delete the lock file, which I did via the ZK CLI
> (zkCli.sh).  The problem is that the InterProcessMutex currently holding
> the lock doesn't seem to notice that the lock file got deleted, but the
> InterProcessMutex in the waiting (failover) process *does* notice and
> creates a new lock and starts doing work.
>
> Does the InterProcessMutex set a watch on the lock file it creates?  If
> not, why not?
>
>
> Idea #1:
>
> I tried setting all the Listeners I could figure out how to set to detect
> the NodeDeleted event:
>
> - CuratorListener
> - ConnectionStateListener
> - UnhandledErrorListener
>
> but none get signaled when I manually delete the lock file.
>
>
> Idea #2:
>
> Is the solution to set my own watch on the lock file that the IPMutex
> created?  If so, I see that one way to get the file name of the lock is to
> call InterProcessMutex#getParticipantNodes().  But the problem is that,
> it seems, there can be more than one lock file:
>
>     [zk: localhost:2181(CONNECTED) 7] ls /XXX/masterlock
>     [_c_c1dc399d-b6e4-4051-bd5c-2e300e62bc58-lock-0000000003,
>      _c_bf5de8b2-ed33-4f89-a737-4061f2072c3f-lock-0000000000]
>
>     [zk: localhost:2181(CONNECTED) 37] ls /XXX/masterlock
>     [_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018, \
>      _c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012, \
>      _c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017, \
>      _c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016]
>
> And it seems that I can't use the one with the smallest sequential lock
> number, because the smallest one might be hanging around from a crashed
> lockholder and hasn't expired yet - that is the case in the above example:
> lock-0000000012 is just waiting to be expired after a crash.
>
> So I don't know how to tell which lock is "mine" to set a watch on using
> that method.
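
To make the ordering problem concrete, here is a minimal sketch in plain Java (no Curator dependency) that sorts lock node names by their sequence suffix. The class and method names (`LockNodeNames`, `sequenceOf`, `sortBySequence`) are hypothetical helpers, not part of Curator, and the sketch assumes every name ends in `-<digits>` as in the listings above.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not a Curator class: orders lock node names by the
// numeric sequence suffix that ZooKeeper appends to sequential nodes.
final class LockNodeNames {
    private LockNodeNames() {}

    // Parses the digits after the last '-', e.g. "...-lock-0000000012" -> 12.
    // Throws NumberFormatException if the name does not end in "-<digits>".
    static long sequenceOf(String nodeName) {
        int dash = nodeName.lastIndexOf('-');
        return Long.parseLong(nodeName.substring(dash + 1));
    }

    // Returns a new list sorted by ascending sequence number; the input
    // list is left unmodified.
    static List<String> sortBySequence(List<String> nodeNames) {
        List<String> sorted = new ArrayList<>(nodeNames);
        sorted.sort(Comparator.comparingLong(LockNodeNames::sequenceOf));
        return sorted;
    }
}
```

Applied to the four names in the second listing, the first element after sorting is the node ending in `lock-0000000012`, which is the stale node described above. Sorting alone therefore cannot identify which node belongs to the current process.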
>
>
>
> Idea #3:
>
> I see that the InterProcessMutex also takes an optional
> `LockInternalsDriver` argument.  I looked into that code and there I see
> that it has access to the lock file name.  In addition, in the
> `getsTheLock` method it creates a PredicateResults object with a
> `pathToWatch` arg, which sounds promising, but in the default impl with my
> setup that pathToWatch is null.
>
> So I then created my own CustomLockInternalsDriver and put the lock-file
> name in pathToWatch (not sure that would work), but when I set
> `pathToWatch` to the actual lock path, still nothing happens when I delete
> the file.
>
> So then I recorded the path to my lock in the CustomLockInternalsDriver so
> I could get it in my mainline code and set a WATCH manually/myself.  That
> ends up working.  But that's a lot of work and it's not at all clear what
> the right solution is and whether it is dangerous to fiddle with creating
> my own LockInternalsDriver impl.
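
Once a manually set watch fires, the mainline code still has to stop working. A minimal sketch of that part, in plain Java with no Curator or ZooKeeper calls: a shared flag that the watch callback clears and the work loop checks before each unit of work. The class `LockOwnership` and its methods are hypothetical; the assumption is that `onLockNodeDeleted()` would be invoked from whatever watcher is registered on the lock path when it sees a NodeDeleted event.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical helper, not a Curator class: tracks whether this process
// still believes it holds the lock.
final class LockOwnership {
    private final AtomicBoolean held = new AtomicBoolean(false);

    // Call after the mutex has been acquired.
    void onLockAcquired() { held.set(true); }

    // Assumed to be called from the watch callback on a NodeDeleted event.
    void onLockNodeDeleted() { held.set(false); }

    boolean isHeld() { return held.get(); }

    // Runs up to maxUnits units of work, re-checking ownership before each
    // one. Returns the number of units that were actually run.
    int runWhileHeld(Runnable unitOfWork, int maxUnits) {
        int done = 0;
        while (done < maxUnits && held.get()) {
            unitOfWork.run();
            done++;
        }
        return done;
    }
}
```

This only narrows the window: a unit of work already in progress when the node is deleted still runs to completion, so both processes can overlap briefly.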
>
> What is the right way to solve this issue?
>
>
> --- How to REPRODUCE ---
>
> Here's a link to a gist with my test code:
> https://gist.github.com/quux00/f6be8fe223a7832ef514
> Also a gist to my CustomLockInternalsDriver:
> https://gist.github.com/quux00/ab37cedc46cb5368c853
>
> Start up two instances of that code. One will indicate it is "working" and
> the other "waiting". I then use zkCli.sh to delete the file:
>
>     $ ./zkCli.sh
>     [zk: localhost:2181(CONNECTED) 111] ls /XXX/masterlock
>     [_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006]
>     [zk: localhost:2181(CONNECTED) 112] delete /XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006
>     [zk: localhost:2181(CONNECTED) 113] ls /XXX/masterlock
>     []
>
> The "waiting" process will now create a new lock file and now both
> processes are "working".
>
> Thank you,
> Michael
>
>
