curator-user mailing list archives

From Jordan Zimmerman <jor...@jordanzimmerman.com>
Subject Re: InterProcessMutex doesn't detect deletion of lock file
Date Tue, 20 Jan 2015 16:42:21 GMT
But manually deleting the lock node is not normal behavior. It should never happen in production.
Can you explain the scenario in more detail? 

-JZ



On January 20, 2015 at 10:47:20 AM, John Vines (vines@apache.org) wrote:

Sounds similar to https://issues.apache.org/jira/browse/CURATOR-171

On Tue, Jan 20, 2015 at 10:23 AM, Michael Peterson <quux00@gmail.com> wrote:
Hi,

I am fairly new to Curator and ZK, so apologies if this has been asked before.  I haven't
found anything yet that addresses it.

My ZK use case is very simple - HA failover.  Two processes get launched - one does the work
and the other waits to take over in case the other dies or otherwise stops working.

The Curator InterProcessMutex fits the bill.  However, without too much effort I've found
a scenario where Process A and Process B both think they are the owner at the same time
and start doing the work, causing data corruption.

The scenario is simply to delete the lock file, which I did via the ZK CLI (zkCli.sh).  The
problem is that the InterProcessMutex currently holding the lock doesn't seem to notice that
the lock file got deleted, but the InterProcessMutex in the waiting (failover) process *does*
notice and creates a new lock and starts doing work.

Does the InterProcessMutex set a watch on the lock file it creates?  If not, why not?


Idea #1:

I tried setting all the Listeners I could figure out how to set to detect the NodeDeleted event:

- CuratorListener
- ConnectionStateListener
- UnhandledErrorListener

but none get signaled when I manually delete the lock file.


Idea #2:

Is the solution to set my own watch on the lock file that the IPMutex created?  If so, I
see that one way to get the file name of the lock is to call InterProcessMutex#getParticipantNodes().
But the problem is that there can be more than one lock file, it seems:

    [zk: localhost:2181(CONNECTED) 7] ls /XXX/masterlock
    [_c_c1dc399d-b6e4-4051-bd5c-2e300e62bc58-lock-0000000003, _c_bf5de8b2-ed33-4f89-a737-4061f2072c3f-lock-0000000000]

    [zk: localhost:2181(CONNECTED) 37] ls /XXX/masterlock
    [_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018, \
     _c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012, \
     _c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017, \
     _c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016]

And it seems that I can't use the one with the smallest sequential lock number, because the
smallest one might be hanging around from a crashed lockholder and it has not expired yet - that
is the case in the above example: lock-0000000012 is just waiting to be expired after a crash.

So I don't know how to tell which lock is "mine" to set a watch on using that method.
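
To make the ambiguity concrete, here is a minimal sketch that orders the child names from the second listing above by their sequence suffix. It assumes the names follow the `<prefix>-lock-<digits>` pattern seen in those listings. The class and method names (`LockNodeNames`, `sequenceOf`, `sortedBySequence`) are hypothetical helpers and not part of Curator. It only sorts names; it cannot tell a live holder from a stale node left by a crashed session, which is exactly the problem described.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Curator: orders lock node names by the
// sequence number that appears after "-lock-" in the listings above.
class LockNodeNames {
    private static final String MARKER = "-lock-";

    // Extracts the numeric suffix, e.g. "..._lock-0000000012" yields 12.
    static int sequenceOf(String nodeName) {
        int idx = nodeName.lastIndexOf(MARKER);
        if (idx < 0) {
            throw new IllegalArgumentException("not a lock node name: " + nodeName);
        }
        return Integer.parseInt(nodeName.substring(idx + MARKER.length()));
    }

    // Returns a sorted copy, lowest sequence number first; the input is untouched.
    static List<String> sortedBySequence(List<String> children) {
        List<String> copy = new ArrayList<>(children);
        copy.sort(Comparator.comparingInt(LockNodeNames::sequenceOf));
        return copy;
    }

    public static void main(String[] args) {
        List<String> children = Arrays.asList(
            "_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018",
            "_c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012",
            "_c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017",
            "_c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016");
        // The first entry is the lock-0000000012 node, which in the scenario
        // above is the stale node from the crashed holder, not the live owner.
        System.out.println(sortedBySequence(children).get(0));
    }
}
```

So the lowest sequence number alone is not enough to identify "my" node; the process would also need to know the exact path it created.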



Idea #3:

I see that the InterProcessMutex also takes an optional `LockInternalsDriver` argument. 
I looked into that code and there I see that it has access to the lock file name.  In addition,
in the `getsTheLock` method it creates a PredicateResults object with a `pathToWatch` arg,
which sounds promising, but in the default impl with my setup that pathToWatch is null. 

So I then created my own CustomLockInternalsDriver and put the lock-file name in pathToWatch
(not sure that would work), but when I set `pathToWatch` to the actual lock path, still nothing
happens when I delete the file.

So then I recorded the path to my lock in the CustomLockInternalsDriver so I could get it
in my mainline code and set a WATCH manually/myself.  That ends up working.  But that's
a lot of work and it's not at all clear what the right solution is and whether it is dangerous
to fiddle with creating my own LockInternalsDriver impl.

What is the right way to solve this issue?


--- How to REPRODUCE ---

Here's a link to a gist with my test code:   https://gist.github.com/quux00/f6be8fe223a7832ef514
Also a gist to my CustomLockInternalsDriver: https://gist.github.com/quux00/ab37cedc46cb5368c853

Start up two instances of that code. One will indicate it is "working" and the other "waiting".
I then use zkCli.sh to delete the file:

    $ ./zkCli.sh
    [zk: localhost:2181(CONNECTED) 111] ls /XXX/masterlock
    [_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006]
    [zk: localhost:2181(CONNECTED) 112] delete /XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006
    [zk: localhost:2181(CONNECTED) 113] ls /XXX/masterlock
    []

The "waiting" process will now create a new lock file and now both processes are "working".

Thank you,
Michael


