curator-user mailing list archives

From Michael Peterson <quu...@gmail.com>
Subject InterProcessMutex doesn't detect deletion of lock file
Date Tue, 20 Jan 2015 15:23:13 GMT
Hi,

I am fairly new to Curator and ZK, so apologies if this has been asked
before.  I haven't found anything yet that addresses it.

My ZK use case is very simple - HA failover.  Two processes get launched -
one does the work and the other waits to take over in case the first dies
or otherwise stops working.

The Curator InterProcessMutex fits the bill.  However, without too much
effort I've found a scenario where Process A and Process B both think
they are the owner at the same time and start doing the work, causing data
corruption.

The scenario is simply to delete the lock file, which I did via the ZK CLI
(zkCli.sh).  The problem is that the InterProcessMutex currently holding
the lock doesn't seem to notice that the lock file got deleted, but the
InterProcessMutex in the waiting (failover) process *does* notice and
creates a new lock and starts doing work.

Does the InterProcessMutex set a watch on the lock file it creates?  If
not, why not?


Idea #1:

I tried setting all the Listeners I could figure out how to set to detect the
NodeDeleted event:

- CuratorListener
- ConnectionStateListener
- UnhandledErrorListener

but none get signaled when I manually delete the lock file.


Idea #2:

Is the solution to set my own watch on the lock file that the IPMutex
created?  If so, I see that one way to get the file name of the lock is to
call InterProcessMutex#getParticipantNodes().  But the problem is that
there can be more than one lock file, it seems:

    [zk: localhost:2181(CONNECTED) 7] ls /XXX/masterlock
    [_c_c1dc399d-b6e4-4051-bd5c-2e300e62bc58-lock-0000000003, \
     _c_bf5de8b2-ed33-4f89-a737-4061f2072c3f-lock-0000000000]

    [zk: localhost:2181(CONNECTED) 37] ls /XXX/masterlock
    [_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018, \
     _c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012, \
     _c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017, \
     _c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016]

And it seems that I can't use the one with the smallest sequential lock
number, because the smallest one might be hanging around from a crashed
lockholder and it has not expired yet - that is the case in the above example:
lock-0000000012 is just waiting to be expired after a crash.

So I don't know how to tell which lock is "mine" to set a watch on using
that method.
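
To make the ordering problem concrete, here is a minimal sketch in plain
Java (no Curator calls) that sorts the child names from the second listing
above by their sequence suffix.  The class LockNodeOrder and its methods
are hypothetical helpers written for illustration; they are not part of
Curator, and the sketch assumes every child name ends in "-" followed by
the zero-padded sequence digits, as in the listings.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;

// Hypothetical helper, not part of Curator: orders lock child names by the
// zero-padded sequence suffix seen in the zkCli listings.
class LockNodeOrder {

    // Parses the digits after the last '-' in a child name such as
    // "_c_<uuid>-lock-0000000012".
    static int sequenceOf(String childName) {
        int dash = childName.lastIndexOf('-');
        return Integer.parseInt(childName.substring(dash + 1));
    }

    // Returns a copy of the child names sorted by ascending sequence number.
    static List<String> sortBySequence(List<String> children) {
        List<String> sorted = new ArrayList<>(children);
        sorted.sort(Comparator.comparingInt(LockNodeOrder::sequenceOf));
        return sorted;
    }

    public static void main(String[] args) {
        // Child names copied from the second "ls /XXX/masterlock" listing.
        List<String> children = Arrays.asList(
            "_c_63490235-7ab6-461d-bab2-401d4439db4f-lock-0000000018",
            "_c_1e57c64e-b990-4f9a-96f9-fccf56c0421e-lock-0000000012",
            "_c_f09ee1e5-0e47-47a7-961e-d7745ffbfc28-lock-0000000017",
            "_c_2f9ebe06-b91c-4886-b916-34ff1fa83541-lock-0000000016");

        for (String name : sortBySequence(children)) {
            System.out.println(sequenceOf(name) + "  " + name);
        }
    }
}
```

With that listing, the first entry after sorting is the lock-0000000012
node, which is the stale node left by the crashed holder, so the lowest
sequence number alone does not identify the current owner's node.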



Idea #3:

I see that the InterProcessMutex also takes an optional
`LockInternalsDriver` argument.  I looked into that code and there I see
that it has access to the lock file name.  In addition, in the
`getsTheLock` method it creates a PredicateResults object with a
`pathToWatch` arg, which sounds promising, but in the default impl with my
setup that pathToWatch is null.

So I then created my own CustomLockInternalsDriver and set `pathToWatch`
to the actual lock path (not sure that would work), but still nothing
happens when I delete the file.

So then I recorded the path to my lock in the CustomLockInternalsDriver so
I could get it in my mainline code and set a watch on it myself.  That
ends up working.  But that's a lot of work and it's not at all clear what
the right solution is and whether it is dangerous to fiddle with creating
my own LockInternalsDriver impl.

What is the right way to solve this issue?


--- How to REPRODUCE ---

Here's a link to a gist with my test code:
https://gist.github.com/quux00/f6be8fe223a7832ef514
Also a gist to my CustomLockInternalsDriver:
https://gist.github.com/quux00/ab37cedc46cb5368c853

Start up two instances of that code. One will indicate it is "working" and
the other "waiting". I then use zkCli.sh to delete the file:

    $ ./zkCli.sh
    [zk: localhost:2181(CONNECTED) 111] ls /XXX/masterlock
    [_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006]
    [zk: localhost:2181(CONNECTED) 112] delete /XXX/masterlock/_c_fd2dcb51-d5e1-4f27-afdf-7a8f75c1b85b-lock-0000000006
    [zk: localhost:2181(CONNECTED) 113] ls /XXX/masterlock
    []

The "waiting" process will now create a new lock file and now both
processes are "working".

Thank you,
Michael
