curator-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Chris Jeris <cje...@brightcove.com>
Subject Re: InterProcessMutex acquire times out and then _succeeds_?
Date Thu, 17 Oct 2013 19:43:41 GMT
The issue seems to be caused by the fact that we have changes propagated by
a NodeCache on a different node, and the NodeCacheListener's implementation
attempts to lock the same lock.  Is this an instance of the "single
threaded watcher" problem you're talking about?

Here's an example that shows the problem:
https://gist.github.com/ystael/7030926

The sample output was generated by starting a single instance of the
program.  You can see a couple of cases where, although the MAIN THREAD has
released the lock, the NodeCacheListener thread which is trying to acquire
the lock doesn't succeed until its timeout has elapsed all the way ("took
5012").  It doesn't happen every time, but it happens a substantial
fraction of the time.

If I change the implementation of listener to wrap the body of nodeChanged
in a future, so that it executes in a separate thread, the problem goes
away.  We had thought from examining the implementation of NodeCache that
Listener bodies already execute in a separate background thread; is this
not the case?

thanks, Chris


On Thu, Oct 17, 2013 at 11:08 AM, Jordan Zimmerman <
jordan@jordanzimmerman.com> wrote:

> This sounds suspiciously like a misuse of a watcher somewhere. Are you
> doing other things with ZooKeeper? ZooKeeper watches are single threaded.
> If you have a watcher that blocks, etc. it will prevent other ZooKeeper
> recipes from functioning. See this Tech Note:
> https://cwiki.apache.org/confluence/display/CURATOR/TN1
>
> Other than that, I haven't heard of this problem. If you can provide a
> test case, that would help.
>
> -Jordan
>
> On Oct 17, 2013, at 7:47 AM, Chris Jeris <cjeris@brightcove.com> wrote:
>
> We have run into a knotty problem with InterProcessMutex, where calls to
> .acquire will expend their full timeout and then _succeed_ in acquiring the
> lock.  This generally happens when a different server process was the last
> one to hold the lock, but it does not happen every time that is the case.
>
> The lock is a single InterProcessMutex object per server process (= server
> machine), all on a single persistent Zookeeper node name (the object being
> access controlled is a single piece of shared state, whose data is not
> itself stored in ZK).  The problem arises in the context of a test suite
> where requests to our server cluster are issued serially, so there is
> basically no competing traffic on the lock, although there is traffic on
> the ZK cluster from other applications.  The frequency of acquires on this
> lock is not excessive (order 1 per second), and we are reasonably certain
> our client code is not holding the lock longer than it should.
>
> The problem does not seem to be sensitive to the exact value of the
> timeout.  If we set it to 15 seconds, we see lock acquires taking 15
> seconds and then succeeding; if we set it to 60 seconds, we see them taking
> 60 seconds and then succeeding.
>
> Right now we observe the problem with Curator 2.1.0 against both ZK 3.3.6
> and 3.4.5.
>
> Is this a known or familiar issue?  Does it sound like we're doing
> something wrong?
>
> thanks, Chris Jeris
> --
> Chris Jeris
> cjeris@brightcove.com
> freenode/twitter/github: ystael
>
>
>


-- 
Chris Jeris
cjeris@brightcove.com
freenode/twitter/github: ystael

Mime
View raw message