curator-dev mailing list archives

From "Doug Jones (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-62) Leader Election Deadlock
Date Tue, 08 Oct 2013 01:52:42 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13788816#comment-13788816 ]

Doug Jones commented on CURATOR-62:
-----------------------------------

I found this bug because I've run into it in production. Perhaps I have a weird use
case, but what I'm essentially doing is running an election on a fixed schedule (say every
hour). When its work is finished, the leader signals for shutdown and then exits. There's
a race between another thread trying to become the leader and the handling of the shutdown
event, and it can result in the deadlock described here for future elections. This wouldn't be
an issue if LeaderSelector#close were completely reliable.

I can probably work around this bug, but it's definitely an issue for repeated elections.

> Leader Election Deadlock
> ------------------------
>
>                 Key: CURATOR-62
>                 URL: https://issues.apache.org/jira/browse/CURATOR-62
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>            Reporter: Doug Jones
>            Assignee: Jordan Zimmerman
>            Priority: Minor
>             Fix For: TBD
>
>
> I've noticed that it is possible for a leader election to deadlock if a thread is interrupted
> while it is trying to acquire the mutex for the election.
> I've created a forced example of this here: https://github.com/dfjones/curator/commit/544220b1e6b51c2718a7d3511a74962ff1c5ff48
> You can see the deadlock by using my modified code and running the LeaderSelectorExample.
> Some leaders may execute, but on my system I eventually see deadlock. Note that I only see
> deadlock when running against a remote ZK server rather than the embedded test server. I'm
> using ZooKeeper 3.4.5 on Mac OS X 10.8.4.
> From what I can tell by inspecting the ZK state and watching in the debugger, the thread
> that is interrupted is able to successfully create the lock object in ZK. However, due to
> the interrupt an exception is generated and LockInternals#internalLockLoop never runs. Later,
> in LeaderSelector#doWork, when mutex.release() is called, it fails at the check for lockData.
> Once this occurs, the lock object in ZK is the oldest and will cause deadlock.
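
The sequence described above can be modelled without ZooKeeper at all. What follows is
only a toy sketch under stated assumptions, not Curator's code: FakeServer, FakeMutex,
tryAcquire and simulate are hypothetical names invented for illustration. FakeServer stands
in for the ensemble's sequential lock nodes, and FakeMutex stands in for a client-side
mutex that records ownership locally only after the node has been created.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

public class OrphanLockDemo {
    /** Hypothetical stand-in for the server: the lowest sequence number owns the lock. */
    static final class FakeServer {
        private final TreeMap<Integer, String> nodes = new TreeMap<>();
        private int nextSeq = 0;

        synchronized int createSequential(String owner) {
            int seq = nextSeq++;
            nodes.put(seq, owner);
            return seq;
        }

        synchronized void delete(int seq) {
            nodes.remove(seq);
        }

        synchronized boolean isOldest(int seq) {
            return !nodes.isEmpty() && nodes.firstKey() == seq;
        }

        synchronized int nodeCount() {
            return nodes.size();
        }
    }

    /** Hypothetical stand-in for a client-side mutex with per-owner bookkeeping. */
    static final class FakeMutex {
        private final FakeServer server;
        private final Map<String, Integer> lockData = new HashMap<>();

        FakeMutex(FakeServer server) {
            this.server = server;
        }

        /** Returns true if the lock was obtained, false if an older node is ahead of us. */
        boolean tryAcquire(String owner, boolean interruptAfterCreate) throws InterruptedException {
            int seq = server.createSequential(owner);
            if (interruptAfterCreate) {
                // The node now exists on the server, but it is never recorded locally.
                throw new InterruptedException("interrupted after node creation");
            }
            if (!server.isOldest(seq)) {
                // A real lock recipe would wait here; the model withdraws and reports failure.
                server.delete(seq);
                return false;
            }
            lockData.put(owner, seq);
            return true;
        }

        void release(String owner) {
            Integer seq = lockData.remove(owner);
            if (seq == null) {
                throw new IllegalMonitorStateException("You do not own the lock: " + owner);
            }
            server.delete(seq);
        }
    }

    /** Returns {releaseFailed, orphanNodeRemains, nextContenderBlocked}. */
    static boolean[] simulate() {
        FakeServer server = new FakeServer();
        FakeMutex mutex = new FakeMutex(server);

        try {
            mutex.tryAcquire("A", true);
        } catch (InterruptedException expected) {
            // Contender A was interrupted after its node was created.
        }

        boolean releaseFailed = false;
        try {
            mutex.release("A");
        } catch (IllegalMonitorStateException e) {
            releaseFailed = true; // no local record, so the node is never deleted
        }

        boolean orphanNodeRemains = server.nodeCount() == 1;

        boolean nextContenderBlocked;
        try {
            nextContenderBlocked = !mutex.tryAcquire("B", false);
        } catch (InterruptedException e) {
            nextContenderBlocked = false;
        }
        return new boolean[] {releaseFailed, orphanNodeRemains, nextContenderBlocked};
    }

    public static void main(String[] args) {
        boolean[] r = simulate();
        System.out.println("release failed: " + r[0]);
        System.out.println("orphan node remains: " + r[1]);
        System.out.println("next contender blocked: " + r[2]);
    }
}
```

In this model, the only way to unblock contender B is for something other than release()
to delete A's node. In a real deployment that would presumably correspond to the owning
session ending, which is an assumption about ephemeral lock nodes rather than something
stated in this report.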



--
This message was sent by Atlassian JIRA
(v6.1#6144)
