curator-dev mailing list archives

From "Doug Jones (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (CURATOR-62) Leader Election Deadlock
Date Thu, 14 May 2015 20:54:00 GMT

    [ https://issues.apache.org/jira/browse/CURATOR-62?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14544345#comment-14544345 ]

Doug Jones commented on CURATOR-62:
-----------------------------------

I was able to retest this and it appears to be fixed on master.

I created a better test at the version where I first noticed this bug:
https://github.com/dfjones/curator/tree/CURATOR-62

This test fails 100% of the time for me. (It might be a good test to include, but I had to
modify LeaderSelector to get the interrupt to occur at the exact time to cause the issue.)

I ported this test to the latest code on master here:
https://github.com/dfjones/curator/tree/CURATOR-62-Latest

This test passes 100% of the time for me. I believe this issue can be closed. It appears the
fix for CURATOR-202 also fixes this issue.

> Leader Election Deadlock
> ------------------------
>
>                 Key: CURATOR-62
>                 URL: https://issues.apache.org/jira/browse/CURATOR-62
>             Project: Apache Curator
>          Issue Type: Bug
>          Components: Recipes
>    Affects Versions: 2.2.0-incubating
>            Reporter: Doug Jones
>            Assignee: Jordan Zimmerman
>            Priority: Minor
>             Fix For: awaiting-response
>
>
> I've noticed that it is possible for a leader election to deadlock if a thread is interrupted
> while it is trying to acquire the mutex for the election.
> I've created a forced example of this here: https://github.com/dfjones/curator/commit/544220b1e6b51c2718a7d3511a74962ff1c5ff48
> You can see the deadlock by using my modified code and running the LeaderSelectorExample.
> Some leaders may execute, but on my system I eventually see deadlock. Note that I only see
> deadlock when running against a remote ZK server rather than the embedded test server. I'm
> using Zookeeper 3.4.5 on Mac OS X 10.8.4.
> From what I can tell by inspecting the ZK state/watching in the debugger, the thread
> that is interrupted is able to successfully create the lock object in ZK. However, due to
> the interrupt an exception is generated and LockInternals#internalLockLoop never runs. Later,
> in LeaderSelector#doWork when mutex.release() is called, this fails at the check for lockData.
> Once this occurs, the lock object in ZK is the oldest and will cause deadlock.
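
The sequence described in the quoted report can be sketched with a small in-memory model.
This is only an illustration under stated assumptions: ToyServer and ToyLock are
hypothetical stand-ins written for this sketch, not Curator's LockInternals or
InterProcessMutex, and the cleanUpOnFailure flag is an assumed remedy for comparison,
not a description of the actual CURATOR-202 change.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Toy model only: none of these classes are part of Apache Curator.
class OrphanedLockDemo {

    /** Hypothetical stand-in for the sequential lock nodes stored in ZooKeeper. */
    static final class ToyServer {
        private final TreeMap<Integer, String> nodes = new TreeMap<>();
        private int nextSeq = 0;

        int create(String owner) { nodes.put(nextSeq, owner); return nextSeq++; }
        void delete(int seq) { nodes.remove(seq); }
        boolean isOldest(int seq) { return !nodes.isEmpty() && nodes.firstKey() == seq; }
    }

    /** Hypothetical lock that records local lockData only after the node wins. */
    static final class ToyLock {
        private final ToyServer server;
        private final boolean cleanUpOnFailure;
        private final Map<String, Integer> lockData = new HashMap<>();

        ToyLock(ToyServer server, boolean cleanUpOnFailure) {
            this.server = server;
            this.cleanUpOnFailure = cleanUpOnFailure;
        }

        boolean acquire(String client, boolean interruptedAfterCreate) throws InterruptedException {
            int seq = server.create(client);          // node now exists on the server
            try {
                if (interruptedAfterCreate) {
                    // Simulates the interrupt arriving before the lock loop runs.
                    throw new InterruptedException("interrupted after node creation");
                }
                if (!server.isOldest(seq)) {
                    server.delete(seq);               // a real lock would wait; the toy gives up
                    return false;
                }
                lockData.put(client, seq);            // local bookkeeping happens last
                return true;
            } catch (InterruptedException e) {
                if (cleanUpOnFailure) {
                    server.delete(seq);               // assumed remedy: remove the orphan
                }
                throw e;
            }
        }

        void release(String client) {
            Integer seq = lockData.remove(client);
            if (seq == null) {
                throw new IllegalMonitorStateException("no lockData for " + client);
            }
            server.delete(seq);
        }
    }

    /** Returns true if client B can still acquire after client A was interrupted. */
    static boolean secondClientCanLead(boolean cleanUpOnFailure) {
        ToyLock lock = new ToyLock(new ToyServer(), cleanUpOnFailure);
        try {
            lock.acquire("A", true);
        } catch (InterruptedException expected) {
            // The caller proceeds to release, as a finally block would.
        }
        try {
            lock.release("A");
        } catch (IllegalMonitorStateException e) {
            // release cannot find lockData, so A's node is never deleted here.
        }
        try {
            return lock.acquire("B", false);
        } catch (InterruptedException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        System.out.println("without cleanup: " + secondClientCanLead(false));
        System.out.println("with cleanup:    " + secondClientCanLead(true));
    }
}
```

In the toy model, the orphaned node from client A stays the oldest entry when no cleanup
happens, so client B never sees its own node at the head of the queue. Deleting the node
on the failure path lets B acquire normally.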



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
