hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8245) Fix flakiness in TestZKFailoverController
Date Wed, 04 Apr 2012 03:19:19 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8245?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13245991#comment-13245991
] 

Todd Lipcon commented on HADOOP-8245:
-------------------------------------

For problem #1, the solution is the same as is already done in some other test cases. We just
need to add a workaround to clear the ZK MBeans before running the tearDown method. It's a
hack, but in the absense of a fix for ZOOKEEPER-1438, it's about all we can do.

I spent some time investigating problem #2. The bug is as follows:
- these test cases create a new ActiveStandbyElector, and call {{ActiveStandbyElector.ensureBaseNode()}}
on it before running the main body of the tests. Although they don't call {{joinElection()}},
the creation of the elector does create a {{zkClient}} object with an associated Watcher.
- in the {{testZookeeperFailure}} test case, we shut down and restart ZK. This causes the
above Watcher instance to fire its Disconnected and then Connected events. There was a bug
in the handling of the Connected event that would cause it to re-monitor the lock znode regardless
of whether it was previously in the election.
- So, when ZK comes back up, there was not two but *three* electors racing for the lock. However,
two of the electors actually corresponded to the same dummy service. In some cases this race
would be resolved in such a way that the test timed out.

I don't think this is a problem in practice, since the "formatZK" call runs in its own JVM
in the current code. However, it's worth fixing to get the tests to not be flaky, and to have
a more reasonable behavior. There are several fixes to be done:
- Add extra asserts for {{wantToBeInElection}} to catch cases where we might accidentally
re-join the election when we weren't supposed to be in it.
- Fix the handling of the "Connected" event to only re-join if the elector wants to be in
the election
- Cause exceptions thrown by watcher callbacks to be propagated back as fatal errors

Will post a patch momentarily.

                
> Fix flakiness in TestZKFailoverController
> -----------------------------------------
>
>                 Key: HADOOP-8245
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8245
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: Auto Failover (HDFS-3042)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Minor
>
> When I loop TestZKFailoverController, I occasionally see two types of failures:
> 1) the ZK JMXEnv issue (ZOOKEEPER-1438)
> 2) TestZKFailoverController.testZooKeeperFailure fails with a timeout
> This JIRA is for fixes for these issues.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message