hadoop-common-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
Date Fri, 30 Mar 2012 19:28:28 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13242664#comment-13242664 ]

Todd Lipcon commented on HADOOP-8220:

bq. Any reason we shouldn't make SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE configurable?

Currently, ActiveStandbyElector doesn't take a Configuration object. I think many of its parameters
should be read from a Configuration instead, but I didn't want to widen the scope of
this change.
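
For reference, a follow-up that makes the sleep interval configurable might look roughly like the sketch below. This is only an illustration: the key name and the 1000 ms default are hypothetical, and a plain Map stands in for Hadoop's Configuration (whose getInt(name, defaultValue) the helper imitates) so that the snippet compiles on its own.

```java
import java.util.HashMap;
import java.util.Map;

class ElectorConfigSketch {
    // Hypothetical key name for illustration; not an existing Hadoop property.
    static final String SLEEP_AFTER_FAILURE_KEY = "ha.elector.sleep-after-failure.ms";
    // Assumed default, standing in for SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE.
    static final int SLEEP_AFTER_FAILURE_DEFAULT_MS = 1000;

    // Stand-in for Configuration#getInt(name, defaultValue):
    // returns the default when the key is unset.
    static int getInt(Map<String, String> conf, String name, int defaultValue) {
        String raw = conf.get(name);
        if (raw == null) {
            return defaultValue;
        }
        return Integer.parseInt(raw.trim());
    }

    // What a Configuration-aware elector would call once at construction time.
    static int sleepAfterFailureMs(Map<String, String> conf) {
        return getInt(conf, SLEEP_AFTER_FAILURE_KEY, SLEEP_AFTER_FAILURE_DEFAULT_MS);
    }

    public static void main(String[] args) {
        Map<String, String> conf = new HashMap<>();
        System.out.println(sleepAfterFailureMs(conf)); // default value
        conf.put(SLEEP_AFTER_FAILURE_KEY, "250");
        System.out.println(sleepAfterFailureMs(conf)); // overridden value
    }
}
```

The elector would read the value once in its constructor, so the rest of the class keeps using a plain int field.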

bq. There's some inconsistency in capitalization between "reJoinElection" and "rejoinElectionAfterFailureToBecomeActive"

Changed to consistently use "reJoin" to match the previously existing code.

bq. Might want to do a s/System.currentTimeMillis/Util.now/g

The {{Util}} class is in HDFS, but this code is in common. We don't seem to have an equivalent
in common.

bq. Any reason we shouldn't make LOG_INTERVAL_MS configurable?
It's just test code, so seemed unnecessary.

bq. Add @VisibleForTesting to sleepFor, since it would be private (and probably static) otherwise.
Maybe even add a comment saying why it's not static.
bq. Considering the comment says "after sleeping for a short period" in TestActiveStandbyElector#testFailToBecomeActive,
maybe also verify that sleepFor was called? Likewise in testFailToBecomeActiveAfterZKDisconnect.

Done. I made the overridden method keep a tally of the number of milliseconds slept, and added
asserts to the tests to make sure it slept for some time when rejoining.
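
The pattern described above can be sketched as follows. This is a simplified stand-in, not the code in the attached patch: the class names, the constant's value, and the body of the rejoin method are assumptions made for illustration, and the @VisibleForTesting annotation (from Guava) is only mentioned in a comment so the snippet has no external dependencies.

```java
class ElectorSketch {
    // Assumed value for illustration.
    static final int SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE = 1000;

    // Would carry @VisibleForTesting in the real code. It is protected and
    // non-static only so that a test subclass can override it.
    protected void sleepFor(int sleepMs) {
        if (sleepMs > 0) {
            try {
                Thread.sleep(sleepMs);
            } catch (InterruptedException e) {
                Thread.currentThread().interrupt();
            }
        }
    }

    // Simplified: back off briefly before rejoining the election.
    void reJoinElectionAfterFailureToBecomeActive() {
        sleepFor(SLEEP_AFTER_FAILURE_TO_BECOME_ACTIVE);
        // The real method would rejoin the election here.
    }
}

// Test double: records the requested sleep time instead of actually sleeping.
class ElectorForTest extends ElectorSketch {
    private long sleptMs = 0;

    @Override
    protected void sleepFor(int sleepMs) {
        sleptMs += sleepMs;
    }

    long getSleptMs() {
        return sleptMs;
    }
}
```

A test then triggers the failure path and asserts that getSleptMs() is greater than zero, which checks the backoff without slowing the test down.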
> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: auto-failover, ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt, hadoop-8220.txt
> The ZKFC doesn't properly handle the case where the monitored service fails to become
active. Currently, it catches the exception and logs a warning, but then continues on after
calling quitElection(). This causes an NPE when it later tries to use the same zkClient instance
while handling that same request. There is a test case, but it doesn't ensure that
the node that had the failure is later able to recover properly.


