hadoop-common-issues mailing list archives

From "Todd Lipcon (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HADOOP-8220) ZKFailoverController doesn't handle failure to become active correctly
Date Tue, 27 Mar 2012 17:04:27 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-8220?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13239641#comment-13239641 ]

Todd Lipcon commented on HADOOP-8220:

I'll add a new test to the ActiveStandbyElector-specific code for this. I was testing it via
the "integration test", but you're right that adding to the unit tests makes sense too.

bq. How does NPE occur when the elector makes sure the client is recreated upon rejoining
the election? Which zkClient are you talking about?

The NPE occurred in the previous code because we had the following sequence:
- createNode succeeded
- called ZKFC becomeActive() callback
-- becomeActive() throws exception
-- ZKFC had a catch() clause which called quitElection() (it turned out this wasn't the right
--- quitElection() nulled out zkClient
- ActiveStandbyElector called monitorNode(), which tried to use zkClient, which had just been
nulled out.
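
The ordering problem in the sequence above can be reproduced with a toy model. This is only a sketch: ToyElector, FakeZkClient, Callback, and rejoinElection() are hypothetical stand-ins, not the real ActiveStandbyElector or ZooKeeper APIs, and the "new flow" is an assumed shape of the fix (the elector handles the failure itself and never reaches monitorNode() with a null client).

```java
public class ElectorNpeSketch {
    /** Hypothetical stand-in for the ZooKeeper client handle. */
    static final class FakeZkClient {
        boolean exists(String path) { return true; }
    }

    /** Toy elector: only the pieces needed to show the ordering problem. */
    static final class ToyElector {
        private FakeZkClient zkClient = new FakeZkClient();

        void quitElection() { zkClient = null; }                 // nulls out the client
        void rejoinElection() { zkClient = new FakeZkClient(); } // recreates the client
        boolean monitorNode() { return zkClient.exists("/lock"); }
    }

    /** Hypothetical callback interface standing in for the ZKFC side. */
    interface Callback { void becomeActive() throws Exception; }

    /** Old ordering: the callback owner quits the election, then the elector carries on. */
    static boolean oldFlowHitsNpe() {
        ToyElector elector = new ToyElector();
        Callback cb = () -> { throw new Exception("service failed to become active"); };
        try {
            cb.becomeActive();
        } catch (Exception e) {
            elector.quitElection();   // what the catch clause on the ZKFC side did
        }
        try {
            elector.monitorNode();    // elector continues, unaware that the client is gone
            return false;
        } catch (NullPointerException npe) {
            return true;
        }
    }

    /** Assumed shape of the fix: the elector owns the error handling. */
    static boolean newFlowRecovers() {
        ToyElector elector = new ToyElector();
        Callback cb = () -> { throw new Exception("service failed to become active"); };
        try {
            cb.becomeActive();
            elector.monitorNode();    // only reached when becomeActive() succeeded
        } catch (Exception e) {
            elector.quitElection();
            elector.rejoinElection(); // client is recreated before any further use
        }
        return elector.monitorNode();
    }

    public static void main(String[] args) {
        System.out.println("old flow hits NPE: " + oldFlowHitsNpe());
        System.out.println("new flow recovers: " + newFlowRecovers());
    }
}
```

Keeping the quit and the follow-up steps in one place means no caller can null out the client between two steps of the same request.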

The new behavior avoids this, since the error handling now lives in ActiveStandbyElector itself.
This makes it easier to get the right semantics.

bq. What is the purpose of adding the sleep? Could you please elaborate?

Without the sleep, it will retry becoming active in a tight loop. This generates a lot
of log spew and has little actual benefit. If instead we retry only once a second, then (a)
the logs are more readable, and (b) if there is another StandbyNode in the cluster, it will
get a chance to try to become active.
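
As a rough illustration of the difference, a retry loop with a pause between attempts might look like the following. This is a generic sketch, not the code from the patch: retryUntilActive, the attempt limit, and the BooleanSupplier standing in for the "try to become active" step are all assumptions made for the example.

```java
import java.util.function.BooleanSupplier;

public class RetryWithSleepSketch {
    /**
     * Calls tryBecomeActive until it succeeds or maxAttempts is reached,
     * sleeping between failed attempts instead of spinning.
     * Returns the attempt number that succeeded, or -1 on give-up or interrupt.
     */
    static int retryUntilActive(BooleanSupplier tryBecomeActive,
                                int maxAttempts, long sleepMillis) {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            if (tryBecomeActive.getAsBoolean()) {
                return attempt;
            }
            if (attempt < maxAttempts) {
                try {
                    // The pause keeps the log readable and leaves a window
                    // in which another node can try to become active.
                    Thread.sleep(sleepMillis);
                } catch (InterruptedException e) {
                    Thread.currentThread().interrupt(); // preserve interrupt status
                    return -1;
                }
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        int[] calls = {0};
        // Hypothetical service that fails twice before becoming active.
        int result = retryUntilActive(() -> ++calls[0] >= 3, 5, 1000L);
        System.out.println("became active on attempt " + result);
    }
}
```

With sleepMillis set to 0 the same loop degenerates into the tight retry described above.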

I will add a comment to this effect in the code.
> ZKFailoverController doesn't handle failure to become active correctly
> ----------------------------------------------------------------------
>                 Key: HADOOP-8220
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8220
>             Project: Hadoop Common
>          Issue Type: Bug
>          Components: ha
>    Affects Versions: 0.23.3, 0.24.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>            Priority: Critical
>         Attachments: hadoop-8220.txt
> The ZKFC doesn't properly handle the case where the monitored service fails to become
> active. Currently, it catches the exception and logs a warning, but then continues on, after
> calling quitElection(). This causes an NPE when it later tries to use the same zkClient instance
> while handling that same request. There is a test case, but the test case doesn't ensure that
> the node that had the failure is later able to recover properly.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira