hadoop-common-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Todd Lipcon (Updated) (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HADOOP-8247) Auto-HA: add a config to enable auto-HA, which disables manual FC
Date Fri, 06 Apr 2012 04:07:19 GMT

     [ https://issues.apache.org/jira/browse/HADOOP-8247?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Todd Lipcon updated HADOOP-8247:
--------------------------------

    Attachment: hadoop-8247.txt

Here's a preliminary patch for this issue. It still needs a little cleanup/javadoc/etc, but
wanted to make sure people agree this is the right direction before I finish it up.

Here's a summary of the change:

- Add a new flag dfs.ha.automatic-failover.enabled, which is set per-nameservice or globally
- Add a new RequestInfo structure as a parameter to all the HAServiceProtocol methods. This
currently just has one field, which indicates what type of client the request is on behalf
of. It can either be a user (manual CLI failover), ZKFC (auto failover), or USER_FORCE --
indicating that it's a user who wants to avoid this safety check.
- In the NN, if auto-failover is enabled, disallow HA requests from users. If it's not enabled,
disallow HA requests from ZKFCs.
- In the ZKFC, disallow startup if auto-failover is disabled


In addition to the unit tests, I ran the following manual tests, on a secure cluster.

1) did not enable auto failover config
2) ran failovers using haadmin command, succesfully
3) Tried to run bin/hdfs zkfc, got expected error:
{code}
12/04/05 20:53:38 INFO tools.DFSZKFailoverController: Failover controller configured for NameNode
nameserviceId1.nn1
12/04/05 20:53:38 FATAL ha.ZKFailoverController: Automatic failover is not enabled for NameNode
at todd-w510/127.0.0.1:8021. Please ensure that automatic failover is enabled in the configuration
before running the ZK failover controller.
{code}
4) Enabled auto-failover in my config, but left NNs running. Got error when the ZKFC tried
to make the local node active. TODO in future JIRA: it could abort at this point, when it
sees an AccessControlException, since it's indicative of misconfiguration.

5) Restarted NNs, so they picked up the new config.
6) Ran ZKFC, it successfully made one of the NNs active. Verified automatic failover behavior
by killing one of the NNs.
7) Ran manual failover command, got expected error:
{code}
12/04/05 20:58:31 ERROR ha.FailoverController: Unable to get service state for NameNode at
todd-w510/127.0.0.1:8022: Manual HA control for this NameNode is disallowed, because automatic
HA is enabled.
{code}

----

Open questions: should we allow the non-mutative commands like {{monitorHealth}} and {{getServiceState}}
to run when auto-failover is configured? My thinking is probably. If so, should we keep around
the RequestInfo parameter on those calls? Or only include RequestInfo for the calls that trigger
transitions?

                
> Auto-HA: add a config to enable auto-HA, which disables manual FC
> -----------------------------------------------------------------
>
>                 Key: HADOOP-8247
>                 URL: https://issues.apache.org/jira/browse/HADOOP-8247
>             Project: Hadoop Common
>          Issue Type: Improvement
>          Components: auto-failover, ha
>    Affects Versions: Auto Failover (HDFS-3042)
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: hadoop-8247.txt
>
>
> Currently, if automatic failover is set up and running, and the user uses the "haadmin
-failover" command, he or she can end up putting the system in an inconsistent state, where
the state in ZK disagrees with the actual state of the world. To fix this, we should add a
config flag which is used to enable auto-HA. When this flag is set, we should disallow use
of the haadmin command to initiate failovers. We should refuse to run ZKFCs when the flag
is not set. Of course, this flag should be scoped by nameservice.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

Mime
View raw message