hadoop-hdfs-issues mailing list archives

From "Mingjie Lai (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HDFS-2185) HA: HDFS portion of ZK-based FailoverController
Date Wed, 04 Apr 2012 20:07:23 GMT

    [ https://issues.apache.org/jira/browse/HDFS-2185?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13246676#comment-13246676 ]

Mingjie Lai commented on HDFS-2185:
-----------------------------------

(still posting comments here since the design doc is attached here)

@todd

Thanks for adding the manual failover section 2.7 in the design doc. 

However, I have some questions about what you described in 2.7.2:
- HAAdmin makes an RPC failoverToYou() to the target ZKFC
- target ZKFC sends an RPC concedeLock() to the currently active ZKFC.
- the active sends a transitionToStandby() RPC to its local node
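
To make the call chain above concrete, here is a minimal Java sketch that only records the
order in which the three RPCs fire. RpcChainSketch, Zkfc, and NameNode are hypothetical
in-process stand-ins, not Hadoop classes. The method names are taken from the steps above;
the real signatures in the design doc may differ.

```java
import java.util.ArrayList;
import java.util.List;

public class RpcChainSketch {
    // Records each simulated RPC in the order it is invoked.
    static final List<String> trace = new ArrayList<>();

    /** Hypothetical stand-in for a NameNode. */
    static class NameNode {
        void transitionToStandby() {
            trace.add("NN.transitionToStandby");
        }
    }

    /** Hypothetical stand-in for a ZKFC and the NameNode it manages. */
    static class Zkfc {
        final NameNode localNode = new NameNode();

        // Step 1: HAAdmin asks the target ZKFC to take over.
        void failoverToYou(Zkfc currentActive) {
            trace.add("targetZKFC.failoverToYou");
            // Step 2: the target asks the currently active ZKFC to give up the lock.
            currentActive.concedeLock();
        }

        void concedeLock() {
            trace.add("activeZKFC.concedeLock");
            // Step 3: the active ZKFC demotes its local NameNode.
            localNode.transitionToStandby();
        }
    }

    public static void main(String[] args) {
        Zkfc active = new Zkfc();
        Zkfc target = new Zkfc();
        target.failoverToYou(active); // plays the role of HAAdmin
        System.out.println(trace);
    }
}
```

Each hop in the trace is a separate process in a real deployment, which is where the
debugging concern below comes from.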

IMO the chain of RPCs is quite complicated and will not be easy to debug or troubleshoot in
operation, because you're trying to solve two problems, automatic and manual failover, in one
place: the ZKFC.

How about separating the 2 cases:
- add commands to haadmin to start/stop autofailover
- stop-autofailover requests all ZKFCs to exitElection
- start-autofailover requests all ZKFCs to enterElection
- haadmin is responsible for handling manual failover (as in the current implementation)
- admins can only perform manual failover when autofailover is stopped
- manual failover can be used to specify one particular active NN
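
A rough Java sketch of the flow proposed above, under several assumptions: Zkfc, NameNode, and
HaAdmin are hypothetical in-memory stand-ins rather than Hadoop classes, the start/stop
command names are only suggestions, and fencing plus all real RPC plumbing are left out.
It only shows the intended rule: manual failover is refused while any ZKFC is still in
the election.

```java
import java.util.Arrays;
import java.util.List;

public class ManualFailoverSketch {
    /** Hypothetical stand-in for a ZKFC's election endpoint. */
    static class Zkfc {
        boolean inElection = true;
        void exitElection() { inElection = false; }
        void enterElection() { inElection = true; }
    }

    /** Hypothetical stand-in for a NameNode's HA state. */
    static class NameNode {
        boolean active;
        NameNode(boolean active) { this.active = active; }
        void transitionToActive() { active = true; }
        void transitionToStandby() { active = false; }
    }

    /** Hypothetical client-side driver; all failover logic lives here. */
    static class HaAdmin {
        private final List<Zkfc> zkfcs;
        HaAdmin(List<Zkfc> zkfcs) { this.zkfcs = zkfcs; }

        void stopAutoFailover() {
            for (Zkfc z : zkfcs) z.exitElection();
        }

        void startAutoFailover() {
            for (Zkfc z : zkfcs) z.enterElection();
        }

        boolean autoFailoverRunning() {
            for (Zkfc z : zkfcs) {
                if (z.inElection) return true;
            }
            return false;
        }

        // Manual failover is only allowed once every ZKFC has left the election.
        void failover(NameNode from, NameNode to) {
            if (autoFailoverRunning()) {
                throw new IllegalStateException("stop autofailover before manual failover");
            }
            from.transitionToStandby();
            to.transitionToActive();
        }
    }

    public static void main(String[] args) {
        HaAdmin admin = new HaAdmin(Arrays.asList(new Zkfc(), new Zkfc()));
        NameNode nn1 = new NameNode(true);
        NameNode nn2 = new NameNode(false);

        admin.stopAutoFailover();
        admin.failover(nn1, nn2);   // admin picks nn2 as the new active
        admin.startAutoFailover();
        System.out.println("nn1 active=" + nn1.active + ", nn2 active=" + nn2.active);
    }
}
```

Note that the ZKFCs never call each other here; haadmin is the only caller.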

Pros:
- most of the existing manual failover code can be kept
- although new RPCs are added to the ZKFC, the ZKFCs don't need to talk to each other; the
manual failover logic is all handled at the client, haadmin.
- easier to extend to the case of multiple standby NNs

Cons:
- the administrator needs to explicitly start/stop autofailover, in addition to starting/stopping
the ZKFC process.


                
> HA: HDFS portion of ZK-based FailoverController
> -----------------------------------------------
>
>                 Key: HDFS-2185
>                 URL: https://issues.apache.org/jira/browse/HDFS-2185
>             Project: Hadoop HDFS
>          Issue Type: Sub-task
>          Components: auto-failover, ha
>    Affects Versions: 0.24.0, 0.23.3
>            Reporter: Eli Collins
>            Assignee: Todd Lipcon
>             Fix For: Auto failover (HDFS-3042)
>
>         Attachments: Failover_Controller.jpg, hdfs-2185.txt, hdfs-2185.txt, hdfs-2185.txt,
>                      hdfs-2185.txt, hdfs-2185.txt, zkfc-design.pdf, zkfc-design.pdf,
>                      zkfc-design.pdf, zkfc-design.pdf, zkfc-design.tex
>
>
> This jira is for a ZK-based FailoverController daemon. The FailoverController is a separate
> daemon from the NN that does the following:
> * Initiates leader election (via ZK) when necessary
> * Performs health monitoring (aka failure detection)
> * Performs fail-over (standby to active and active to standby transitions)
> * Heartbeats to ensure the liveness
> It should have the same/similar interface as the Linux HA RM to aid pluggability.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira
