hadoop-mapreduce-issues mailing list archives

From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
Date Fri, 24 Feb 2012 20:59:53 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215901#comment-13215901 ]

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

1) Add a ClusterManager to the RM that initially only provides bad-node information. All node
management will be moved to it incrementally. E.g., RMContext.RMnodes and RMContext.invalidNodes
could move into ClusterManager. This could be a new class or an enhancement of NodesListManager.
2) A node is considered bad when a heartbeat from it reports health=unhealthy or when the
liveness monitor reports it as lost. In both cases a SchedulerNodeRemove event is issued. Nodes
becoming healthy are handled similarly. Additionally, these transitions will now also emit
corresponding events to update the ClusterManager.
3) After calling Scheduler.allocate(), AppMasterService will call ClusterManager.getUnusableNodes().
The result will be passed to the RMAttempt object, which will calculate the delta against the
previously known bad machines. The delta will be sent with the allocate response, and the
current list will be saved to calculate the next delta.
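
The delta bookkeeping in step 3 could be sketched roughly as follows. This is only an illustration under stated assumptions, not the actual patch: UnusableNodeDeltaTracker, Delta and update() are hypothetical names, nodes are represented as plain strings rather than real node IDs, and reporting nodes that became usable again is an assumption based on the "becoming healthy" case in step 2.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.Set;

/** Hypothetical per-attempt helper; not an existing Hadoop class. */
final class UnusableNodeDeltaTracker {

    /** Change between two consecutive snapshots of the unusable-node set. */
    static final class Delta {
        final Set<String> newlyUnusable; // bad now, not bad in the last snapshot
        final Set<String> usableAgain;   // bad in the last snapshot, fine now (assumed to be reported)

        Delta(Set<String> newlyUnusable, Set<String> usableAgain) {
            this.newlyUnusable = Collections.unmodifiableSet(newlyUnusable);
            this.usableAgain = Collections.unmodifiableSet(usableAgain);
        }

        boolean isEmpty() {
            return newlyUnusable.isEmpty() && usableAgain.isEmpty();
        }
    }

    // Snapshot that was current when the previous delta was computed.
    private Set<String> lastReported = new HashSet<>();

    /**
     * Compares the current unusable nodes (e.g. the result of the proposed
     * ClusterManager.getUnusableNodes()) with the previous snapshot, saves
     * the current set for the next call, and returns the difference.
     */
    synchronized Delta update(Set<String> currentUnusable) {
        Set<String> added = new HashSet<>(currentUnusable);
        added.removeAll(lastReported);

        Set<String> removed = new HashSet<>(lastReported);
        removed.removeAll(currentUnusable);

        // Defensive copy so later changes by the caller do not affect the snapshot.
        lastReported = new HashSet<>(currentUnusable);
        return new Delta(added, removed);
    }
}
```

In this sketch the allocate path would call update() once per allocate request and attach a non-empty Delta to the response, so each AM only sees changes since its own previous call.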
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, the AM needs to know about that event so that it can
> take some action, e.g. re-executing map tasks whose intermediate output lives on that
> faulty node.

