hadoop-mapreduce-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
Date Fri, 24 Feb 2012 01:57:49 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215306#comment-13215306

Bikas Saha commented on MAPREDUCE-3353:

A potential solution would be the following
1) have the scheduler interface return the set of bad nodes on which it has stopped scheduling.
This keeps the decision of which node is bad in the scheduler. The scheduler is the ultimate
authority on what runs on a node and should tell its clients whether about the nodes that
it is not considering for scheduling.
2) 1) above could be done as another interface API or piggybacked on the scheduler.allocate()
3) The response could contain all the known bad nodes or deltas to the previous response.
Deltas are cheaper to send but are susceptible to message loss and retransmission. Also, deltas
would have to be divided into new bad nodes and new good nodes.
4) The AM might want to know the type of bad node. Say lost or unhealthy etc. The bad nodes
information could be enhanced via querying the RMNode object for the actual reason/health.

As an enhancement, we could add a new RMNodeMananger entity that manages all the RMNodes.
The above functionality could move from the scheduler into RMNodeManager (though it would
need to be in sync with the scheduler). After that, getting detailed information may not need
direct access to RMNode object. Potentially, other interactions with RMNode could be forwarded
through the RMNodeManager. But this would be a fairly significant refactoring thats best left
to a separate future work item.
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
> When a node gets lost or turns faulty, AM needs to know about that event so that it can
take some action like for e.g. re-executing map tasks whose intermediate output live on that
faulty node.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira


View raw message