hadoop-mapreduce-issues mailing list archives

From "Bikas Saha (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-3353) Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
Date Fri, 24 Feb 2012 02:51:49 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-3353?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13215328#comment-13215328 ]

Bikas Saha commented on MAPREDUCE-3353:
---------------------------------------

Sending the full list instead of deltas on the RM->AM channel does not seem viable because
of the high-frequency message traffic. Sending information about 100 bad nodes at 100 bytes
per node to 1000 AMs every second is about 10 MB/s of traffic.
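
The estimate above can be reproduced with a few lines of Java. This is only a
back-of-the-envelope sketch: the class and method names are made up for illustration, and
the one-heartbeat-per-second rate is the assumption stated in the comment, not a measured
value.

```java
// Hypothetical helper, not part of Hadoop: estimates RM->AM traffic when the
// full bad-node list is resent to every AM on every heartbeat.
final class FullListTraffic {
    // Bytes per second sent by the RM across all AMs.
    static long bytesPerSecond(long badNodes, long bytesPerNode,
                               long ams, long heartbeatsPerSecond) {
        return badNodes * bytesPerNode * ams * heartbeatsPerSecond;
    }

    public static void main(String[] args) {
        // Figures from the comment: 100 bad nodes, 100 bytes each, 1000 AMs, 1 heartbeat/s.
        long bps = bytesPerSecond(100, 100, 1000, 1);
        System.out.println(bps / 1_000_000 + " MB/s"); // prints "10 MB/s"
    }
}
```
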
Sending deltas means tracking the last and current states on the RM on a per-AM-attempt basis.
That would not be good to do in the scheduler because it is not the scheduler's responsibility.
So this needs to be done in each RMAttempt object. The RMAttempt object gets the current list
of bad nodes and compares it with its last known list of bad nodes. Additions and deletions
are sent to the AM as new bad and good nodes, respectively.
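
A minimal sketch of that comparison, assuming nodes are identified by plain strings and
that one tracker instance lives in each attempt object. `BadNodeDeltaTracker` and `Delta`
are hypothetical names for illustration, not existing YARN classes.

```java
import java.util.HashSet;
import java.util.Set;

// Hypothetical per-attempt tracker: remembers the bad-node list last reported
// to the AM and computes what changed since then.
final class BadNodeDeltaTracker {
    private Set<String> lastKnownBad = new HashSet<>();

    // Result of one comparison: nodes that turned bad and nodes that recovered.
    static final class Delta {
        final Set<String> newlyBad;
        final Set<String> newlyGood;

        Delta(Set<String> newlyBad, Set<String> newlyGood) {
            this.newlyBad = newlyBad;
            this.newlyGood = newlyGood;
        }
    }

    // Compares the current bad-node list with the last known one and then
    // stores a copy of the current list as the new baseline.
    Delta update(Set<String> currentBad) {
        Set<String> newlyBad = new HashSet<>(currentBad);
        newlyBad.removeAll(lastKnownBad);      // additions: current minus last

        Set<String> newlyGood = new HashSet<>(lastKnownBad);
        newlyGood.removeAll(currentBad);       // deletions: last minus current

        lastKnownBad = new HashSet<>(currentBad);
        return new Delta(newlyBad, newlyGood);
    }
}
```

With this shape the RM keeps one set of node IDs per attempt, and a heartbeat in which
nothing changed yields two empty sets.
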
Alternatively, each RMNode could send an event to each RMAppAttempt for healthy->unhealthy
transitions and vice versa. These events could be accumulated and copied to the AM via the
allocate response.
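
The event-based alternative could be sketched as a small per-attempt buffer. Everything
here is an assumption for illustration: `NodeHealthEventBuffer`, `onTransition`, and
`drain` are hypothetical names, node IDs are plain strings, and the methods are
synchronized on the guess that node events and allocate calls arrive on different threads.

```java
import java.util.LinkedHashMap;
import java.util.Map;

// Hypothetical per-attempt buffer: accumulates node health transitions between
// two allocate calls and hands them over when the next response is built.
final class NodeHealthEventBuffer {
    // nodeId -> latest reported state (true = healthy, false = unhealthy).
    private final Map<String, Boolean> pending = new LinkedHashMap<>();

    // Called when a node changes state; a later event for the same node
    // overwrites the earlier one, so only the latest state is reported.
    synchronized void onTransition(String nodeId, boolean healthy) {
        pending.put(nodeId, healthy);
    }

    // Called while building the allocate response: returns a copy of the
    // accumulated transitions and clears the buffer.
    synchronized Map<String, Boolean> drain() {
        Map<String, Boolean> out = new LinkedHashMap<>(pending);
        pending.clear();
        return out;
    }
}
```

Compared with diffing full lists, this variant stores only the transitions that happened
since the last heartbeat, at the cost of one event per node transition per running attempt.
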
                
> Need a RM->AM channel to inform AMs about faulty/unhealthy/lost nodes
> ---------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3353
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3353
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mrv2, resourcemanager
>    Affects Versions: 0.23.0
>            Reporter: Vinod Kumar Vavilapalli
>            Assignee: Bikas Saha
>            Priority: Critical
>             Fix For: 0.23.2
>
>
> When a node gets lost or turns faulty, the AM needs to know about that event so that it can
take some action, for example re-executing map tasks whose intermediate output lives on that
faulty node.


        
