hadoop-mapreduce-issues mailing list archives

From "Jason Lowe (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-4951) Container preemption interpreted as task failure
Date Tue, 22 Jan 2013 20:28:13 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4951?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13559978#comment-13559978 ]

Jason Lowe commented on MAPREDUCE-4951:
---------------------------------------

As the comment in FairScheduler.preemptResources says, I too am unsure whether the RM translates
the preemption directly into a kill command to the NM, or whether the scheduler is relying
on the AM to see the finished container status from the RM and issue the kill to the NM.

If it's the latter, then the container will be killed after the AM has already determined
the container status correctly.  If the RM really is cleaning up the container and turning
that into a kill command for the NM, then we've got problems.  The task itself could fail
as the JVM tears down from a kill command and report that failure to the AM via the task umbilical
*before* the AM discovers via the heartbeat to the RM that the container was preempted.  A
similar race occurs now when an NM kills a container for being over limits; see MAPREDUCE-4955.
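
The ordering problem described above can be illustrated with a small sketch. It is not the
MR AM's code: the class, enum, and method names are hypothetical, and it assumes the AM fixes
an attempt's outcome from whichever report it sees first.

```java
import java.util.Arrays;
import java.util.List;

public class PreemptionRaceSketch {
    // Hypothetical event names for the two reports the AM can receive.
    enum Event { UMBILICAL_FAILURE, RM_REPORTS_PREEMPTED }

    enum Outcome { FAILED, KILLED }

    // Assumption: the first report to arrive decides the attempt's outcome.
    static Outcome resolve(List<Event> arrivalOrder) {
        Event first = arrivalOrder.get(0);
        return first == Event.RM_REPORTS_PREEMPTED ? Outcome.KILLED : Outcome.FAILED;
    }

    public static void main(String[] args) {
        // The AM learns of the preemption from the RM heartbeat first.
        System.out.println(resolve(Arrays.asList(
                Event.RM_REPORTS_PREEMPTED, Event.UMBILICAL_FAILURE)));
        // The dying JVM reports a failure over the umbilical first.
        System.out.println(resolve(Arrays.asList(
                Event.UMBILICAL_FAILURE, Event.RM_REPORTS_PREEMPTED)));
    }
}
```

Under that assumption the same preemption yields KILLED in one ordering and FAILED in the
other, which is the race in question.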
                
> Container preemption interpreted as task failure
> ------------------------------------------------
>
>                 Key: MAPREDUCE-4951
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4951
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: applicationmaster, mr-am, mrv2
>    Affects Versions: 2.0.2-alpha
>            Reporter: Sandy Ryza
>            Assignee: Sandy Ryza
>         Attachments: MAPREDUCE-4951-1.patch, MAPREDUCE-4951.patch
>
>
> When YARN reports a completed container to the MR AM, the AM always interprets it as a failure.
> This can lead to a job failing because too many of its tasks failed, when in fact they only
> failed because the scheduler preempted them.
> MR needs to recognize the special exit code value of -100 and interpret it as a container
> being killed instead of a container failure.
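
A minimal sketch of the check the description asks for, assuming only what the issue states:
that an exit code of -100 marks a killed container rather than a failed one. The class, enum,
and method names are hypothetical and are not taken from the attached patches or the YARN API.

```java
public class ContainerExitClassifier {
    // The special exit code value named in the issue description.
    static final int KILLED_BY_SCHEDULER_EXIT_CODE = -100;

    enum Outcome { KILLED, FAILED }

    // Classify a completed container that would otherwise be treated as a failure.
    static Outcome classify(int exitCode) {
        return exitCode == KILLED_BY_SCHEDULER_EXIT_CODE ? Outcome.KILLED : Outcome.FAILED;
    }

    // Assumption: only FAILED outcomes should count toward a task's failure limit.
    static boolean countsTowardFailureLimit(int exitCode) {
        return classify(exitCode) == Outcome.FAILED;
    }

    public static void main(String[] args) {
        System.out.println(classify(-100)); // KILLED
        System.out.println(classify(1));    // FAILED
    }
}
```

With a check of this kind, a preempted attempt would not push a job toward failing for
having too many failed tasks.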

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
