hadoop-yarn-issues mailing list archives

From "Eric Yang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8044) Determine the appropriate default ContainerRetryPolicy
Date Mon, 26 Mar 2018 16:27:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16414089#comment-16414089 ]

Eric Yang commented on YARN-8044:

What if the binary doesn't exist on a faulty node due to disk failure, and the exit code
is -1?  We would want the retry to happen on some other node.  I am not sure that adding
logic to detect exit codes is a good way to fix the retry policy.  There are too many
exit codes that have different meanings among applications.

We might want to use a heuristic approach with failure validity intervals.  We might be
able to count the number of failures within the time frame to decide whether we should abort
the containers.
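
As a rough illustration of the counting idea, here is a minimal standalone sketch. It does not use any YARN classes; the name FailureWindow, its method, and both parameters are hypothetical and chosen only for this example. It keeps the timestamps of recent failures, drops the ones older than the validity interval, and stops retrying once more than a configured number of failures fall inside the interval.

```java
import java.util.ArrayDeque;
import java.util.Deque;

/**
 * Hypothetical helper (not part of YARN): counts container failures
 * inside a sliding validity interval, independent of exit codes.
 */
final class FailureWindow {
    private final long validityIntervalMs; // assumed: length of the time frame
    private final int maxFailures;         // assumed: failures tolerated per time frame
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureWindow(long validityIntervalMs, int maxFailures) {
        this.validityIntervalMs = validityIntervalMs;
        this.maxFailures = maxFailures;
    }

    /**
     * Records a failure at nowMs and reports whether a retry is still allowed.
     * Returns false once more than maxFailures failures lie inside the interval.
     */
    boolean recordFailureAndShouldRetry(long nowMs) {
        // Forget failures that are older than the validity interval.
        while (!failureTimes.isEmpty()
                && nowMs - failureTimes.peekFirst() >= validityIntervalMs) {
            failureTimes.pollFirst();
        }
        failureTimes.addLast(nowMs);
        return failureTimes.size() <= maxFailures;
    }
}
```

With an interval of 10 seconds and a limit of 2, a third failure within the same 10 seconds would return false (abort), while a failure arriving long after the earlier ones would be retried again because the old entries have expired. Where the retry is placed (same node or another node) is outside the scope of this sketch.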

> Determine the appropriate default ContainerRetryPolicy
> ------------------------------------------------------
>                 Key: YARN-8044
>                 URL: https://issues.apache.org/jira/browse/YARN-8044
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Shane Kumpf
>            Priority: Major
> {{AbstractLauncher}} sets the retry policy to {{RETRY_ON_ALL_ERRORS}}, which may be
> too inclusive. Some error codes, such as -1, should likely result in a hard fail.
