hadoop-yarn-issues mailing list archives

From "Shane Kumpf (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8044) Determine the appropriate default ContainerRetryPolicy
Date Tue, 27 Mar 2018 18:54:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8044?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16416093#comment-16416093 ]

Shane Kumpf commented on YARN-8044:

{quote}What if the binary doesn't exist on one of the faulty nodes due to disk failure, and the
exit code is -1? We will want the retry to happen on some other nodes.{quote}
I agree that we would want to retry in that case and can see the challenge with using exit
{quote}We might want to use the heuristic approach with failure validity intervals. We
might be able to count the number of failures within the time frame to decide if we should abort
the containers.{quote}
Makes sense to me. It seems YARN-5015 / YARN-8032 address this approach.
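
As a rough illustration of the failure-counting idea quoted above, here is a minimal sketch. It is not code from YARN-5015 or YARN-8032; the class {{FailureWindow}}, its method name, and the millisecond timestamps are all hypothetical, and it assumes failures are reported in non-decreasing time order.

```java
import java.util.ArrayDeque;
import java.util.Deque;

// Hypothetical helper, not part of YARN: counts failures inside a sliding
// validity interval and reports when the tolerated number is exceeded.
final class FailureWindow {
    private final long validityIntervalMs; // failures older than this are forgotten
    private final int maxFailures;         // failures tolerated within the interval
    private final Deque<Long> failureTimes = new ArrayDeque<>();

    FailureWindow(long validityIntervalMs, int maxFailures) {
        this.validityIntervalMs = validityIntervalMs;
        this.maxFailures = maxFailures;
    }

    // Records a failure at nowMs and returns true if the container should be
    // aborted, i.e. more than maxFailures failures fall within the interval.
    // Assumes nowMs never goes backwards between calls.
    boolean recordFailureAndShouldAbort(long nowMs) {
        failureTimes.addLast(nowMs);
        // Drop failures that have aged out of the validity interval.
        while (!failureTimes.isEmpty()
                && nowMs - failureTimes.peekFirst() > validityIntervalMs) {
            failureTimes.removeFirst();
        }
        return failureTimes.size() > maxFailures;
    }
}
```

With a 60 second interval and two tolerated failures, a third failure inside the same minute would trigger an abort, while failures spread further apart would keep being retried.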

Given the above, would it make more sense to re-purpose this issue to expose the retry policy
used by Native Services to the end user? We could use RETRY_ON_ALL_ERRORS as the default.
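
To make the proposal concrete, a minimal sketch of a user-selectable policy with {{RETRY_ON_ALL_ERRORS}} as the fallback might look like the following. The enum only mirrors the three policy names of YARN's {{ContainerRetryPolicy}}; it is a standalone stand-in, and {{RetryDecision}}, {{fromUserSetting}}, and {{shouldRetry}} are hypothetical names, not the Native Services implementation.

```java
import java.util.Locale;
import java.util.Set;

// Standalone stand-in that mirrors the policy names of YARN's
// ContainerRetryPolicy; it is not the Hadoop class itself.
enum RetryPolicy { NEVER_RETRY, RETRY_ON_ALL_ERRORS, RETRY_ON_SPECIFIC_ERROR_CODES }

// Hypothetical helper for illustration only.
final class RetryDecision {
    private RetryDecision() { }

    // Resolves a user-supplied setting, defaulting to RETRY_ON_ALL_ERRORS
    // when nothing is configured. Unknown names throw IllegalArgumentException.
    static RetryPolicy fromUserSetting(String value) {
        if (value == null || value.trim().isEmpty()) {
            return RetryPolicy.RETRY_ON_ALL_ERRORS;
        }
        return RetryPolicy.valueOf(value.trim().toUpperCase(Locale.ROOT));
    }

    // Decides whether a container that exited with exitCode should be retried.
    static boolean shouldRetry(RetryPolicy policy, Set<Integer> retryCodes,
                               int exitCode, int remainingRetries) {
        if (exitCode == 0 || remainingRetries <= 0) {
            return false; // success, or retry budget exhausted
        }
        switch (policy) {
            case RETRY_ON_ALL_ERRORS:
                return true;
            case RETRY_ON_SPECIFIC_ERROR_CODES:
                return retryCodes.contains(exitCode);
            default:
                return false; // NEVER_RETRY
        }
    }
}
```

Under this sketch an exit code of -1 is retried with the default policy, and a user who wants a hard fail for -1 could select {{RETRY_ON_SPECIFIC_ERROR_CODES}} with a code list that leaves -1 out.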

> Determine the appropriate default ContainerRetryPolicy
> ------------------------------------------------------
>                 Key: YARN-8044
>                 URL: https://issues.apache.org/jira/browse/YARN-8044
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>            Reporter: Shane Kumpf
>            Priority: Major
> {{AbstractLauncher}} sets the retry policy to {{RETRY_ON_ALL_ERRORS}}, which may be
> too inclusive. Some error codes, such as -1, should likely result in a hard fail.

This message was sent by Atlassian JIRA
