hadoop-yarn-issues mailing list archives

From "Bikas Saha (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
Date Wed, 01 May 2013 00:10:16 GMT

    [ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13646188#comment-13646188 ]

Bikas Saha commented on YARN-614:

Agree about a method that encapsulates whether an RMAppAttempt failed with an error we want to
ignore. Perhaps rename it to countFailureToAttemptLimit() or something like that. We can then
add hysteresis logic later on for perpetual apps, for which we want to count only the failures
in, say, the last hour.
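
A minimal sketch of what such a method and the later hysteresis window might look like.
Everything here other than the name countFailureToAttemptLimit() is an assumption made for
illustration: the FailureCause values, the FailedAttempt class and the countedFailures()
helper are hypothetical and are not taken from the YARN code base or from the patch.

```java
import java.util.Arrays;
import java.util.List;

class AttemptFailurePolicy {
    /** Hypothetical failure causes; not YARN's real exit-status constants. */
    enum FailureCause { USER_ERROR, NM_FAILED, NM_LOST, NM_DISK_ERROR }

    /** Hypothetical record of one failed attempt: its cause and finish time in millis. */
    static final class FailedAttempt {
        final FailureCause cause;
        final long finishTimeMs;

        FailedAttempt(FailureCause cause, long finishTimeMs) {
            this.cause = cause;
            this.finishTimeMs = finishTimeMs;
        }
    }

    /** True if this failure should count against the app's attempt limit. */
    static boolean countFailureToAttemptLimit(FailedAttempt attempt) {
        // Assumption: only user errors count; hardware/YARN failures are ignored.
        return attempt.cause == FailureCause.USER_ERROR;
    }

    /** Counts only the countable failures that finished within the last windowMs. */
    static long countedFailures(List<FailedAttempt> attempts, long nowMs, long windowMs) {
        return attempts.stream()
                .filter(AttemptFailurePolicy::countFailureToAttemptLimit)
                .filter(a -> nowMs - a.finishTimeMs <= windowMs)
                .count();
    }

    public static void main(String[] args) {
        long hourMs = 3_600_000L;
        List<FailedAttempt> attempts = Arrays.asList(
                new FailedAttempt(FailureCause.USER_ERROR, 0L),          // outside the window
                new FailedAttempt(FailureCause.USER_ERROR, 3_500_000L),  // inside the window
                new FailedAttempt(FailureCause.NM_LOST, 3_600_000L));    // ignored cause
        System.out.println(countedFailures(attempts, 3_700_000L, hourMs));
    }
}
```

Keeping the classification in one method means the window logic can be added later without
touching the callers that only ask whether a failure counts.
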
I am afraid that allowing appattempts.size() to exceed maxAttempts might break code somewhere
else that does not expect this to happen. We need to check thoroughly for this.
The recovery code won't work, since right now the RM does not recover attempts where
appattempts.size() > maxAttempts, e.g. the case above. Look at RMAppManager.recover(). One
solution could be to move the check from finishAttempt() to createAttempt(): finishAttempt()
always enqueues a new attempt, and the new attempt creation checks whether one can still be
created based on the failed count etc. Another solution could be to make the RMApp go from
NEW to FAILED in the recover transition based on failed counts etc.
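
A rough sketch of the first option, with the limit check living in createAttempt() rather
than finishAttempt(). The AppSketch class, its fields and its State enum are hypothetical
stand-ins for illustration and do not mirror the real RMAppImpl state machine; finishAttempt()
here models only failed attempts.

```java
class AppSketch {
    /** Hypothetical, simplified app states. */
    enum State { RUNNING, FAILED }

    private final int maxAttempts;
    private int totalAttempts = 0;    // plays the role of appattempts.size()
    private int countedFailures = 0;  // failures that count toward the limit
    private State state = State.RUNNING;

    AppSketch(int maxAttempts) {
        this.maxAttempts = maxAttempts;
    }

    /** Creates a new attempt only if the counted failures are still under the limit. */
    boolean createAttempt() {
        if (countedFailures >= maxAttempts) {
            state = State.FAILED;
            return false;
        }
        totalAttempts++;
        return true;
    }

    /** Records a failed attempt and always asks for a new one; createAttempt() decides. */
    boolean finishAttempt(boolean countsTowardLimit) {
        if (countsTowardLimit) {
            countedFailures++;
        }
        return createAttempt();
    }

    int getTotalAttempts() { return totalAttempts; }

    State getState() { return state; }

    public static void main(String[] args) {
        AppSketch app = new AppSketch(1);
        app.createAttempt();       // first attempt
        app.finishAttempt(false);  // ignored failure: a second attempt is created
        app.finishAttempt(true);   // counted failure: no further attempt, app fails
        System.out.println(app.getTotalAttempts() + " " + app.getState());
    }
}
```

With maxAttempts set to 1, the total number of attempts reaches 2 here, which is exactly
the situation that other code and the recovery path would need to tolerate.
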

Having said that, recovery won't work because the master container is saved before launching
the attempt and as such does not have the exit status populated in it. Perhaps we could leave
recovery for a different jira and focus on the regular code path in this one.
> Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
> --------------------------------------------------------------------------------------------------
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>         Attachments: YARN-614-0.patch
> Attempts can fail due to a large number of user errors and they should not be retried
> unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or
> YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come
> to mind.

This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
