hadoop-yarn-issues mailing list archives

From "Steve Loughran (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-614) Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
Date Wed, 25 Jun 2014 00:52:28 GMT

    [ https://issues.apache.org/jira/browse/YARN-614?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14042910#comment-14042910
] 

Steve Loughran commented on YARN-614:
-------------------------------------

I like this, but I need to note one thing: our AM has a "suicide <delay>" IPC method which
we use for testing AM failure. We tell the AM to kill itself, and then YARN brings it up
somewhere else.

It's essential that I can somehow replicate this behavior on live clusters. Is there a way
to do it here? Perhaps an exit code from the AM that says "please restart". That would also
allow live AMs to trigger a restart if they actually felt they were in a bad way.
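
A minimal sketch of what the AM side of that could look like, assuming a dedicated "please restart" exit code existed. The class name, the `suicide` method signature, and the value 64 are all hypothetical; nothing here is an existing YARN API, and the exit action is injected so the logic can be exercised without actually killing the JVM.

```java
import java.util.function.IntConsumer;

// Hypothetical AM-side helper: exits after a delay with a "please restart" code.
class AmRestartSketch {
    // Assumed exit code meaning "restart me"; YARN defines no such code.
    static final int EXIT_PLEASE_RESTART = 64;

    // What to do with the exit code; a real AM would pass System::exit.
    private final IntConsumer exitHook;

    AmRestartSketch(IntConsumer exitHook) {
        this.exitHook = exitHook;
    }

    // Returns immediately so the IPC call can complete, then fires the
    // exit hook on a daemon thread once delayMillis has elapsed.
    void suicide(long delayMillis) {
        Thread killer = new Thread(() -> {
            try {
                Thread.sleep(delayMillis);
            } catch (InterruptedException e) {
                // Interrupted before the delay elapsed: abandon the exit.
                Thread.currentThread().interrupt();
                return;
            }
            exitHook.accept(EXIT_PLEASE_RESTART);
        }, "am-suicide");
        killer.setDaemon(true);
        killer.start();
    }
}
```

In a live AM this would be wired up as `new AmRestartSketch(System::exit).suicide(delay)`; whether YARN then treats that exit code as a restart request is exactly the open question.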


> Retry attempts automatically for hardware failures or YARN issues and set default app retries to 1
> --------------------------------------------------------------------------------------------------
>
>                 Key: YARN-614
>                 URL: https://issues.apache.org/jira/browse/YARN-614
>             Project: Hadoop YARN
>          Issue Type: Improvement
>            Reporter: Bikas Saha
>            Assignee: Xuan Gong
>             Fix For: 2.5.0
>
>         Attachments: YARN-614-0.patch, YARN-614-1.patch, YARN-614-2.patch, YARN-614-3.patch,
>                      YARN-614-4.patch, YARN-614-5.patch, YARN-614-6.patch, YARN-614.7.patch
>
>
> Attempts can fail due to a large number of user errors and they should not be retried
> unnecessarily. The only reason YARN should retry an attempt is when the hardware fails or
> YARN has an error. NM failing, lost NM and NM disk errors are the hardware errors that come
> to mind.
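
The rule in the description above can be sketched as a small classification function. This is only an illustration of the stated policy, not the logic in the attached patches; the `FailureCause` enum and `shouldRetry` method are hypothetical names invented for the example.

```java
// Hypothetical illustration of "retry only on hardware or YARN failures".
class RetryPolicySketch {
    // Assumed failure categories, taken from the examples in the description.
    enum FailureCause {
        USER_ERROR,          // bug or misconfiguration in the application
        NM_FAILED,           // NodeManager process failed
        NM_LOST,             // NodeManager stopped reporting in
        NM_DISK_ERROR,       // NodeManager disks went bad
        YARN_INTERNAL_ERROR  // error inside YARN itself
    }

    // User errors are not retried; hardware and YARN errors are.
    static boolean shouldRetry(FailureCause cause) {
        switch (cause) {
            case USER_ERROR:
                return false;
            default:
                return true;
        }
    }
}
```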



--
This message was sent by Atlassian JIRA
(v6.2#6252)
