hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4079) Retrospect on the decision of making yarn.dispatcher.exit-on-error as true explicitly in daemons
Date Tue, 25 Aug 2015 15:23:45 GMT

    [ https://issues.apache.org/jira/browse/YARN-4079?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14711431#comment-14711431 ]

Junping Du commented on YARN-4079:
----------------------------------

Thanks for filing this JIRA, [~varun_saxena].
bq. Probably we can read this value from configuration and set it to true in daemons if not
configured. This way, in production clusters, if there is an exception which is leading to the
daemon crashing frequently and we find that it's unavoidable but not a very big issue (i.e. the daemon
can still work normally for the most part), we can at least set the configuration to false in the config
file.
I don't mean that we should simply make this configuration public and allow users to set it to false
to disable exit-on-failure when an exception happens. That could make things worse if a critical
exception happens but the NMs/RM keep running as if everything were normal. We should think more about this.

> Retrospect on the decision of making yarn.dispatcher.exit-on-error as true explicitly
in daemons
> ------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4079
>                 URL: https://issues.apache.org/jira/browse/YARN-4079
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: yarn
>    Affects Versions: 2.7.1
>            Reporter: Varun Saxena
>            Assignee: Varun Saxena
>
> Currently this config is explicitly set to true in all daemons so that they crash instead of hanging around. This seems correct: a recoverable exception should be caught and handled and NOT leaked through to AsyncDispatcher, and a non-recoverable one should lead to a crash anyway.
> But this can make the system more fragile if we fail to catch all recoverable exceptions.
> Currently we do not even have the option of setting it to false in configuration, even if we wanted to.
> Probably we can read this value from configuration and set it to true in daemons if not configured.
> This way, in production clusters, if there is an exception which is leading to the daemon crashing frequently and we find that it's unavoidable but not a very big issue (i.e. the daemon can still work normally for the most part), we can at least set the configuration to false in the config file.
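
The "read from configuration, default to true if not configured" idea in the description could look roughly like the sketch below. This is only an illustration under stated assumptions: `java.util.Properties` stands in for Hadoop's `Configuration` so the example is self-contained, and the class and method names (`ExitOnErrorDefault`, `resolveExitOnError`) are hypothetical, not part of YARN. Only the key name `yarn.dispatcher.exit-on-error` comes from the issue itself.

```java
import java.util.Properties;

// Hypothetical helper, not YARN code. java.util.Properties is used here
// as a stand-in for Hadoop's Configuration object (an assumption made
// to keep the sketch runnable without Hadoop on the classpath).
class ExitOnErrorDefault {
    // Key name taken from the issue title.
    static final String EXIT_ON_ERROR_KEY = "yarn.dispatcher.exit-on-error";

    // Returns the operator's setting if one is present; otherwise falls
    // back to true, which matches what the daemons hard-code today.
    static boolean resolveExitOnError(Properties conf) {
        String value = conf.getProperty(EXIT_ON_ERROR_KEY);
        if (value == null) {
            return true; // not configured: keep crash-on-error behaviour
        }
        return Boolean.parseBoolean(value.trim());
    }

    public static void main(String[] args) {
        Properties unset = new Properties();

        Properties disabled = new Properties();
        disabled.setProperty(EXIT_ON_ERROR_KEY, "false");

        System.out.println(resolveExitOnError(unset));    // prints "true"
        System.out.println(resolveExitOnError(disabled)); // prints "false"
    }
}
```

Note that this sketch only shows the mechanics of the proposal. It does not address the concern raised in the comment above, namely that letting operators set the value to false may hide critical exceptions while the NMs/RM keep running.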



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
