hadoop-yarn-issues mailing list archives

From "Naganarasimha G R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5635) Better handling when bad script is configured as Node's HealthScript
Date Tue, 13 Sep 2016 17:45:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15487878#comment-15487878 ]

Naganarasimha G R commented on YARN-5635:

[~rchiang], unfortunately I reopened this jira and reworded it at almost the same time. Sorry,
I was not aware that a new jira had been raised a little earlier than this, and thanks for closing
it. You can go ahead and make it a subtask of YARN-5078.

Well, your new approach of defining a new exit code sounds almost like an incompatible change
for existing node health scripts. But is it required? Existing code treats any exit code
other than zero as unsuccessful and reports it as {{HealthCheckerExitStatus.FAILED_WITH_EXIT_CODE}},
whereas {{HealthCheckerExitStatus.FAILED}} is reported when the output of the script has the
{{"ERROR"}} string in it.
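
To make the existing behaviour described above concrete, here is a minimal, self-contained sketch. It is not the actual NodeManager code: the class name, the {{classify}} method, and the {{SUCCESS}} and {{TIMED_OUT}} values are assumptions for illustration, and the plain substring check for "ERROR" is a simplification of however the real script runner matches the output.

```java
public class HealthScriptStatusSketch {

    // Statuses loosely modelled on HealthCheckerExitStatus.
    // SUCCESS and TIMED_OUT are assumed here for completeness.
    enum Status { SUCCESS, TIMED_OUT, FAILED_WITH_EXIT_CODE, FAILED }

    // Hypothetical helper: classify one run of the health script.
    static Status classify(int exitCode, boolean timedOut, String output) {
        if (timedOut) {
            return Status.TIMED_OUT;              // script never finished
        }
        if (exitCode != 0) {
            return Status.FAILED_WITH_EXIT_CODE;  // any exit code other than zero
        }
        if (output != null && output.contains("ERROR")) {
            return Status.FAILED;                 // script itself reported an error
        }
        return Status.SUCCESS;
    }

    public static void main(String[] args) {
        System.out.println(classify(0, false, "all good"));        // SUCCESS
        System.out.println(classify(1, false, ""));                // FAILED_WITH_EXIT_CODE
        System.out.println(classify(0, false, "ERROR disk full")); // FAILED
    }
}
```

Under this reading, an existing script that exits non-zero is already distinguishable from one that prints "ERROR", without defining any new exit code.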

So what we would want to address here is how to better handle the case where the script output
has errors or the script times out. In this case it would *not* be good to gracefully
drain the NM directly, but rather to report that the status could not be obtained properly from
the NM through the script. Any thoughts on my earlier comment?
The NM can report Healthy/UnHealthy/HealthValidationError. This can be sent across in the heartbeat
to the RM, and the RM can capture the state of this NM as something other than Running and
UnHealthy (a new state). This can be displayed in the WebUI and can also be queried using
./yarn node -list -state
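
As a rough sketch of the three-valued report suggested above, the script outcome could be mapped to what the NM sends in its heartbeat. Everything here is hypothetical: the class, the enum names, and in particular which outcomes count as a validation error are my assumptions about one possible reading, not existing YARN code or a settled design.

```java
public class NodeHealthReportSketch {

    // Outcome of running the health script (assumed names).
    enum ScriptStatus { SUCCESS, TIMED_OUT, FAILED_WITH_EXIT_CODE, FAILED }

    // Proposed value the NM would report to the RM (hypothetical).
    enum ReportedHealth { HEALTHY, UNHEALTHY, HEALTH_VALIDATION_ERROR }

    // Assumed mapping: only a clean run or an explicit "ERROR" from the
    // script is trusted; everything else means health could not be validated.
    static ReportedHealth toReport(ScriptStatus status) {
        switch (status) {
            case SUCCESS:
                return ReportedHealth.HEALTHY;
            case FAILED:
                return ReportedHealth.UNHEALTHY;
            case TIMED_OUT:
            case FAILED_WITH_EXIT_CODE:
            default:
                return ReportedHealth.HEALTH_VALIDATION_ERROR;
        }
    }

    public static void main(String[] args) {
        for (ScriptStatus s : ScriptStatus.values()) {
            System.out.println(s + " -> " + toReport(s));
        }
    }
}
```

The point of the third value is that a bad script neither drains the node nor silently passes as healthy; the RM could then track such nodes in their own state.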

> Better handling when bad script is configured as Node's HealthScript
> --------------------------------------------------------------------
>                 Key: YARN-5635
>                 URL: https://issues.apache.org/jira/browse/YARN-5635
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Allen Wittenauer
>            Assignee: Yufei Gu
> The earlier fix for YARN-5567 was reverted because it's not ideal to bring the whole cluster
> down because of a bad script. At the same time it's important to report that the script
> configured as the node health script is erroneous, as it might fail to detect bad health of a node.
