hadoop-yarn-issues mailing list archives

From "Allen Wittenauer (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5567) Fix script exit code checking in NodeHealthScriptRunner#reportHealthStatus
Date Mon, 12 Sep 2016 18:06:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5567?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15484820#comment-15484820 ]

Allen Wittenauer commented on YARN-5567:
----------------------------------------

bq. would you prefer this be a config setting to choose the behavior?

The history of the health check script is interesting, but long.  The short version: not
trusting the exit code was one of the key lessons the ops team took from the HOD experience.
The script fails a lot more often than people realize, mainly because users do crazy things,
especially on insecure systems.

This is one of those times where it's going to be extremely difficult to convince me otherwise.
I can't think of a reason to ever trust the exit code enough to bring down the NodeManager.
In this particular environment, the script can fail for many reasons that may be temporary
or pointless.

Now it could be argued that those temporary failures should cause the NM to come down, but
then you get into a race condition between heartbeats and actual issues.  HDFS worked around
it by basically saying "it has to fail for X long". Ignoring the exit code avoids that problem
because one can be sure that "ERROR -" really did come from the script.
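
To make the argument above concrete, here is a minimal sketch of a health decision that depends only on the script's output and never on its exit code. The class name, method names, and the exact marker string are assumptions made for illustration; this is not the NodeManager's actual implementation.

```java
import java.util.Arrays;

// Hypothetical sketch; class and method names are illustrative only.
class HealthOutputSketch {
    // Assumed marker: the script prints a line starting with "ERROR"
    // only when it has positively identified a bad node.
    static final String ERROR_MARKER = "ERROR";

    // True only if some line of the script output begins with the marker.
    static boolean reportsError(String scriptOutput) {
        return Arrays.stream(scriptOutput.split("\n"))
                .anyMatch(line -> line.startsWith(ERROR_MARKER));
    }

    // The exit code is accepted but deliberately ignored: a non-zero exit
    // may only mean the script itself broke, not that the node is bad.
    static boolean isNodeHealthy(int exitCode, String scriptOutput) {
        return !reportsError(scriptOutput);
    }

    public static void main(String[] args) {
        // Script crashed with exit code 1 and printed nothing: still healthy.
        System.out.println(isNodeHealthy(1, ""));
        // Script exited cleanly but printed an error line: unhealthy.
        System.out.println(isNodeHealthy(0, "ERROR - disk is read-only\n"));
    }
}
```

With this shape, a script that is killed, times out, or trips over a user's mess cannot take the node down by accident; only a message the script wrote on purpose can.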

bq. Alternatively, would you be okay with standardizing on a specific error code for "detected
bad Node" vs "bad script"?

If by error code you specifically mean the value the NM reports back to the RM, yes that makes
sense.  It just can't fail the node.  
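
One possible shape for that, as a rough sketch: classify the outcome into a value that can be reported upstream, while only the "bad node" case changes the node's health. The enum values and method names here are hypothetical and are not existing YARN APIs.

```java
// Hypothetical sketch; names are illustrative, not YARN APIs.
class HealthStatusSketch {
    // Value that could be reported to the RM for visibility.
    enum Result { OK, NODE_BAD, SCRIPT_FAILED }

    // An error line in the output wins over everything else; a non-zero
    // exit code without an error line is classified as a script problem.
    static Result classify(int exitCode, boolean outputHasErrorLine) {
        if (outputHasErrorLine) {
            return Result.NODE_BAD;
        }
        if (exitCode != 0) {
            return Result.SCRIPT_FAILED;
        }
        return Result.OK;
    }

    // Only a positively detected bad node marks the node unhealthy;
    // SCRIPT_FAILED is reported but does not fail the node.
    static boolean marksNodeUnhealthy(Result result) {
        return result == Result.NODE_BAD;
    }

    public static void main(String[] args) {
        Result r = classify(1, false);
        System.out.println(r + " unhealthy=" + marksNodeUnhealthy(r));
    }
}
```

The point of the split is that operators can still see that the script is misbehaving without that signal being able to remove the node from service.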

> Fix script exit code checking in NodeHealthScriptRunner#reportHealthStatus
> --------------------------------------------------------------------------
>
>                 Key: YARN-5567
>                 URL: https://issues.apache.org/jira/browse/YARN-5567
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.8.0, 3.0.0-alpha1
>            Reporter: Yufei Gu
>            Assignee: Yufei Gu
>             Fix For: 3.0.0-alpha1
>
>         Attachments: YARN-5567.001.patch
>
>
> In case of FAILED_WITH_EXIT_CODE, health status should be false.
> {code}
>       case FAILED_WITH_EXIT_CODE:
>         setHealthStatus(true, "", now);
>         break;
> {code}
> should be 
> {code}
>       case FAILED_WITH_EXIT_CODE:
>         setHealthStatus(false, "", now);
>         break;
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

