hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Devaraj K (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2832) Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors
Date Mon, 16 Feb 2015 18:10:14 GMT

    [ https://issues.apache.org/jira/browse/YARN-2832?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14323092#comment-14323092

Devaraj K commented on YARN-2832:

Nice catch [~tianyin]. Thanks for your contribution.

The patch looks good to me except these comments.

- can you change the log level to INFO and log message similar to the one in NodeHealthScriptRunner.serviceStart().
LOG.info("Not starting node health monitor");

- and also can you remove the shouldRun() redundant check in NodeHealthScriptRunner.serviceStart().

> Wrong Check Logic of NodeHealthCheckerService Causes Latent Errors
> ------------------------------------------------------------------
>                 Key: YARN-2832
>                 URL: https://issues.apache.org/jira/browse/YARN-2832
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.4.1, 2.5.1
>         Environment: Any environment
>            Reporter: Tianyin Xu
>         Attachments: health.check.service.1.patch
> NodeManager allows users to specify the health checker script that will be invoked by
the health-checker service via the configuration parameter, "_yarn.nodemanager.health-checker.script.path_"

> During the _serviceInit()_ of the health-check service, NM checks whether the parameter
is set correctly using _shouldRun()_, as follows,
> {code:title=/* NodeHealthCheckerService.java */|borderStyle=solid}
>   protected void serviceInit(Configuration conf) throws Exception {
>     if (NodeHealthScriptRunner.shouldRun(conf)) {
>       nodeHealthScriptRunner = new NodeHealthScriptRunner();
>       addService(nodeHealthScriptRunner);
>     }
>     addService(dirsHandler);
>     super.serviceInit(conf);
>   }
> {code}
> The problem is that if the parameter is misconfigured (e.g., permission problem, wrong
path), NM does not have any log message to inform users which could cause latent errors or
mysterious problems (e.g., "why my scripts does not work?")
> I see the checking and printing logic is put in _serviceStart()_ function in _NodeHealthScriptRunner.java_
(see the following code snippets). However, the logic is very wrong. For an incorrect parameter
that does not pass the "shouldRun" check, _serviceStart()_ would never be called because the
_NodeHealthScriptRunner_ instance does not have the chance to be created (see the code snippets
> {code:title=/* NodeHealthScriptRunner.java */|borderStyle=solid}
>   protected void serviceStart() throws Exception {
>     // if health script path is not configured don't start the thread.
>     if (!shouldRun(conf)) {
>       LOG.info("Not starting node health monitor");
>       return;
>     }
>     ... 
>   }  
> {code}
> Basically, I think the checking and printing logic should be put in the serviceInit()
in NodeHealthCheckerService instead of serviceStart() in NodeHealthScriptRunner.
> See the attachment for the simple patch.

This message was sent by Atlassian JIRA

View raw message