hadoop-yarn-issues mailing list archives

From "Hadoop QA (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1996) Provide alternative policies for UNHEALTHY nodes.
Date Tue, 25 Nov 2014 01:44:13 GMT

    [ https://issues.apache.org/jira/browse/YARN-1996?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14223917#comment-14223917 ]

Hadoop QA commented on YARN-1996:

{color:red}-1 overall{color}.  Here are the results of testing the latest attachment 
  against trunk revision 8caf537.

    {color:green}+1 @author{color}.  The patch does not contain any @author tags.

    {color:green}+1 tests included{color}.  The patch appears to include 6 new or modified
test files.

      {color:red}-1 javac{color}.  The applied patch generated 1223 javac compiler warnings
(more than the trunk's current 1219 warnings).

    {color:green}+1 javadoc{color}.  There were no new javadoc warning messages.

    {color:green}+1 eclipse:eclipse{color}.  The patch built with eclipse:eclipse.

    {color:green}+1 findbugs{color}.  The patch does not introduce any new Findbugs (version
2.0.3) warnings.

    {color:green}+1 release audit{color}.  The applied patch does not increase the total number
of release audit warnings.

    {color:red}-1 core tests{color}.  The patch failed these unit tests in hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app
hadoop-yarn-project/hadoop-yarn/hadoop-yarn-api hadoop-yarn-project/hadoop-yarn/hadoop-yarn-common


    {color:green}+1 contrib tests{color}.  The patch passed contrib unit tests.

Test results: https://builds.apache.org/job/PreCommit-YARN-Build/5927//testReport/
Javac warnings: https://builds.apache.org/job/PreCommit-YARN-Build/5927//artifact/patchprocess/diffJavacWarnings.txt
Console output: https://builds.apache.org/job/PreCommit-YARN-Build/5927//console

This message is automatically generated.

> Provide alternative policies for UNHEALTHY nodes.
> -------------------------------------------------
>                 Key: YARN-1996
>                 URL: https://issues.apache.org/jira/browse/YARN-1996
>             Project: Hadoop YARN
>          Issue Type: New Feature
>          Components: nodemanager, scheduler
>    Affects Versions: 2.4.0
>            Reporter: Gera Shegalov
>            Assignee: Gera Shegalov
>         Attachments: YARN-1996-2.patch, YARN-1996.v01.patch
> Currently, UNHEALTHY nodes can significantly prolong execution of large expensive jobs
as demonstrated by MAPREDUCE-5817, and degrade the cluster health even further due to [positive
feedback|http://en.wikipedia.org/wiki/Positive_feedback]. A container set that might have
caused the node to be deemed unhealthy in the first place starts spreading across the cluster because the
current node is declared unusable and all its containers are killed and rescheduled on different
nodes.
> To mitigate this, we experiment with a patch that allows containers already running on
a node that turns UNHEALTHY to complete (drain), while no new containers can be assigned to it
until it turns healthy again.
> This mechanism can also be used for graceful decommissioning of an NM. To this end, we have
to write a health script such that it can deterministically report UNHEALTHY. For example:
> {code}
> if [ -e "$1" ] ; then
>   echo ERROR Node decommissioning via health script hack
> fi
> {code}
> In the current version of the patch, the behavior is controlled by a boolean property {{yarn.nodemanager.unhealthy.drain.containers}}.
More versatile policies are possible in future work. Currently, the health state of a
node is a binary value determined by the disk checker and the health script ERROR outputs. However,
we could also interpret health script output in the style of Java logging levels (one of which
is ERROR), such as WARN and FATAL. Each level can then be treated differently. E.g.,
> - FATAL: unusable like today
> - ERROR: drain
> - WARN: halve the node capacity.
> complemented with some equivalence rules such as 3 WARN messages == ERROR, 2*ERROR ==
FATAL, etc.
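
The level policy and equivalence rules proposed above could be sketched roughly as follows. This is only an illustration, not code from the patch: the function names {{effective_level}} and {{action_for}}, the action labels, and the assumption that a level is recognized when an output line begins with it are all hypothetical.

```shell
#!/bin/sh
# Hypothetical sketch: collapse health-script output into one effective
# level using the proposed rules (3 WARN == ERROR, 2 ERROR == FATAL).

# Reads health-script output on stdin; prints FATAL, ERROR, WARN, or OK.
# Assumption: a line counts toward a level if it starts with that level.
effective_level() {
  warn=0 error=0 fatal=0
  while IFS= read -r line; do
    case "$line" in
      FATAL*) fatal=$((fatal + 1)) ;;
      ERROR*) error=$((error + 1)) ;;
      WARN*)  warn=$((warn + 1)) ;;
    esac
  done
  # Promote every 3 WARNs to one ERROR, then every 2 ERRORs to one FATAL.
  error=$((error + warn / 3))
  warn=$((warn % 3))
  fatal=$((fatal + error / 2))
  error=$((error % 2))
  if [ "$fatal" -gt 0 ]; then echo FATAL
  elif [ "$error" -gt 0 ]; then echo ERROR
  elif [ "$warn" -gt 0 ]; then echo WARN
  else echo OK
  fi
}

# Maps an effective level to the action listed in the proposal.
# The action labels are illustrative placeholders.
action_for() {
  case "$1" in
    FATAL) echo "unusable" ;;
    ERROR) echo "drain" ;;
    WARN)  echo "halve-capacity" ;;
    *)     echo "healthy" ;;
  esac
}
```

For instance, piping a health script's output through {{effective_level}} and passing the result to {{action_for}} would yield "drain" for a script that printed three WARN lines. How the counts should be windowed over time is left open here, as it is in the proposal.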

This message was sent by Atlassian JIRA
