hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-4576) Extend blacklist mechanism to protect AM failed multiple times on failure nodes
Date Mon, 11 Jan 2016 16:11:39 GMT

     [ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Junping Du updated YARN-4576:
-----------------------------
    Description: The current YARN blacklist mechanism tracks bad nodes from the AM side: if
the AM's attempts to launch containers on a specific node fail several times, the AM blacklists
that node in future resource requests. This mechanism works fine for normal containers. However,
we have observed on several clusters that if an AM fails to launch on a problematic node, the
RM may pick the same node again and again for subsequent AM attempts, which causes the
application to fail when other, functional nodes are busy. Normally, the customized health
checker script cannot be so sensitive that it marks a node as unhealthy when only one or two
container launches fail. On the RM side, however, we can blacklist these nodes for launching
AMs for a certain time if AM launches have failed on them before.  (was: The current YARN
blacklist mechanism tracks bad nodes from the AM side: if the AM's attempts to launch containers
on a specific node fail several times, the AM blacklists that node in future resource requests.
This mechanism works fine for normal containers. However, we have observed on several clusters
that if an AM fails to launch on a problematic node, the RM may pick the same node again and
again for subsequent AM attempts, which causes the application to fail when other, functional
nodes are busy. Normally, the customized health checker script cannot be so sensitive that it
marks a node as unhealthy when only one or two container launches fail. On the RM side, however,
we can blacklist these nodes for launching AMs for a certain time.)

> Extend blacklist mechanism to protect AM failed multiple times on failure nodes
> -------------------------------------------------------------------------------
>
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>
> The current YARN blacklist mechanism tracks bad nodes from the AM side: if the AM's attempts
to launch containers on a specific node fail several times, the AM blacklists that node in
future resource requests. This mechanism works fine for normal containers. However, we have
observed on several clusters that if an AM fails to launch on a problematic node, the RM may
pick the same node again and again for subsequent AM attempts, which causes the application
to fail when other, functional nodes are busy. Normally, the customized health checker script
cannot be so sensitive that it marks a node as unhealthy when only one or two container launches
fail. On the RM side, however, we can blacklist these nodes for launching AMs for a certain
time if AM launches have failed on them before.
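
The RM-side idea described above (skip a node for AM placement for a certain time after an AM
launch failed there) could be sketched roughly as follows. This is only an illustration under
stated assumptions, not the YARN implementation: the class name AmLaunchBlacklist, its methods,
and the blacklistMillis parameter are all hypothetical, and time is passed in explicitly so the
behavior is easy to check.

```java
import java.util.HashMap;
import java.util.Map;

/**
 * Hypothetical RM-side tracker of nodes to avoid when placing AM attempts.
 * Illustrative only; none of these names come from the YARN code base.
 */
class AmLaunchBlacklist {
    // Assumed parameter: how long a node is excluded after an AM launch failure.
    private final long blacklistMillis;

    // Node id -> timestamp (ms) at which the node becomes eligible again.
    private final Map<String, Long> blacklistedUntil = new HashMap<>();

    AmLaunchBlacklist(long blacklistMillis) {
        this.blacklistMillis = blacklistMillis;
    }

    /** Record that an AM attempt failed to launch on nodeId at time nowMillis. */
    void recordAmLaunchFailure(String nodeId, long nowMillis) {
        blacklistedUntil.put(nodeId, nowMillis + blacklistMillis);
    }

    /** Returns true if the node should be skipped for the next AM attempt. */
    boolean isBlacklisted(String nodeId, long nowMillis) {
        Long until = blacklistedUntil.get(nodeId);
        if (until == null) {
            return false;                     // no recorded AM failure on this node
        }
        if (nowMillis >= until) {
            blacklistedUntil.remove(nodeId);  // blacklist period expired
            return false;
        }
        return true;
    }
}
```

Because entries expire, a node that failed once is not excluded permanently, which fits the
point that one or two failed launches should not mark the node as unhealthy for all containers.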



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
