hadoop-yarn-issues mailing list archives

From "Hong Zhiguo (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-4181) node blacklist for AM launching
Date Fri, 18 Sep 2015 07:58:04 GMT
Hong Zhiguo created YARN-4181:

             Summary: node blacklist for AM launching
                 Key: YARN-4181
                 URL: https://issues.apache.org/jira/browse/YARN-4181
             Project: Hadoop YARN
          Issue Type: Bug
          Components: resourcemanager
            Reporter: Hong Zhiguo
            Assignee: Hong Zhiguo
            Priority: Minor

In some cases a node becomes faulty and most containers launched on it fail, including
AM containers.
Because the failed containers release their resources, this node then has more available
resource than other nodes in the cluster. The Application whose AM is failing has zero
minShareRatio. With the fair scheduler, this node is always ranked first, and the unlucky
Application is likely ranked first as well. The result: attempts of this Application fail
again and again on the same node.
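
The loop described above can be illustrated with a toy model. This is only a sketch,
not YARN code: the node names, the resource numbers, and the rule "pick the node with
the most available resource" are simplifying assumptions standing in for the fair
scheduler's real ordering.

```python
# Toy model, not YARN code: node names and resource numbers are made up.

def pick_node(available):
    """Return the node with the most available resource (assumed ranking rule)."""
    return max(available, key=available.get)


def simulate_am_attempts(available, bad_nodes, max_attempts):
    """Launch AM attempts until one succeeds; return the nodes tried, in order."""
    tried = []
    for _ in range(max_attempts):
        node = pick_node(available)
        tried.append(node)
        if node not in bad_nodes:
            break  # the AM launched successfully
        # The container failed, so its resource is released and the bad
        # node is still the emptiest node on the next scheduling pass.
    return tried


available = {"node1": 8, "node2": 2, "node3": 3}
attempts = simulate_am_attempts(available, bad_nodes={"node1"}, max_attempts=4)
print(attempts)  # every attempt lands on node1
```

Nothing in this model changes between attempts, which is why the same node is chosen
every time.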

Solution 1: The NM could track the failure rate of its containers. If the rate is high, the
NM marks itself unhealthy for a period. But we should be careful that a single buggy
Application cannot turn all nodes unhealthy. Maybe track the container failure rate
separately for different Applications.
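
One way to read Solution 1 is sketched below. The class name, the thresholds, and the
rule "at least two Applications must be failing" are all assumptions made for
illustration; none of them is an existing NodeManager API. The time-limited unhealthy
period is left out to keep the sketch short.

```python
from collections import defaultdict


class ContainerFailureTracker:
    """Hypothetical per-node tracker; not an existing NodeManager class."""

    def __init__(self, rate_threshold=0.8, min_apps=2, min_launches=3):
        self.rate_threshold = rate_threshold  # failure rate counted as "high"
        self.min_apps = min_apps              # failing Applications required
        self.min_launches = min_launches      # ignore Applications with few launches
        self.launched = defaultdict(int)
        self.failed = defaultdict(int)

    def record(self, app_id, succeeded):
        """Record the outcome of one container launch for an Application."""
        self.launched[app_id] += 1
        if not succeeded:
            self.failed[app_id] += 1

    def failing_apps(self):
        """Applications whose own failure rate on this node is high."""
        return [
            app for app, count in self.launched.items()
            if count >= self.min_launches
            and self.failed[app] / count >= self.rate_threshold
        ]

    def should_mark_unhealthy(self):
        """True only when several distinct Applications are failing here."""
        return len(self.failing_apps()) >= self.min_apps


tracker = ContainerFailureTracker()
for _ in range(5):
    tracker.record("app_1", succeeded=False)
single_buggy_app = tracker.should_mark_unhealthy()

for _ in range(3):
    tracker.record("app_2", succeeded=False)
several_apps_failing = tracker.should_mark_unhealthy()
```

With these assumed thresholds, one Application failing repeatedly does not mark the
node unhealthy, while failures spread across two Applications do.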

Solution 2: Add an Application-level blacklist maintained by the AMLauncher, in addition to
the existing blacklist maintained by the AM.
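
A minimal sketch of what Solution 2 might look like follows. The class, its method
names, and the fallback when every node is blacklisted are hypothetical choices for
illustration, not the actual AMLauncher or ResourceManager code.

```python
class AmLaunchBlacklist:
    """Hypothetical RM-side record of nodes where an Application's AM failed."""

    def __init__(self, max_failures_per_node=1):
        self.max_failures_per_node = max_failures_per_node
        self.failures = {}  # (app_id, node) -> number of failed AM launches

    def record_am_failure(self, app_id, node):
        key = (app_id, node)
        self.failures[key] = self.failures.get(key, 0) + 1

    def is_blacklisted(self, app_id, node):
        return self.failures.get((app_id, node), 0) >= self.max_failures_per_node

    def candidates(self, app_id, ranked_nodes):
        """Filter the scheduler's ranked nodes for this Application's next AM."""
        allowed = [n for n in ranked_nodes if not self.is_blacklisted(app_id, n)]
        # Assumed fallback: if every node is blacklisted, ignore the blacklist
        # rather than leave the Application with nowhere to run.
        return allowed if allowed else list(ranked_nodes)


blacklist = AmLaunchBlacklist()
ranked = ["node1", "node3", "node2"]  # node1 is the faulty, emptiest node
blacklist.record_am_failure("app_1", "node1")

app1_nodes = blacklist.candidates("app_1", ranked)
app2_nodes = blacklist.candidates("app_2", ranked)
```

The blacklist is keyed by Application, so one Application's failed AM attempt on a
node does not hide that node from other Applications.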

This message was sent by Atlassian JIRA