hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4576) Enhancement for tracking Blacklist in AM Launching
Date Sat, 19 Mar 2016 08:50:33 GMT

    [ https://issues.apache.org/jira/browse/YARN-4576?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15202664#comment-15202664 ]

Varun Vasudev commented on YARN-4576:

bq. No. A DISKS_FAILED mark on bad disks is a transient status for the node. For example,
if "yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage" is set
to 90% (the default) and another job (YARN or not) repeatedly writes some files and deletes
them afterwards, a node whose disk usage happens to hover around 90% will flip its health
report to the RM back and forth between healthy and unhealthy. A blacklist for AM launching
can evaluate the history to decide on a better place to launch the AM. The bar for launching
normal containers could be different, or we could end up with too few choices.

A couple of points here -
# On the specific disks-failed issue, there is now support for a watermark to avoid the
flapping you described - YARN-3943.
# On the more general point of nodes switching back and forth between good and bad - the
better solution would be for the RM to detect bouncing nodes and stop allocating new
containers to them until they stabilize.
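For reference, the watermark approach from YARN-3943 separates the utilization threshold at which a disk is marked bad from a lower threshold at which it is marked good again, so usage hovering near a single value no longer flips node health. A hedged yarn-site.xml sketch (property names as added by YARN-3943; the 80.0 value is an illustrative choice - verify both names and defaults against your Hadoop version):

```xml
<!-- A disk is marked bad once utilization exceeds 90%... -->
<property>
  <name>yarn.nodemanager.disk-health-checker.max-disk-utilization-per-disk-percentage</name>
  <value>90.0</value>
</property>
<!-- ...but is only marked good again once utilization drops below 80%,
     so usage hovering around 90% no longer flaps the node's health. -->
<property>
  <name>yarn.nodemanager.disk-health-checker.disk-utilization-watermark-low-per-disk-percentage</name>
  <value>80.0</value>
</property>
```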

> Enhancement for tracking Blacklist in AM Launching
> --------------------------------------------------
>                 Key: YARN-4576
>                 URL: https://issues.apache.org/jira/browse/YARN-4576
>             Project: Hadoop YARN
>          Issue Type: Improvement
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: EnhancementAMLaunchingBlacklist.pdf
> Before YARN-2005, YARN's blacklist mechanism tracked bad nodes per AM: if an AM's attempts
to launch containers on a specific node failed several times, the AM would blacklist that node
in future resource requests. This mechanism works fine for normal containers. However, from
our observation of several clusters: if a problematic node fails to launch an AM, the RM can
pick that same problematic node for the next AM attempts again and again, causing application
failure when the other functional nodes are busy. In the normal case, a customized health-checker
script cannot be sensitive enough to mark a node as unhealthy after only one or two failed
container launches.
> After YARN-2005, we can have a BlacklistManager in each RMApp, so nodes that previously
failed to launch AM attempts for a specific application get blacklisted. To avoid the potential
risk of all nodes being blacklisted by the BlacklistManager, a disable-failure-threshold is
used to stop adding nodes to the blacklist once a certain ratio is reached.
> There are already some enhancements to this AM blacklist mechanism: YARN-4284 addresses
the wider case of AM container launch failures, and YARN-4389 makes the configuration settings
changeable per app to meet app-specific requirements. However, several gaps remain to address
more scenarios:
> 1. We may need a global blacklist instead of each app maintaining a separate one. The reason
is: an AM is more likely to fail on nodes where other AMs have failed before. A quick example: in
a busy cluster, all nodes are busy except two problematic nodes, node a and node b; app1 has already
submitted and failed two AM attempts on a and b. app2 and other apps should wait for the other
busy nodes rather than waste attempts on these two problematic nodes.
> 2. If an AM container failure is recognized as a global event instead of an app's own issue,
we should treat the blacklist not as permanent but as bounded by a specific time window.
> 3. We could have user-defined blacklist policies to address more possible cases and scenarios,
so it is reasonable to make the blacklist policy pluggable.
> 4. For some test scenarios, we could have a whitelist mechanism for AM launching.
> 5. Some minor issues: it appears that an NM reconnect does not refresh the blacklist so far.
> Will try to address all these issues here.
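To make points 1 and 2 above concrete, here is a minimal, hypothetical sketch of a shared AM-launch blacklist that (a) stops growing once it covers a configured fraction of the cluster, per the disable-failure-threshold idea, and (b) expires entries after a time window. The class and method names are invented for illustration; this is not YARN's actual SimpleBlacklistManager.

```java
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

/**
 * Hypothetical sketch of the proposal: a cluster-wide AM-launch blacklist
 * with a disable-failure-threshold (never blacklist more than a fixed
 * fraction of nodes) and a time window after which entries expire.
 */
class AmBlacklist {
    private final int clusterSize;
    private final double disableThreshold; // e.g. 0.2 = at most 20% of nodes
    private final long windowMillis;       // how long an entry stays blacklisted
    private final Map<String, Long> expiry = new LinkedHashMap<>();

    AmBlacklist(int clusterSize, double disableThreshold, long windowMillis) {
        this.clusterSize = clusterSize;
        this.disableThreshold = disableThreshold;
        this.windowMillis = windowMillis;
    }

    /** Record an AM launch failure on a node at time {@code now}. */
    void recordFailure(String node, long now) {
        prune(now);
        // Stop adding nodes once the blacklist hits the configured ratio,
        // so a cluster-wide event cannot blacklist every node.
        if (expiry.containsKey(node)
                || expiry.size() < disableThreshold * clusterSize) {
            expiry.put(node, now + windowMillis);
        }
    }

    /** Current blacklist at time {@code now}, dropping expired entries. */
    Set<String> blacklistedAt(long now) {
        prune(now);
        return expiry.keySet();
    }

    private void prune(long now) {
        Iterator<Map.Entry<String, Long>> it = expiry.entrySet().iterator();
        while (it.hasNext()) {
            if (it.next().getValue() <= now) {
                it.remove();
            }
        }
    }
}
```

With a 10-node cluster and a 0.2 threshold, a third failing node is not added, and all entries drop out once the window passes - so a transient global event stops blacklisting new nodes and heals itself over time.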

This message was sent by Atlassian JIRA
