hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
Date Tue, 02 Feb 2016 15:17:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128389#comment-15128389 ]

Sunil G commented on YARN-4635:

Hi [~djp]
Thanks for sharing the patch so quickly. Overall it looks fine to me.

A few points:
1. The per-app blacklist manager does not need to handle removing a node from its
blacklist. For the global blacklist manager, however, I think we need a {{removeNode}} interface in
{{BlacklistManager}}. If we can launch an AM container on a node at some later point after its
first failure, we can remove that node from the global blacklist immediately. Maybe {{RMAppAttemptImpl#checkStatusForNodeBlacklisting}}
can check for success too (or are we planning to handle this in the ticket that introduces a
time-based clearing mechanism?). Thoughts?

2. I think {{SimpleBlacklistManager#refreshNodeHostCount}} can also pre-compute the failure threshold
when it updates {{numberOfNodeManagerHosts}}, so the threshold does not have to be recomputed on
every {{getBlacklistUpdates}} call. This is a minor suggestion about the existing code.
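
To make points 1 and 2 concrete, here is a minimal, self-contained sketch. It is not the code from YARN-4635.patch: the class name {{GlobalBlacklistSketch}}, the constructor, and the "return nothing once the blacklist reaches the threshold" policy are assumptions made for illustration. Only {{removeNode}}, {{refreshNodeHostCount}}, {{getBlacklistUpdates}} and {{numberOfNodeManagerHosts}} are names taken from the discussion above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified stand-in for a global blacklist manager.
// NOT the patch code; names and policy here are assumptions.
final class GlobalBlacklistSketch {
  // Assumed: fraction of hosts at which blacklisting is switched off.
  private final double disableFailureRatio;
  private final Set<String> blacklistNodes = new HashSet<>();
  private int numberOfNodeManagerHosts;
  // Pre-computed in refreshNodeHostCount (point 2).
  private double failureThreshold;

  GlobalBlacklistSketch(int nodeHostCount, double disableFailureRatio) {
    this.disableFailureRatio = disableFailureRatio;
    refreshNodeHostCount(nodeHostCount);
  }

  synchronized void addNode(String node) {
    blacklistNodes.add(node);
  }

  // Point 1: drop a node once an AM container launches on it successfully.
  synchronized void removeNode(String node) {
    blacklistNodes.remove(node);
  }

  // Point 2: compute the threshold once, whenever the host count changes.
  synchronized void refreshNodeHostCount(int nodeHostCount) {
    this.numberOfNodeManagerHosts = nodeHostCount;
    this.failureThreshold = disableFailureRatio * nodeHostCount;
  }

  // Returns a copy of the blacklist, or an empty list when the blacklist
  // has reached the threshold (assumed policy).
  synchronized List<String> getBlacklistUpdates() {
    if (blacklistNodes.size() < failureThreshold) {
      return new ArrayList<>(blacklistNodes);
    }
    return Collections.emptyList();
  }
}
```

With this shape, a success check in the attempt code would only need to call {{removeNode}} with the host name, and {{getBlacklistUpdates}} does a plain comparison against the stored threshold.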

+        // No thread safe problem as getBlacklistUpdates() in
+        // SimpleBlacklistManager do clone operation to blacklistNodes
+        List<String> amBlacklistAdditions = new ArrayList<String>();

There is a chance of duplicates between the global and per-app blacklists, correct? Could
we use a Set here? One possibility: one AM container fails with ABORTED and its node is added to the
per-app blacklist, then a second attempt fails with DISK_FAILED and the same node is added to the
global one. That would produce a duplicate. Thoughts?
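
A minimal sketch of the Set-based merge, assuming both blacklists are exposed as lists of host names. The class {{BlacklistMergeSketch}} and the method {{mergeAdditions}} are hypothetical names, not part of the patch.

```java
import java.util.ArrayList;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; illustrates de-duplicating the two blacklists.
final class BlacklistMergeSketch {
  private BlacklistMergeSketch() {
  }

  // Merges per-app and global additions, keeping each host only once.
  // LinkedHashSet preserves first-seen order, which keeps output stable.
  static List<String> mergeAdditions(List<String> perApp, List<String> global) {
    Set<String> merged = new LinkedHashSet<>(perApp);
    merged.addAll(global);
    return new ArrayList<>(merged);
  }
}
```

A plain {{HashSet}} would also remove the duplicates; the ordered variant is used here only so the result is deterministic.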


> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635.patch
> We need a global blacklist in addition to each app’s blacklist to track AM container
> failures in global affection. That means we need to differentiate the non-succeed
> ContainerExitStatus reasoning from NM or more related to App.
> For more details, please refer the document in YARN-4576.

This message was sent by Atlassian JIRA
