hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
Date Tue, 02 Feb 2016 16:49:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15128533#comment-15128533 ]

Junping Du commented on YARN-4635:

bq. If we can launch an AM container at some later point of time after the first failure,
we can remove that node immediately from global blacklisting.
In most cases, an AM container won't get a chance to launch on this node again, because the
blacklist mechanism already prevents it from being allocated there. However, there is a corner
case: two AM containers are launched at the same time, and one fails while the other succeeds.
IMO, the successfully completed one shouldn't purge the node from the blacklist as if it were a
normal node, because a failure marked as globally affecting, like DISK_FAILURE, could still hit
upcoming AM containers. In other words, launching an AM on this node remains riskier, and that
is not changed by another AM container finishing. We can discuss how to purge nodes from the
global list (time based, event (NM reconnect) based, etc.) in the dedicated JIRA YARN-4637 that
I filed earlier.

bq. I think SimpleBlacklistManager#refreshNodeHostCount can pre-compute failure threshold
also along with updating numberOfNodeManagerHosts. So whoever is invoking getBlacklistUpdates
need not have to compute always. This is minor suggestion in existing code.
Sounds good. Updated in v2 patch.
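
To illustrate the suggestion, here is a minimal sketch of caching the threshold when the host count is refreshed. It is not the code in the patch: the class ThresholdCachingBlacklist, the method getBlacklistAdditions, and the choice to return an empty list once the threshold is reached are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical stand-in for a blacklist manager; not the Hadoop class.
final class ThresholdCachingBlacklist {
  private final double disableFailureRatio;          // fraction of hosts, e.g. 0.5
  private final Set<String> blacklistedNodes = new HashSet<>();
  private double failureThreshold;                   // cached, recomputed only on refresh

  ThresholdCachingBlacklist(int nodeHostCount, double disableFailureRatio) {
    this.disableFailureRatio = disableFailureRatio;
    refreshNodeHostCount(nodeHostCount);
  }

  // Recompute the threshold once here, so readers never have to.
  synchronized void refreshNodeHostCount(int nodeHostCount) {
    this.failureThreshold = disableFailureRatio * nodeHostCount;
  }

  synchronized void addNode(String host) {
    blacklistedNodes.add(host);
  }

  // Assumption: an empty result means blacklisting is disabled because
  // too many nodes are already on the list.
  synchronized List<String> getBlacklistAdditions() {
    if (blacklistedNodes.size() < failureThreshold) {
      return new ArrayList<>(blacklistedNodes);
    }
    return new ArrayList<>();
  }
}
```

With 4 hosts and a ratio of 0.5, one blacklisted node is returned as-is, while a second one reaches the cached threshold of 2.0 and the list comes back empty until the host count grows.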

bq. There are chances of duplicates from global and per-app level blacklists, correct? So
could we use a Set here. One possibility, one AM container failed due to ABORTED and added
to per-app level blacklist, second attempt failed due to DISK_FAILED and added to global.
Now this will be a duplicate scenario. Thoughts?
Nice catch! Different attempts of the same app won't cause this duplication. The possible
duplicate scenario is: one app's AM fails on this node for a reason like ABORTED while, at the
same time, another app's AM fails on this node with DISK_FAILURE; the same node could then
appear on both lists. Fixed in the v2 patch.
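
A minimal sketch of the Set-based de-duplication, assuming both lists are plain collections of host names. BlacklistUnion and merge are hypothetical names, not part of the patch.

```java
import java.util.ArrayList;
import java.util.Collection;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper; not part of the Hadoop code base.
final class BlacklistUnion {
  private BlacklistUnion() {}

  // Merge the global and per-app lists, keeping first-seen order and
  // dropping any node that appears on both.
  static List<String> merge(Collection<String> global, Collection<String> perApp) {
    Set<String> merged = new LinkedHashSet<>(global);
    merged.addAll(perApp);
    return new ArrayList<>(merged);
  }
}
```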

There is another issue: the threshold control in BlacklistManager is applied to the two lists
(global and per-app) separately, so the two lists together could unexpectedly blacklist all
nodes. We need a thread-safe merge operation across the two BlacklistManagers to address this
problem. I marked a TODO item in the patch and will file a separate JIRA to fix it.
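
One way such a merge could look, as a rough sketch only: both lists are guarded by a single lock and the threshold is checked against the size of their union. CombinedBlacklist and its methods are hypothetical, and the real fix may take a different shape.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical sketch; not the Hadoop BlacklistManager API.
final class CombinedBlacklist {
  private final Object lock = new Object();
  private final Set<String> globalNodes = new HashSet<>();
  private final Set<String> perAppNodes = new HashSet<>();
  private final int nodeHostCount;
  private final double disableFailureRatio;

  CombinedBlacklist(int nodeHostCount, double disableFailureRatio) {
    this.nodeHostCount = nodeHostCount;
    this.disableFailureRatio = disableFailureRatio;
  }

  void addGlobal(String host) {
    synchronized (lock) { globalNodes.add(host); }
  }

  void addPerApp(String host) {
    synchronized (lock) { perAppNodes.add(host); }
  }

  // Apply the threshold to the union of both lists, under one lock, so
  // the two lists together can never exceed it unnoticed.
  List<String> effectiveBlacklist() {
    synchronized (lock) {
      Set<String> union = new TreeSet<>(globalNodes);
      union.addAll(perAppNodes);
      if (union.size() < disableFailureRatio * nodeHostCount) {
        return new ArrayList<>(union);
      }
      return new ArrayList<>(); // assumption: empty means blacklisting is disabled
    }
  }
}
```

With 4 hosts and a ratio of 0.75, a global list of two nodes and a per-app list of two nodes each stay under the threshold of 3.0 on their own, yet their union of three distinct nodes does not, which is the case the separate checks miss.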

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635.patch
> We need a global blacklist in addition to each app’s blacklist to track AM container
> failures that have a global effect. That means we need to differentiate whether a
> non-successful ContainerExitStatus is caused by the NM or is more related to the app.
> For more details, please refer to the document in YARN-4576.

This message was sent by Atlassian JIRA
