hadoop-yarn-issues mailing list archives

From "Varun Vasudev (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
Date Thu, 04 Feb 2016 08:27:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15131938#comment-15131938 ]

Varun Vasudev commented on YARN-4635:
-------------------------------------

I agree with [~jianhe] - 

1) KILLED_EXCEEDED_PMEM and KILLED_EXCEEDED_VMEM are container specific, so there is no reason
to blacklist the node. The AM will be killed for exceeding its pmem and vmem limits irrespective
of other containers on the node.

2) For DISKS_FAILED, the RM should mark the disk as bad and the node should be skipped.
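
To make the distinction in the two points above concrete, here is a minimal sketch of how exit reasons might be split into container-specific and node-level ones. The class `AmFailureClassifier`, the local `ExitReason` enum and the method `shouldBlacklistNode` are hypothetical names used only for illustration; none of them come from the YARN-4635 patch, and real code would work with the constants in YARN's `ContainerExitStatus` rather than a local enum.

```java
import java.util.EnumSet;
import java.util.Set;

// Hypothetical sketch: these names are illustrative, not taken from the patch.
class AmFailureClassifier {

    // Local stand-in for the three exit reasons discussed in this comment.
    enum ExitReason { KILLED_EXCEEDED_PMEM, KILLED_EXCEEDED_VMEM, DISKS_FAILED }

    // Reasons attributable to the node rather than to the container itself.
    private static final Set<ExitReason> NODE_LEVEL =
            EnumSet.of(ExitReason.DISKS_FAILED);

    // Returns true only when the failure should count against the node.
    static boolean shouldBlacklistNode(ExitReason reason) {
        return NODE_LEVEL.contains(reason);
    }
}
```

With this split, an AM killed for exceeding its pmem or vmem limit never adds its node to the global blacklist, while a DISKS_FAILED exit does.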

In addition, I don't see any mechanism for purging the blacklist in this patch. If that's
the case, we should work on this in a feature branch and not commit to trunk/branch-2 until
we have the purge mechanism sorted out.
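
As one possible shape for such a purge mechanism, the sketch below expires blacklist entries after a fixed time-to-live. Time-based expiry is an assumption made purely for illustration; the comment does not say how purging should work, and the class `ExpiringBlacklist` and its methods are hypothetical names, not part of YARN or of the patch.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch: time-based expiry is an assumption, not the patch's design.
class ExpiringBlacklist {
    private final long ttlMillis;
    // Node id -> time (in milliseconds) at which the node was blacklisted.
    private final Map<String, Long> blacklistedAt = new HashMap<>();

    ExpiringBlacklist(long ttlMillis) {
        this.ttlMillis = ttlMillis;
    }

    void add(String nodeId, long nowMillis) {
        blacklistedAt.put(nodeId, nowMillis);
    }

    // Drops every entry that has been blacklisted for at least ttlMillis.
    void purge(long nowMillis) {
        blacklistedAt.values().removeIf(since -> nowMillis - since >= ttlMillis);
    }

    boolean contains(String nodeId) {
        return blacklistedAt.containsKey(nodeId);
    }
}
```

Passing the current time in explicitly keeps the sketch easy to test; a real implementation would also have to decide who triggers the purge and whether a node that recovers should be removed earlier.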

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist, in addition to each app's blacklist, to track AM container
> failures that have a global effect. That means we need to distinguish whether a
> non-successful ContainerExitStatus is caused by the NM or is more related to the app.
> For more details, please refer to the document in YARN-4576.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
