hadoop-yarn-issues mailing list archives

From "Junping Du (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4635) Add global blacklist tracking for AM container failure.
Date Wed, 03 Feb 2016 11:46:39 GMT

    [ https://issues.apache.org/jira/browse/YARN-4635?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15130272#comment-15130272 ]

Junping Du commented on YARN-4635:
----------------------------------

Thanks [~jianhe] for the review and comments.
First, I would like to state an assumption: the blacklist mechanism for AM launching is not
for tracking nodes that do not work at all (unhealthy nodes), but for tracking nodes that are
suspected of failing AM containers based on previous failures. We already have the unhealthy
report mechanism to report serious NM issues, so this is a second mechanism that should have
a higher bar (in some sense, an AM container is more important than other containers), based
on the node's failure history.
My responses below are based on this assumption.
bq. why should the container exit statuses below blacklist the node?
These container failures could be due to resource congestion (like KILLED_EXCEEDED_PMEM) or
an unknown reason (ABORTED, INVALID), which makes this NM more suspect than normal nodes.

bq. For DISKS_FAILED, which is considered a global blacklist reason in this jira, I think in
this case the node will report as unhealthy and the RM should have removed the node already.
Some DISKS_FAILED failures can happen because the failed container filled a disk. The node
may still have other directories available, so it can still launch normal containers, but it
is not suitable to risk an AM container on it.
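
The classification discussed in the two replies above can be sketched as follows. This is a
minimal illustration, not the code in the attached patch: the ExitStatus enum is a local
stand-in for YARN's ContainerExitStatus, the class and method names are hypothetical, and
PREEMPTED is included only as an assumed example of a status this thread does not treat as
node related.

```java
import java.util.EnumSet;
import java.util.Set;

class AmBlacklistSketch {
    // Local stand-in for YARN's ContainerExitStatus (assumption: names only,
    // limited to the statuses named in this thread plus SUCCESS and PREEMPTED).
    enum ExitStatus {
        SUCCESS, INVALID, ABORTED, DISKS_FAILED, KILLED_EXCEEDED_PMEM, PREEMPTED
    }

    // Statuses that this thread argues make a node suspect for AM launches.
    static final Set<ExitStatus> NODE_SUSPECT = EnumSet.of(
            ExitStatus.INVALID,
            ExitStatus.ABORTED,
            ExitStatus.DISKS_FAILED,
            ExitStatus.KILLED_EXCEEDED_PMEM);

    // Hypothetical helper: true if a failed AM container with this status
    // should add its node to the global AM blacklist.
    static boolean shouldBlacklistNodeForAm(ExitStatus status) {
        return NODE_SUSPECT.contains(status);
    }

    public static void main(String[] args) {
        for (ExitStatus status : ExitStatus.values()) {
            System.out.println(status + " -> " + shouldBlacklistNodeForAm(status));
        }
    }
}
```

Keeping the node-suspect statuses in one set makes the "higher bar" explicit: a node can stay
healthy for normal containers and still be skipped for AM containers.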

bq. AMBlackListingRequest contains a boolean flag and a threshold number. Do you think it’s
ok to just use the threshold number only? 0 means disabled, and numbers larger than 0 mean
enabled?
If so, the job submitter has to understand how many nodes the current cluster has, and the
job parameter has to be updated if the job is submitted to a different cluster (with a
different number of nodes). IMO, that adds complexity for users.
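
The arithmetic behind this concern can be sketched as follows, assuming (this is a reading of
the reply, not something stated in the thread) that the single threshold would be an absolute
number of nodes. The class and method names are hypothetical.

```java
class ThresholdSketch {
    // Hypothetical helper: the share of the cluster that an absolute
    // blacklist limit represents. Assumes the threshold is a node count.
    static double shareOfCluster(int maxBlacklistedNodes, int clusterNodes) {
        if (clusterNodes <= 0) {
            throw new IllegalArgumentException("clusterNodes must be positive");
        }
        return (double) maxBlacklistedNodes / clusterNodes;
    }

    public static void main(String[] args) {
        int limit = 5; // the same job parameter submitted to two clusters
        System.out.println("10-node cluster:   " + shareOfCluster(limit, 10));
        System.out.println("1000-node cluster: " + shareOfCluster(limit, 1000));
    }
}
```

Under that assumption, a limit of 5 nodes covers half of a 10-node cluster but only 0.5% of a
1000-node cluster, so the submitter would have to retune the value for each cluster.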

> Add global blacklist tracking for AM container failure.
> -------------------------------------------------------
>
>                 Key: YARN-4635
>                 URL: https://issues.apache.org/jira/browse/YARN-4635
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: resourcemanager
>            Reporter: Junping Du
>            Assignee: Junping Du
>            Priority: Critical
>         Attachments: YARN-4635-v2.patch, YARN-4635.patch
>
>
> We need a global blacklist, in addition to each app’s blacklist, to track AM container
> failures that have a global effect. That means we need to distinguish non-successful
> ContainerExitStatus reasons caused by the NM from those more related to the app.
> For more details, please refer to the document in YARN-4576.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
