hadoop-yarn-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-7790) Improve Capacity Scheduler Async Scheduling to better handle node failures
Date Tue, 23 Jan 2018 12:20:00 GMT

     [ https://issues.apache.org/jira/browse/YARN-7790?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wangda Tan updated YARN-7790:
-----------------------------
    Attachment: YARN-7790.001.patch

> Improve Capacity Scheduler Async Scheduling to better handle node failures
> --------------------------------------------------------------------------
>
>                 Key: YARN-7790
>                 URL: https://issues.apache.org/jira/browse/YARN-7790
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-7790.001.patch
>
>
> This is not a new issue, but async scheduling makes it worse.
> In sync scheduling, when an AM container is allocated to a node, the scheduler can assume
> the node has just sent a heartbeat to the RM, and the allocation is sent back to the NM in
> the same heartbeat response. The NM can still crash right after the heartbeat, which causes
> the AM to hang for 10 minutes, but this is relatively rare.
> In async scheduling, multiple AM containers can be placed on a problematic NM, which can
> cause applications to hang for a long time. After discussing with [~sunilg], we need at
> least two fixes.
> When async scheduling is enabled:
> 1) Skip any node that has missed X node heartbeats.
> 2) Kill any AM container in the ALLOCATED state on a node that has missed Y node heartbeats.
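
The two proposed fixes can be sketched roughly as follows. This is a minimal, self-contained illustration and not the code in YARN-7790.001.patch: the class NodeHeartbeatPolicy, the AmContainer type, the method names, and the way missed heartbeats are counted (whole heartbeat intervals elapsed since the last heartbeat) are all assumptions made for the example and are not Hadoop APIs.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical sketch of the two checks proposed in the issue.
 * None of these names come from Hadoop; they are illustrative only.
 */
final class NodeHeartbeatPolicy {
    /** Hypothetical container states; only ALLOCATED matters here. */
    enum ContainerState { ALLOCATED, ACQUIRED, RUNNING }

    /** Hypothetical stand-in for an AM container placed on a node. */
    static final class AmContainer {
        final String id;
        final ContainerState state;

        AmContainer(String id, ContainerState state) {
            this.id = id;
            this.state = state;
        }
    }

    private final long heartbeatIntervalMs;
    private final long skipAfterMissed; // "X" in the issue description
    private final long killAfterMissed; // "Y" in the issue description

    NodeHeartbeatPolicy(long heartbeatIntervalMs, long skipAfterMissed, long killAfterMissed) {
        if (heartbeatIntervalMs <= 0) {
            throw new IllegalArgumentException("heartbeatIntervalMs must be positive");
        }
        this.heartbeatIntervalMs = heartbeatIntervalMs;
        this.skipAfterMissed = skipAfterMissed;
        this.killAfterMissed = killAfterMissed;
    }

    /** Assumption: missed heartbeats = whole intervals elapsed since the last heartbeat. */
    long missedHeartbeats(long lastHeartbeatMs, long nowMs) {
        long elapsed = Math.max(0L, nowMs - lastHeartbeatMs);
        return elapsed / heartbeatIntervalMs;
    }

    /** Fix 1: do not place new containers on a node that missed at least X heartbeats. */
    boolean shouldSkipNode(long lastHeartbeatMs, long nowMs) {
        return missedHeartbeats(lastHeartbeatMs, nowMs) >= skipAfterMissed;
    }

    /** Fix 2: select ALLOCATED AM containers on a node that missed at least Y heartbeats. */
    List<AmContainer> containersToKill(List<AmContainer> onNode, long lastHeartbeatMs, long nowMs) {
        List<AmContainer> toKill = new ArrayList<>();
        if (missedHeartbeats(lastHeartbeatMs, nowMs) < killAfterMissed) {
            return toKill; // node is still considered healthy enough
        }
        for (AmContainer c : onNode) {
            if (c.state == ContainerState.ALLOCATED) {
                toKill.add(c);
            }
        }
        return toKill;
    }

    public static void main(String[] args) {
        // Example values only: 1 s heartbeat interval, X = 2, Y = 5.
        NodeHeartbeatPolicy policy = new NodeHeartbeatPolicy(1000L, 2L, 5L);
        List<AmContainer> onNode = new ArrayList<>();
        onNode.add(new AmContainer("am-1", ContainerState.ALLOCATED));
        onNode.add(new AmContainer("am-2", ContainerState.RUNNING));

        System.out.println("skip node: " + policy.shouldSkipNode(0L, 2500L));
        System.out.println("containers to kill: " + policy.containersToKill(onNode, 0L, 6000L).size());
    }
}
```

In this sketch the skip check would run before the async scheduling thread proposes an allocation on a node, and the kill check would run periodically. Choosing Y larger than X means a node first stops receiving new containers and only later has its unlaunched AM containers reclaimed.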



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

