hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4685) AM blacklisting result in application to get hanged
Date Mon, 21 Mar 2016 17:29:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4685?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15204685#comment-15204685 ]

Rohith Sharma K S commented on YARN-4685:

Initially I thought of fixing this by triggering another allocate call whenever there is a
node update event ({{RMApp->RMAppImpl}}). But there can be a case where the new allocate call
gets the master container before RMAppAttemptImpl receives the container-allocated event. In
that case RMAppAttemptImpl would need its own handling mechanism, and many similar cases can
occur. So this option does not work.

The other approach to fixing this issue is to recompute the blacklist threshold, EITHER on
node-added and node-removed events OR on every heartbeat, for *ALL* the apps that are waiting
for AM container allocation, and to update appschedulinginfo with the new {{amBlacklist}}.
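
A rough sketch of the event-driven variant, to make the idea concrete. This is not YARN code: {{WaitingAttempt}} and {{AmBlacklistRefresher}} are hypothetical names, and the rule "drop the blacklist once it covers at least a threshold share of the hosts" is an assumption made for illustration.

```java
import java.util.Collections;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

// Hypothetical model of an app attempt still waiting for its AM container.
class WaitingAttempt {
  final String appId;
  final Set<String> failedHosts = new HashSet<>();   // hosts where earlier AM attempts failed
  Set<String> amBlacklist = Collections.emptySet();  // hosts the scheduler is told to avoid

  WaitingAttempt(String appId, String... failedHosts) {
    this.appId = appId;
    Collections.addAll(this.failedHosts, failedHosts);
  }
}

// Hypothetical coordinator: recomputes every waiting attempt's blacklist on node events.
class AmBlacklistRefresher {
  private final double disableThreshold;  // assumed fraction of hosts, e.g. 0.5
  private final Set<String> activeHosts = new HashSet<>();
  private final Map<String, WaitingAttempt> waiting = new LinkedHashMap<>();

  AmBlacklistRefresher(double disableThreshold) {
    this.disableThreshold = disableThreshold;
  }

  void register(WaitingAttempt attempt) {
    waiting.put(attempt.appId, attempt);
    recompute(attempt);
  }

  // Once the AM container is allocated, the attempt no longer needs refreshing.
  void amContainerAllocated(String appId) {
    waiting.remove(appId);
  }

  void nodeAdded(String host) {
    if (activeHosts.add(host)) {
      recomputeAll();
    }
  }

  void nodeRemoved(String host) {
    if (activeHosts.remove(host)) {
      recomputeAll();
    }
  }

  private void recomputeAll() {
    for (WaitingAttempt attempt : waiting.values()) {
      recompute(attempt);
    }
  }

  // Assumed rule: keep the blacklist only while it covers less than
  // disableThreshold * (current host count) hosts; otherwise clear it.
  private void recompute(WaitingAttempt attempt) {
    double limit = disableThreshold * activeHosts.size();
    if (attempt.failedHosts.size() < limit) {
      attempt.amBlacklist = Collections.unmodifiableSet(new HashSet<>(attempt.failedHosts));
    } else {
      attempt.amBlacklist = Collections.emptySet();
    }
  }
}
```

In this sketch the cost is one pass over the waiting attempts per node event. Doing the same on every heartbeat would trade extra work for not having to hook the node events.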

> AM blacklisting result in application to get hanged
> ---------------------------------------------------
>                 Key: YARN-4685
>                 URL: https://issues.apache.org/jira/browse/YARN-4685
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.8.0
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
> AM blacklist addition or removal is updated only when the RMAppAttempt is scheduled, i.e.
> in {{RMAppAttemptImpl#ScheduleTransition#transition}}. But once the attempt is scheduled,
> any removeNode/addNode in the cluster is not propagated to
> {{BlackListManager#refreshNodeHostCount}}. This leads BlackListManager to operate on a stale
> NM count, so the application stays in ACCEPTED state and waits forever, even if more nodes
> are added to the cluster.
> The solution is to update BlacklistManager on every
> {{RMAppAttemptImpl#AMContainerAllocatedTransition#transition}} call. This ensures that any
> addition or removal of nodes is reflected in BlacklistManager.
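
The effect of a stale host count can be illustrated with a small stand-alone model. This is only a sketch under assumptions: {{ThresholdBlacklist}} is a hypothetical class, not the YARN BlacklistManager, and the threshold rule (the blacklist is dropped once it reaches a given share of the host count) is assumed for the example. Only the method name {{refreshNodeHostCount}} is taken from the description above.

```java
import java.util.Collections;
import java.util.LinkedHashSet;
import java.util.Set;

// Hypothetical, simplified stand-in for a blacklist manager; not the YARN class.
class ThresholdBlacklist {
  private final double disableThreshold;  // assumed fraction of hosts, e.g. 0.8
  private final Set<String> failedHosts = new LinkedHashSet<>();
  private int hostCount;                  // last host count this object was told about

  ThresholdBlacklist(int hostCount, double disableThreshold) {
    this.hostCount = hostCount;
    this.disableThreshold = disableThreshold;
  }

  void addFailedHost(String host) {
    failedHosts.add(host);
  }

  // Plays the role of refreshNodeHostCount from the issue description.
  void refreshNodeHostCount(int hostCount) {
    this.hostCount = hostCount;
  }

  // Hosts to avoid for the AM container. Returns an empty set once the
  // blacklist would cover at least disableThreshold * hostCount hosts.
  Set<String> effectiveBlacklist() {
    double limit = disableThreshold * hostCount;
    if (failedHosts.size() < limit) {
      return Collections.unmodifiableSet(new LinkedHashSet<>(failedHosts));
    }
    return Collections.emptySet();
  }
}
```

For example, with 10 hosts and a threshold of 0.8, two failed hosts stay blacklisted. If the cluster then shrinks to exactly those two hosts and the count is never refreshed, the model keeps excluding every remaining node. After {{refreshNodeHostCount(2)}} the limit becomes 1.6 and the blacklist is dropped.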

This message was sent by Atlassian JIRA
