hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
Date Sat, 28 Mar 2015 07:59:52 GMT

    [ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385185#comment-14385185
] 

zhihai xu commented on YARN-3415:
---------------------------------

I can work on this issue, I read the problem at https://github.com/apache/spark/pull/5233#issuecomment-87160289.
It looks like the issue can be fixed by checking whether the AM is running before call addAMResourceUsage.
We should only addAMResourceUsage when AM is not running.
{code}
      if (!isAmRunning() && getLiveContainers().size() == 1 && !getUnmanagedAM())
{
        getQueue().addAMResourceUsage(container.getResource());
        setAmRunning(true);
      }
{code}


> Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
> ---------------------------------------------------------------------------------
>
>                 Key: YARN-3415
>                 URL: https://issues.apache.org/jira/browse/YARN-3415
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: Rohit Agarwal
>
> We encountered this problem while running a spark cluster. The amResourceUsage for a
queue became artificially high and then the cluster got deadlocked because the maxAMShare
constrain kicked in and no new AM got admitted to the cluster.
> I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289
> In summary - the condition for adding the container's memory towards amResourceUsage
is fragile. It depends on the number of live containers belonging to the app. We saw that
the spark AM went down without explicitly releasing its requested containers and then one
of those containers memory was counted towards amResource.
> cc - [~sandyr]



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message