hadoop-yarn-issues mailing list archives

From "zhihai xu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3415) Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
Date Sun, 29 Mar 2015 08:20:52 GMT

    [ https://issues.apache.org/jira/browse/YARN-3415?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14385672#comment-14385672 ]

zhihai xu commented on YARN-3415:

[~ragarwal], thanks for the comment.
bq. 1. If the above approach is valid - why do we need the getLiveContainers() check at all?
Totally agree. If we check !isAmRunning(), the getLiveContainers() check is redundant.
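A minimal, self-contained sketch of the guard being discussed (hypothetical names modeled on FSAppAttempt/FairScheduler fields, not the actual YARN code):

```java
// Hypothetical model of the amResourceUsage guard; names mirror
// FSAppAttempt/FairScheduler but this is a simplified sketch.
public class AmShareSketch {
  static class AppAttempt {
    boolean amRunning = false;   // set once per attempt, never reset
    boolean unmanagedAM = false;
    int amResourceMB = 0;
  }

  // Called when a container is allocated to the app; returns the
  // queue's updated amResourceUsage (in MB).
  static int maybeChargeAm(AppAttempt app, int queueAmUsageMB, int containerMB) {
    // With !amRunning as the guard, a getLiveContainers().isEmpty()
    // check is redundant: only the first allocated container (the AM)
    // can observe amRunning == false.
    if (!app.amRunning && !app.unmanagedAM) {
      app.amRunning = true;
      app.amResourceMB = containerMB;
      return queueAmUsageMB + containerMB; // charge only the AM container
    }
    return queueAmUsageMB;                 // non-AM containers not charged
  }

  public static void main(String[] args) {
    AppAttempt app = new AppAttempt();
    int amUsage = 0;
    amUsage = maybeChargeAm(app, amUsage, 1024); // AM container
    amUsage = maybeChargeAm(app, amUsage, 2048); // later non-AM container
    System.out.println(amUsage);
  }
}
```

Only the first container's 1024 MB is charged; the later 2048 MB container leaves amResourceUsage untouched.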

bq. 2. I don't see any place where we are setting amRunning to false once it is set to true. Should we do that for completeness?
We don't need to reset it to false: each FSAppAttempt has only one AM, and once the FSAppAttempt is removed, it will be garbage collected along with the flag.

bq. 3. Why is there no getUnmanagedAM() check in removeApp where we are subtracting from amResourceUsage? I think the conditions for adding and subtracting amResourceUsage should be as similar as possible.
Totally agree; checking getUnmanagedAM() there would improve readability.
It currently works because we check getUnmanagedAM() when we call setAMResource in FairScheduler#allocate,
so if getUnmanagedAM() is true, app.getAMResource() returns Resources.none().
We can also remove the app.getAMResource() != null check, because the following code guarantees it never returns null:
  private Resource _get(String label, ResourceType type) {
    try {
      readLock.lock();
      UsageByLabel usage = usages.get(label);
      if (null == usage) {
        return Resources.none();
      }
      return normalize(usage.resArr[type.idx]);
    } finally {
      readLock.unlock();
    }
  }

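For symmetry with the allocate side, the removeApp subtraction could mirror the same guard. A hedged, self-contained sketch (hypothetical names, not the actual FairScheduler code):

```java
// Hypothetical sketch of a symmetric guard in removeApp.
public class RemoveAppSketch {
  // Returns the queue's amResourceUsage (in MB) after removing an app.
  static int removeApp(int queueAmUsageMB, boolean unmanagedAM, int amResourceMB) {
    // Mirror the allocate-side guard: unmanaged AMs were never charged,
    // so skip the subtraction for them as well. Because getAMResource()
    // defaults to Resources.none() rather than null, no null check is needed.
    if (!unmanagedAM) {
      return queueAmUsageMB - amResourceMB;
    }
    return queueAmUsageMB;
  }

  public static void main(String[] args) {
    System.out.println(removeApp(3072, false, 1024)); // managed AM: subtract
    System.out.println(removeApp(3072, true, 0));     // unmanaged AM: unchanged
  }
}
```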
About my previous comment:
bq. It looks like we should also check isAmRunning at FairScheduler#allocate
Checking isAmRunning at FairScheduler#allocate is not necessary, because all containers other than the AM container are allocated by the AM itself; once the AM container finishes, FairScheduler#allocate will not be called again for that attempt.

I will upload a patch with a test case for this issue.

> Non-AM containers can be counted towards amResourceUsage of a fairscheduler queue
> ---------------------------------------------------------------------------------
>                 Key: YARN-3415
>                 URL: https://issues.apache.org/jira/browse/YARN-3415
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: Rohit Agarwal
>            Assignee: zhihai xu
>            Priority: Critical
> We encountered this problem while running a Spark cluster. The amResourceUsage for a queue became artificially high, and then the cluster deadlocked because the maxAMShare constraint kicked in and no new AM was admitted to the cluster.
> I have described the problem in detail here: https://github.com/apache/spark/pull/5233#issuecomment-87160289
> In summary: the condition for adding a container's memory towards amResourceUsage is fragile. It depends on the number of live containers belonging to the app. We saw the Spark AM go down without explicitly releasing its requested containers, and then the memory of one of those containers was counted towards amResourceUsage.
> cc - [~sandyr]

This message was sent by Atlassian JIRA
