hadoop-yarn-issues mailing list archives

From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4148) When killing app, RM releases app's resource before they are released by NM
Date Tue, 28 Jun 2016 10:26:57 GMT

    [ https://issues.apache.org/jira/browse/YARN-4148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15352762#comment-15352762 ]

Jun Gong commented on YARN-4148:

Sorry for the late reply. Thanks [~jlowe] for your patch; it is more reasonable than mine.
Assigning the issue to you now.

We could have the RM wait until it receives hard confirmation from the NM before it releases
the resources associated with a container, but that would needlessly slow down scheduling
in some cases.
The proposal is very reasonable. Does it work well in your cluster? Killing an app
is not the only case that leads to a state mismatch between the NMs and the RM. When an app
completes without cleaning up some of its containers, the RM needs to wait for those unfinished
containers to finish too. In these cases the RM's available resources might look smaller than
before, because the RM has not actually released them. Will that affect scheduling speed on a
busy cluster?

One way to solve it is to assume the container resources could still be "used" until it has
had a chance to tell the NM that the container token for that container is no longer valid
and confirmed in a subsequent NM heartbeat that the container has not appeared since.
How about this idea: the RM considers the resource used until the container becomes RUNNING
(at which point the RM kills it) or becomes invalid? However, this would keep the resource
unavailable even after it has been freed.
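
A minimal sketch of the accounting discussed above, to make the trade-off concrete. It is not YARN code and not any of the attached patches: `NodeView`, `pending_release` and `on_heartbeat` are hypothetical names, and the confirmation step is simplified to "the NM heartbeat lists the container as completed".

```python
from dataclasses import dataclass, field


@dataclass
class NodeView:
    """RM-side view of one NM (hypothetical, for illustration only)."""
    total: int
    allocated: dict = field(default_factory=dict)        # container id -> units in use
    pending_release: dict = field(default_factory=dict)  # killed, not yet confirmed by the NM

    def available(self) -> int:
        # Killed-but-unconfirmed containers still count as used.
        in_use = sum(self.allocated.values()) + sum(self.pending_release.values())
        return self.total - in_use

    def allocate(self, cid: str, units: int) -> bool:
        if units > self.available():
            return False
        self.allocated[cid] = units
        return True

    def kill(self, cid: str) -> None:
        # Move the container to pending instead of freeing it immediately.
        self.pending_release[cid] = self.allocated.pop(cid)

    def on_heartbeat(self, completed: set) -> None:
        # Free only the containers the NM reports as completed.
        for cid in completed & set(self.pending_release):
            del self.pending_release[cid]


node = NodeView(total=6)
node.allocate("A1", 6)
node.kill("A1")
granted_before = node.allocate("B1", 3)  # refused: the NM has not confirmed yet
node.on_heartbeat({"A1"})
granted_after = node.allocate("B1", 3)   # granted once the heartbeat confirms
```

The cost raised above is visible here: between `kill` and the confirming heartbeat the 6 units are unavailable to the scheduler, even if the NM has in fact already freed them.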

> When killing app, RM releases app's resource before they are released by NM
> ---------------------------------------------------------------------------
>                 Key: YARN-4148
>                 URL: https://issues.apache.org/jira/browse/YARN-4148
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-4148.001.patch, YARN-4148.wip.patch, free_in_scheduler_but_not_node_prototype-branch-2.7.patch
> When killing an app, the RM scheduler releases the app's resources as soon as possible, and
> might then allocate those resources to new requests. But the NM has not released them at that time.
> The problem was found when we supported GPU as a resource (YARN-4122). Test environment:
> an NM had 6 GPUs, app A used all 6 GPUs, and app B was requesting 3 GPUs. We killed app A; the RM
> released A's 6 GPUs and allocated 3 GPUs to B. But when B tried to start a container on the NM,
> the NM found it did not have 3 GPUs to allocate because it had not yet released A's GPUs.
> I think the problem also exists for CPU/memory. It might cause OOM when memory is overused.
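
The GPU scenario in the description can be replayed with a few lines of bookkeeping. This is only an illustrative sketch: `simulate_kill` and its parameters are made-up names, and the RM and NM are each reduced to a single counter.

```python
def simulate_kill(node_gpus: int, app_a_gpus: int, app_b_request: int):
    """Replay the race: the RM frees A's GPUs on kill, the NM has not yet."""
    rm_used = app_a_gpus  # what the RM scheduler thinks is in use
    nm_used = app_a_gpus  # what the NM actually still holds

    # App A is killed: the RM releases its resources immediately,
    # but the NM has not finished cleaning up A's containers.
    rm_used -= app_a_gpus

    rm_grants = app_b_request <= node_gpus - rm_used     # scheduler's decision
    nm_can_start = app_b_request <= node_gpus - nm_used  # reality on the node
    return rm_grants, nm_can_start


# Numbers from the report: 6 GPUs on the NM, A holds 6, B asks for 3.
rm_grants, nm_can_start = simulate_kill(6, 6, 3)
```

With the reported numbers the RM grants B's request while the NM cannot start the container, which is the mismatch described above.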
