hadoop-yarn-issues mailing list archives

From "Chengbing Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
Date Sun, 04 Jan 2015 03:55:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14263747#comment-14263747 ]

Chengbing Liu commented on YARN-2997:

{quote}
I think we can simplify the logic in getContainerStatuses as such:
{quote}
It seems that if we do not remove the containers whose app has already stopped, we rely
entirely on the heartbeat response from the RM to remove containers acked by the AM. If
something goes wrong on the AM or RM side, the NM will never remove these containers from
its context. So in my opinion, that is a potential leak.
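
To illustrate the concern, here is a minimal sketch of the cleanup path I have in mind. All names in it ({{ContainerCleanupSketch}}, {{TrackedContainer}}, {{collectStatusesForHeartbeat}}, {{onHeartbeatResponse}}) are hypothetical stand-ins, not the actual NodeManager classes, and reporting a container one last time before dropping it is an assumption of the sketch rather than what the patch necessarily does.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.Iterator;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Set;

// Hypothetical sketch only; these are not the real NodeManager types.
class ContainerCleanupSketch {
    // Simplified stand-in for a container tracked in the NM context.
    static final class TrackedContainer {
        final String appId;
        final boolean completed;

        TrackedContainer(String appId, boolean completed) {
            this.appId = appId;
            this.completed = completed;
        }
    }

    private final Map<String, TrackedContainer> containers =
            new LinkedHashMap<String, TrackedContainer>();
    private final Set<String> stoppedApps = new HashSet<String>();

    void track(String containerId, String appId, boolean completed) {
        containers.put(containerId, new TrackedContainer(appId, completed));
    }

    void markAppStopped(String appId) {
        stoppedApps.add(appId);
    }

    int trackedCount() {
        return containers.size();
    }

    // Builds the list of container IDs to report in the next heartbeat.
    // Completed containers whose app has stopped are dropped locally right
    // after being reported, so cleanup does not depend on an ack from the RM.
    List<String> collectStatusesForHeartbeat() {
        List<String> report = new ArrayList<String>();
        Iterator<Map.Entry<String, TrackedContainer>> it =
                containers.entrySet().iterator();
        while (it.hasNext()) {
            Map.Entry<String, TrackedContainer> entry = it.next();
            report.add(entry.getKey());
            TrackedContainer c = entry.getValue();
            if (c.completed && stoppedApps.contains(c.appId)) {
                it.remove();
            }
        }
        return report;
    }

    // The ack-driven path: containers of still-running apps are removed only
    // when the heartbeat response lists them as acknowledged.
    void onHeartbeatResponse(List<String> ackedContainerIds) {
        for (String id : ackedContainerIds) {
            containers.remove(id);
        }
    }

    public static void main(String[] args) {
        ContainerCleanupSketch nm = new ContainerCleanupSketch();
        nm.track("container_1", "app_1", true);
        nm.track("container_2", "app_2", true);
        nm.markAppStopped("app_1");
        System.out.println(nm.collectStatusesForHeartbeat().size());
        System.out.println(nm.trackedCount());
    }
}
```

In this sketch, a container of a stopped app leaves the context even if no ack ever arrives, while a container of a running app stays until {{onHeartbeatResponse}} removes it.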

{quote}
the sub class has the equal method.
{quote}
Yes, you are right. However, I'm still not sure it is a good idea to use {{Set<ContainerStatus>}}
instead of {{Map<ContainerId, ContainerStatus>}}, for the following reasons:
* {{ContainerId}} is a unique identifier for a container, while {{ContainerStatus}} can
change over time, even for the same container.
* We want to ensure that no duplicate container status is reported to the RM. {{ContainerStatus}}
contains not only the container ID, but also the container state, exit status and diagnostic
message, so we may run into a situation where we report two different {{ContainerStatus}}
objects with the same ID but different states or other fields.
* {{ContainerId}} has an {{equals}} method and is annotated as public and stable, while {{ContainerStatus}}
has no {{equals}} method and {{ContainerStatusPBImpl}} is annotated as private and unstable.
It may not be a good idea to rely on the implementation of {{ContainerStatus}}.
* {{Set<ContainerStatus>}} is never used in the current code base.
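
A small sketch of the duplicate-report problem follows. The {{Status}} class is a hypothetical stand-in for {{ContainerStatus}}, and its field-based {{equals}} is an assumption made for illustration; the real behavior depends on the implementation class.

```java
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Objects;
import java.util.Set;

// Hypothetical sketch only; Status is not the real ContainerStatus.
class StatusDedupSketch {
    // Stand-in status whose equality covers every field (an assumption).
    static final class Status {
        final String containerId;
        final String state;

        Status(String containerId, String state) {
            this.containerId = containerId;
            this.state = state;
        }

        @Override
        public boolean equals(Object o) {
            if (this == o) {
                return true;
            }
            if (!(o instanceof Status)) {
                return false;
            }
            Status other = (Status) o;
            return containerId.equals(other.containerId)
                    && state.equals(other.state);
        }

        @Override
        public int hashCode() {
            return Objects.hash(containerId, state);
        }
    }

    // Set keyed on the whole status: two states of one container both survive.
    static Set<Status> pendingAsSet(Status... updates) {
        Set<Status> pending = new HashSet<Status>();
        for (Status s : updates) {
            pending.add(s);
        }
        return pending;
    }

    // Map keyed on the container ID: a later status replaces the earlier one.
    static Map<String, Status> pendingAsMap(Status... updates) {
        Map<String, Status> pending = new LinkedHashMap<String, Status>();
        for (Status s : updates) {
            pending.put(s.containerId, s);
        }
        return pending;
    }

    public static void main(String[] args) {
        Status running = new Status("container_1", "RUNNING");
        Status complete = new Status("container_1", "COMPLETE");
        System.out.println(pendingAsSet(running, complete).size()); // prints 2
        System.out.println(pendingAsMap(running, complete).size()); // prints 1
    }
}
```

With the map, the same container can only ever have one pending entry, regardless of how the status class defines equality.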

{quote}
that's limitation of the test, we should fix the tests.
{quote}
Yes, I see. I will fix them.

> NM keeps sending finished containers to RM until app is finished
> ----------------------------------------------------------------
>                 Key: YARN-2997
>                 URL: https://issues.apache.org/jira/browse/YARN-2997
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Chengbing Liu
>         Attachments: YARN-2997.2.patch, YARN-2997.patch
> We have seen in RM log a lot of
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by NM sending completed containers repeatedly until the app is finished.
> On the RM side, the container is already released, hence {{getRMContainer}} returns null.

This message was sent by Atlassian JIRA
