hadoop-yarn-issues mailing list archives

From "Chengbing Liu (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-2997) NM keeps sending finished containers to RM until app is finished
Date Tue, 06 Jan 2015 02:24:34 GMT

    [ https://issues.apache.org/jira/browse/YARN-2997?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14265572#comment-14265572 ]

Chengbing Liu commented on YARN-2997:

{quote}
I think this is not possible given that we are looping over this.context.getContainers(), which
is based on a containerId-to-Container map. Or we can just use a list.
{quote}
We are looping over {{context.getContainers()}}, plus possible remainders from the previous
heartbeat (in case of a lost heartbeat). If a previously completed container has its status
changed somehow, two different {{ContainerStatus}} objects with the same ID would be reported.
That's why I use a map and call {{pendingCompletedContainers.put(containerId, containerStatus)}}
instead of {{containerStatuses.add(containerStatus)}} directly: to prevent such duplicates.
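The map-versus-list point can be illustrated with a small standalone sketch. It does not use the real YARN classes: {{Status}}, {{mergeCompleted}} and the string container IDs below are hypothetical stand-ins, and the actual patch may differ. It only shows that keying by container ID leaves one status per container even when a leftover from a lost heartbeat and a fresh status share the same ID.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Hypothetical stand-ins for illustration only; these are NOT the real
// YARN ContainerId / ContainerStatus classes or the patch itself.
final class HeartbeatDedupSketch {

    /** Simplified status: a container ID plus a free-form state string. */
    static final class Status {
        final String containerId;
        final String state;

        Status(String containerId, String state) {
            this.containerId = containerId;
            this.state = state;
        }

        @Override
        public String toString() {
            return containerId + "=" + state;
        }
    }

    /**
     * Merges statuses left over from a previous (possibly lost) heartbeat
     * with the statuses found in the current context. Map.put replaces any
     * existing entry for the same ID, so each container is reported once.
     */
    static List<Status> mergeCompleted(Map<String, Status> pendingCompleted,
                                       List<Status> current) {
        for (Status s : current) {
            pendingCompleted.put(s.containerId, s);
        }
        return new ArrayList<>(pendingCompleted.values());
    }

    public static void main(String[] args) {
        // Leftover from a heartbeat that never reached the RM.
        Map<String, Status> pending = new LinkedHashMap<>();
        pending.put("container_01", new Status("container_01", "COMPLETE (old)"));

        // The same container shows up again with a changed status.
        List<Status> current = new ArrayList<>();
        current.add(new Status("container_01", "COMPLETE (updated)"));
        current.add(new Status("container_02", "COMPLETE"));

        // With a plain list, container_01 would appear twice here.
        System.out.println(mergeCompleted(pending, current));
    }
}
```

Appending to a list instead would report the same container ID twice with two different statuses in this scenario.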
{quote}
then we should send the pendingCompletedContainers in getNMContainerStatuses method too
{quote}
We may not need to change {{getNMContainerStatuses}}, as it sends all container statuses
in the NM context, except for containers whose application is not in the NM context. I think
that will cover all elements in {{pendingCompletedContainers}}. A lost heartbeat is also not
a problem for {{getNMContainerStatuses}}.
{quote}
or we can just put it at the last line of removeOrTrackCompletedContainersFromContext so as
to avoid the newly added method.
{quote}
That's a good idea. I will change this in the next patch. Thanks for your advice!

> NM keeps sending finished containers to RM until app is finished
> ----------------------------------------------------------------
>                 Key: YARN-2997
>                 URL: https://issues.apache.org/jira/browse/YARN-2997
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: nodemanager
>    Affects Versions: 2.6.0
>            Reporter: Chengbing Liu
>         Attachments: YARN-2997.2.patch, YARN-2997.3.patch, YARN-2997.patch
> We have seen a lot of the following in the RM log:
> {quote}
> INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Null container completed...
> {quote}
> It is caused by the NM sending completed containers repeatedly until the app is finished.
> On the RM side, the container is already released, hence {{getRMContainer}} returns null.

