hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4862) Handle duplicate completed containers in RMNodeImpl
Date Tue, 21 Jun 2016 10:42:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-4862?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15341538#comment-15341538
] 

Rohith Sharma K S commented on YARN-4862:
-----------------------------------------

Hi [~jianhe], apologies for the long delay!
  In the normal flow, the NM informs the RM that a container has finished. The RM in turn waits
for the AM to pull the finished containers, and once the AM has pulled them, the RM tells the NM
to remove them from the NMContext.

In the preemption flow:
# The RM preempts the containers, which first informs RMContainerImpl via KillContainer. 
# KillContainer#transition informs RMNodeImpl to clean up the containers and also informs
RMAppAttemptImpl to add them to JustFinishedContainers, so that the AM pulls the finished
containers on its next heartbeat. It is assumed that containersToCleanUp is sent to the NM first
and that containersToBeRemovedFromNM follows in a later NM heartbeat. 

I see a *potential container leak in the NodeManager* in the preemption flow. There can be a
situation where {{containersToCleanUp}} and {{containersToBeRemovedFromNM}} go out together in
the same heartbeat. If the same containerId is sent to the NM in both lists at once, the
container will never be removed from the NMContext.
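
To make the condition concrete, here is a minimal sketch that reports the container IDs carried in both lists of one heartbeat response. It only detects the overlap described above. The class {{HeartbeatOverlapCheck}} and its {{overlap}} method are hypothetical and not part of YARN, and container IDs are modelled as plain strings as a simplifying assumption.

```java
import java.util.Arrays;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Set;

// Hypothetical helper, not part of YARN: finds container IDs that a single
// heartbeat response carries in both lists.
final class HeartbeatOverlapCheck {

  private HeartbeatOverlapCheck() {}

  // Returns the IDs present in both lists, in containersToCleanUp order.
  static Set<String> overlap(List<String> containersToCleanUp,
                             List<String> containersToBeRemovedFromNM) {
    Set<String> both = new LinkedHashSet<>(containersToCleanUp);
    both.retainAll(containersToBeRemovedFromNM);
    return both;
  }

  public static void main(String[] args) {
    // Illustrative IDs only; real container IDs have a different shape.
    List<String> cleanUp = Arrays.asList("container_1", "container_2");
    List<String> remove = Arrays.asList("container_2", "container_3");
    System.out.println("Sent in both lists: " + overlap(cleanUp, remove));
  }
}
```

A non-empty result for any heartbeat would indicate the situation I suspect leads to the leak.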

CC: [~jlowe]. Basically, I feel this is a bug in the RM: whenever finished containers are
received from the NM and rmContainer is null, the RM should inform the RMNode. 
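
A rough sketch of the behaviour I am suggesting, assuming a much simplified model of the RM: when a finished container arrives from the NM and the RM has no matching container, the ID is recorded for reporting back to the node. {{FinishedContainerHandler}} and all of its members are hypothetical names for illustration, not existing YARN classes or methods.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not YARN code: what the RM could do when a finished
// container reported by the NM is unknown to it (the "rmContainer is null" case).
final class FinishedContainerHandler {
  // IDs of containers the RM still tracks (stand-in for the rmContainer lookup).
  private final Set<String> trackedContainers = new HashSet<>();
  // IDs the RM would report back to the node so the NM can drop them.
  private final List<String> reportedBackToNode = new ArrayList<>();

  void track(String containerId) {
    trackedContainers.add(containerId);
  }

  // Called for each finished container received from the NM.
  void onFinishedFromNode(String containerId) {
    boolean known = trackedContainers.remove(containerId);
    if (!known) {
      // The RM has no record of this container: inform the node instead of
      // silently ignoring the report.
      reportedBackToNode.add(containerId);
    }
    // For a known container, normal completion handling would go here.
  }

  List<String> reportedBackToNode() {
    return reportedBackToNode;
  }
}
```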


And for this JIRA, I think the current patch approach should be fine if we fix the
above-mentioned issue. Thoughts?

> Handle duplicate completed containers in RMNodeImpl
> ---------------------------------------------------
>
>                 Key: YARN-4862
>                 URL: https://issues.apache.org/jira/browse/YARN-4862
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>            Reporter: Rohith Sharma K S
>            Assignee: Rohith Sharma K S
>         Attachments: 0001-YARN-4862.patch, 0002-YARN-4862.patch
>
>
> As per [comment|https://issues.apache.org/jira/browse/YARN-4852?focusedCommentId=15209689&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-15209689]
from [~sharadag], there should be a safeguard against duplicated container statuses in RMNodeImpl
before creating UpdatedContainerInfo. 
> Otherwise, in a heavily loaded cluster where event processing gradually slows down, if any
duplicated containers are sent to the RM (possibly due to a bug in the NM as well), there is a
significant impact: RMNodeImpl always creates UpdatedContainerInfo for the duplicated containers.
This increases heap memory usage and causes problems like YARN-4852.
> This is an optimization for issues of the YARN-4852 kind.
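
The safeguard described in the issue text above could look roughly like the following sketch. It is not the attached patch. It assumes a simplified model in which container IDs are strings and a set of already-reported IDs is kept per node; {{CompletedContainerFilter}}, {{filterNew}} and {{forget}} are hypothetical names.

```java
import java.util.ArrayList;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical model, not the YARN-4862 patch: drops completed-container
// reports that were already seen, so no update object is built for duplicates.
final class CompletedContainerFilter {
  // IDs of completed containers the node has already reported.
  private final Set<String> seenCompleted = new HashSet<>();

  // Returns only the completed-container IDs that were not reported before.
  List<String> filterNew(List<String> reportedCompleted) {
    List<String> fresh = new ArrayList<>();
    for (String id : reportedCompleted) {
      if (seenCompleted.add(id)) { // add() returns false for a duplicate
        fresh.add(id);
      }
    }
    return fresh;
  }

  // Forgets an ID once the container is fully released, keeping the set bounded.
  void forget(String containerId) {
    seenCompleted.remove(containerId);
  }
}
```

Only the IDs returned by {{filterNew}} would go on to produce an update object, so repeated reports of the same container add nothing to the heap.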



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


