hadoop-yarn-issues mailing list archives

From "qiuliang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-9437) RMNodeImpls occupy too much memory and causes RM GC to take a long time
Date Sun, 05 May 2019 03:36:01 GMT

    [ https://issues.apache.org/jira/browse/YARN-9437?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16811738#comment-16811738 ]

qiuliang edited comment on YARN-9437 at 5/5/19 3:35 AM:
--------------------------------------------------------

As I understand it, there are two cases in which the completedContainers in RMNodeImpl
may never be released.
1. When RMAppAttemptImpl receives a CONTAINER_FINISHED event for a container other than the
amContainer, it adds the container to justFinishedContainers. When it processes an AM
heartbeat, RMAppAttemptImpl first sends the containers in finishedContainersSentToAM to the
NM, and RMNodeImpl removes those containers from completedContainers. It then moves the
containers in justFinishedContainers to finishedContainersSentToAM, where they wait for the
next AM heartbeat before being sent to the NM. If RMAppAttemptImpl receives the AM
unregistration event while justFinishedContainers is not empty, those containers may never
be moved to finishedContainersSentToAM, so they are never sent to the NM and RMNodeImpl
never releases them.
2. When RMAppAttemptImpl is already in a final state and receives a CONTAINER_FINISHED event,
it only adds the container to justFinishedContainers and does not send it to the NM.
For the first case, my idea is that when RMAppAttemptImpl handles the amContainer finished
event, it also moves the containers in justFinishedContainers to finishedContainersSentToAM
and sends them to the NM along with the amContainer. I am not sure whether this has any other
impact. For the second case, when RMAppAttemptImpl is in a final state and receives a
CONTAINER_FINISHED event, it could send the container directly to the NM, but I am worried
that this would generate many events.
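
To make the two cases easier to follow, here is a minimal sketch in Java. It is a
simplified, hypothetical model and not the real YARN code: NodeModel, AttemptModel, the
applyProposedFixes flag, and all method names are assumptions made up for illustration.
"Sending to the NM" is reduced to removing entries from a set, and the model assumes the
already-sent list is flushed when the amContainer finishes. Only the two list names and
the ordering of steps come from the description above.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of the bookkeeping described above.
// None of these classes or methods are the real YARN implementation.
class ContainerLeakSketch {

    /** Stands in for RMNodeImpl: keeps completed containers until released. */
    static class NodeModel {
        final Set<String> completedContainers = new HashSet<>();

        void containerCompleted(String id) {
            completedContainers.add(id);
        }

        // Models "send to NM": the node side drops the containers it is told about.
        void release(List<String> ids) {
            completedContainers.removeAll(ids);
        }
    }

    /** Stands in for RMAppAttemptImpl and its two lists. */
    static class AttemptModel {
        final List<String> justFinishedContainers = new ArrayList<>();
        final List<String> finishedContainersSentToAM = new ArrayList<>();
        final NodeModel node;
        final boolean applyProposedFixes; // assumption: toggles both proposals
        boolean inFinalState = false;

        AttemptModel(NodeModel node, boolean applyProposedFixes) {
            this.node = node;
            this.applyProposedFixes = applyProposedFixes;
        }

        void containerFinished(String id) {
            node.containerCompleted(id);
            if (inFinalState && applyProposedFixes) {
                // Proposal for case 2: release directly instead of queueing.
                node.release(Collections.singletonList(id));
                return;
            }
            justFinishedContainers.add(id);
        }

        void amHeartbeat() {
            // First release what the AM was told about on the previous heartbeat...
            node.release(finishedContainersSentToAM);
            finishedContainersSentToAM.clear();
            // ...then stage the newly finished containers for the next heartbeat.
            finishedContainersSentToAM.addAll(justFinishedContainers);
            justFinishedContainers.clear();
        }

        void amContainerFinished() {
            if (applyProposedFixes) {
                // Proposal for case 1: flush pending containers with the amContainer.
                finishedContainersSentToAM.addAll(justFinishedContainers);
                justFinishedContainers.clear();
            }
            node.release(finishedContainersSentToAM);
            finishedContainersSentToAM.clear();
            inFinalState = true;
        }
    }

    /** Case 1: a container finishes between the last heartbeat and AM unregistration. */
    static int leakedInCase1(boolean applyProposedFixes) {
        NodeModel node = new NodeModel();
        AttemptModel attempt = new AttemptModel(node, applyProposedFixes);
        attempt.containerFinished("c1");
        attempt.amHeartbeat();           // c1 is staged in finishedContainersSentToAM
        attempt.containerFinished("c2"); // c2 waits in justFinishedContainers
        attempt.amContainerFinished();   // no further heartbeat will arrive
        return node.completedContainers.size();
    }

    /** Case 2: a container finishes after the attempt reached a final state. */
    static int leakedInCase2(boolean applyProposedFixes) {
        NodeModel node = new NodeModel();
        AttemptModel attempt = new AttemptModel(node, applyProposedFixes);
        attempt.amContainerFinished();
        attempt.containerFinished("c3");
        return node.completedContainers.size();
    }

    public static void main(String[] args) {
        System.out.println("case 1 without fix: " + leakedInCase1(false));
        System.out.println("case 1 with fix:    " + leakedInCase1(true));
        System.out.println("case 2 without fix: " + leakedInCase2(false));
        System.out.println("case 2 with fix:    " + leakedInCase2(true));
    }
}
```

In this model each scenario leaves one entry behind in completedContainers without the
proposed changes and none with them. The sketch does not capture the concern about extra
events in the second case, since it has no event dispatcher.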


> RMNodeImpls occupy too much memory and causes RM GC to take a long time
> -----------------------------------------------------------------------
>
>                 Key: YARN-9437
>                 URL: https://issues.apache.org/jira/browse/YARN-9437
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.9.1
>            Reporter: qiuliang
>            Priority: Minor
>         Attachments: 1.png, 2.png, 3.png, YARN-9437-v1.txt
>
>
> We use hadoop-2.9.1 in our production environment with 1600+ nodes. 95.63% of RM memory
> is occupied by RMNodeImpl objects. Analysis of RM memory found that each RMNodeImpl takes
> approximately 14M. The reason is that each RMNodeImpl holds 130,000+ completedContainers
> that have not been released.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
