hadoop-yarn-issues mailing list archives

From "Rohith Sharma K S (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-4852) Resource Manager Ran Out of Memory
Date Thu, 24 Mar 2016 02:05:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-4852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15209583#comment-15209583 ]

Rohith Sharma K S commented on YARN-4852:

Thanks for pointing out the duplicated container statuses stored in UpdatedContainerInfo. This
reminds me of ticket YARN-2997, which has already been resolved.

The scenario is that the NM keeps completed containers in NMContext until the RM, in its
heartbeat response, notifies the NM to remove them. On every heartbeat the statuses of these
containers (pendingCompletedContainers) are sent to the RM, so the same status can be sent more
than once. On the RM side, no check for duplicate entries is done while creating
UpdatedContainerInfo. The duplicates keep accumulating when scheduler event processing is slow.
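
To illustrate the accumulation described above, here is a minimal sketch. It is not the
actual RMNodeImpl code: the class NodeUpdateQueueSketch, its methods, and the container IDs
are hypothetical names made up for this example, and the "dedup" switch is just one possible
way to filter repeats on the RM side. It assumes the scheduler never drains the queue during
the simulated heartbeats, which stands in for slow scheduler event processing.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

// Hypothetical, simplified model of a per-node update queue on the RM side.
class NodeUpdateQueueSketch {
    // Stand-in for nodeUpdateQueue: one entry (list of container IDs) per heartbeat.
    private final List<List<String>> queue = new ArrayList<>();
    // IDs of completed containers that were already queued once.
    private final Set<String> alreadyQueued = new HashSet<>();
    private final boolean dedup;

    NodeUpdateQueueSketch(boolean dedup) {
        this.dedup = dedup;
    }

    // Called once per NM heartbeat with the completed containers the NM reports.
    void onHeartbeat(List<String> completedContainerIds) {
        List<String> entry = new ArrayList<>();
        for (String id : completedContainerIds) {
            // Set.add returns false if the ID was queued before, so repeats are skipped.
            if (!dedup || alreadyQueued.add(id)) {
                entry.add(id);
            }
        }
        if (!entry.isEmpty()) {
            queue.add(entry);
        }
    }

    // Models the NM dropping containers from its context, so their IDs can be forgotten.
    void onContainersRemovedFromNode(List<String> ids) {
        alreadyQueued.removeAll(ids);
    }

    // Total number of container statuses currently held in the queue.
    int queuedStatusCount() {
        int n = 0;
        for (List<String> entry : queue) {
            n += entry.size();
        }
        return n;
    }

    // The NM reports the same three completed containers on every heartbeat,
    // and nothing drains the queue in between.
    static int simulate(boolean dedup, int heartbeats) {
        NodeUpdateQueueSketch node = new NodeUpdateQueueSketch(dedup);
        List<String> pending = Arrays.asList("container_1", "container_2", "container_3");
        for (int i = 0; i < heartbeats; i++) {
            node.onHeartbeat(pending);
        }
        return node.queuedStatusCount();
    }

    public static void main(String[] args) {
        System.out.println("without dedup: " + simulate(false, 100)); // 300
        System.out.println("with dedup:    " + simulate(true, 100));  // 3
    }
}
```

Without the check, the number of retained statuses grows with the number of heartbeats rather
than with the number of completed containers, which would be consistent with the queue sizes
seen in the heap dump.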

> Resource Manager Ran Out of Memory
> ----------------------------------
>                 Key: YARN-4852
>                 URL: https://issues.apache.org/jira/browse/YARN-4852
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.6.0
>            Reporter: Gokul
>         Attachments: threadDump.log
> The Resource Manager ran out of memory (max heap size: 8 GB, CMS GC) and shut itself down.
> Heap dump analysis reveals that 1200 instances of the RMNodeImpl class hold 86% of memory.
> Digging deeper, there are around 0.5 million UpdatedContainerInfo objects (nodeUpdateQueue
> inside RMNodeImpl). These in turn contain around 1.7 million objects of
> YarnProtos$ContainerIdProto, ContainerStatusProto, ApplicationAttemptIdProto and
> ApplicationIdProto, each of which retains around 1 GB of heap.
> Back-to-back full GCs kept happening. GC was not able to recover any heap and the RM went OOM.
> The JVM dumped the heap before quitting, and we analyzed that dump.
> The RM's usual heap usage is around 4 GB, but it suddenly spiked to 8 GB within 20 minutes and
> the RM went OOM.
> There was no spike in job submissions or container counts at the time the issue occurred.

This message was sent by Atlassian JIRA
