Date: Wed, 13 Jul 2016 16:50:20 +0000 (UTC)
From: "Manikandan R (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5370) Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM

[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375339#comment-15375339 ]

Manikandan R commented on YARN-5370:
------------------------------------

To work around this issue, we first set yarn.nodemanager.delete.debug-delay-sec to a very low value (zero seconds), assuming it might clear off the existing scheduled deletion tasks. It didn't: the new value applies only to tasks scheduled after the change, not to tasks that have already been scheduled.

We then found that the canRecover() method is called during service start. Recovery pulls the task info from the NM recovery directory (on the local filesystem) and rebuilds all of it in memory, which both delays service startup and consumes a large amount of heap.

Finally, we moved the contents of the NM recovery directory somewhere else. From that point onwards, the NM started smoothly and worked as expected.

I think logging a warning when this value is very high (for example, 100+ days), indicating that it can potentially crash the NM, would save a significant amount of troubleshooting time. A few sketches of the behaviour described above follow.
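To make the first point concrete, here is a minimal, self-contained sketch using plain java.util.concurrent (not the actual DeletionService code): the delay is captured once at schedule time, so lowering the configured value later never touches tasks already sitting in the scheduler's queue.

{code:java}
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Minimal sketch, NOT the real DeletionService: a scheduled task keeps the
// delay it was given at schedule time, regardless of any later change to the
// configuration value the delay was derived from.
public class DelayCaptureSketch {
    public static void main(String[] args) {
        ScheduledThreadPoolExecutor sched = new ScheduledThreadPoolExecutor(1);

        long debugDelaySec = 100L * 24 * 3600;            // config read once: 100 days
        sched.schedule(() -> System.out.println("deleting path"),
                       debugDelaySec, TimeUnit.SECONDS);  // delay is now baked into this task

        debugDelaySec = 0;  // "lowering the config" afterwards: the queued task is unaffected
        System.out.println("still queued: " + sched.getQueue().size()); // prints 1

        sched.shutdownNow(); // drains the queue without running the pending task
    }
}
{code}

Every such queued task is a live object on the heap, which is exactly what shows up under DelServiceSchedThreadPoolExecutor in the heap dump quoted below.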
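The recovery path follows the same pattern. The sketch below is only an illustration; StoredDeletionTask and the surrounding names are hypothetical stand-ins for whatever the NM state store persists, not real Hadoop classes:

{code:java}
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hedged sketch of the recovery pattern described above. StoredDeletionTask
// is a hypothetical stand-in for the persisted task record, not a real class.
public class RecoverySketch {
    static class StoredDeletionTask {
        String path;
        long deletionTimeMs; // absolute deletion time recorded when first scheduled
    }

    // Re-schedules every persisted task. Each one becomes a live object in the
    // executor's queue, so heap usage grows linearly with the backlog; with a
    // 100+ day delay, nothing ever leaves the queue.
    static void recover(List<StoredDeletionTask> pending, ScheduledThreadPoolExecutor sched) {
        long now = System.currentTimeMillis();
        for (StoredDeletionTask t : pending) {
            long delayMs = Math.max(0, t.deletionTimeMs - now);
            sched.schedule(() -> System.out.println("delete " + t.path),
                           delayMs, TimeUnit.MILLISECONDS);
        }
    }
}
{code}

This also explains the workaround: clearing the recovery directory removes the persisted backlog, so there is nothing left to rebuild at the next start.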
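Finally, the proposed warning could look roughly like the following. The 30-day threshold is an arbitrary illustrative choice, not an existing Hadoop constant; only the property name comes from the actual configuration:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the proposed startup warning. The threshold is an assumption
// made for illustration purposes.
public class DebugDelayWarningSketch {
    private static final Logger LOG = LoggerFactory.getLogger(DebugDelayWarningSketch.class);
    private static final long WARN_THRESHOLD_SEC = 30L * 24 * 3600; // 30 days, arbitrary

    static void warnIfDelayTooHigh(Configuration conf) {
        int delaySec = conf.getInt("yarn.nodemanager.delete.debug-delay-sec", 0);
        if (delaySec > WARN_THRESHOLD_SEC) {
            LOG.warn("yarn.nodemanager.delete.debug-delay-sec is {} seconds (~{} days); "
                    + "pending deletion tasks will pile up in the NM recovery store and heap "
                    + "and can crash the NodeManager with OOM",
                    delaySec, delaySec / (24 * 3600));
        }
    }
}
{code}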
> Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-5370
>                 URL: https://issues.apache.org/jira/browse/YARN-5370
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Manikandan R
>
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev cluster about 3-4 weeks ago, for certain reasons. After setting this up, the NM crashes at times because of OOM, so as a temporary fix I kept increasing the heap gradually from 512 MB to 6 GB whenever a crash occurred.
> Sometimes the NM won't start smoothly and only comes up after multiple tries. While analyzing a heap dump of the corresponding JVM, I found that the DeletionService's scheduling executor occupies almost 99% of the total allocated memory (-Xmx):
>
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor @ 0x6c1d09068 | 80 | 3,544,094,696 | 99.13%
>
> Basically, a huge number of the above-mentioned deletion tasks are scheduled. Usually I see NM memory requirements of 2-4 GB for large clusters; in my case the cluster is very small and OOM still occurs.
> Is this expected behaviour? Or is there any limit we could enforce on yarn.nodemanager.delete.debug-delay-sec to avoid this kind of issue?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)