hadoop-yarn-issues mailing list archives

From "Manikandan R (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5370) Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
Date Wed, 13 Jul 2016 16:50:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375339#comment-15375339 ]

Manikandan R commented on YARN-5370:

To resolve this, we first tried setting yarn.nodemanager.delete.debug-delay-sec to a very low
value (zero seconds), assuming it would clear off the existing scheduled deletion tasks.
It didn't: the new value is not applied to tasks that have already been scheduled. We then
learned that the canRecover() method is called during service start; it pulls the info from
the NM recovery directory (on the local filesystem) and rebuilds all of that state in memory,
which is what was blocking service startup and consuming so much memory. We then moved the
contents of the NM recovery directory to some other place. From that point onwards, the NM
started smoothly and worked as expected. I think a warning about such a high value (for ex,
100+ days) somewhere visible (for ex, in the logs), indicating that it can crash the NM,
would save a significant amount of time troubleshooting this issue.
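
For illustration, here is a minimal sketch of the kind of warning proposed above. This is a
hypothetical helper, not the actual DeletionService code; the class name, the log wording, and
the 7-day threshold are all assumptions for the example:

import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Hypothetical sketch of the proposed startup warning; not the actual
// DeletionService code. The 7-day threshold is an arbitrary example.
public class DeleteDelayCheck {
  private static final Logger LOG =
      LoggerFactory.getLogger(DeleteDelayCheck.class);

  public static void warnIfDelayTooHigh(Configuration conf) {
    int debugDelaySec =
        conf.getInt("yarn.nodemanager.delete.debug-delay-sec", 0);
    long sevenDaysSec = 7L * 24 * 60 * 60;
    if (debugDelaySec > sevenDaysSec) {
      LOG.warn("yarn.nodemanager.delete.debug-delay-sec is {} seconds;"
          + " recovered deletion tasks are kept in memory for the full"
          + " delay and can exhaust the NodeManager heap (YARN-5370).",
          debugDelaySec);
    }
  }
}

A check like this could run once during service initialization, so the warning lands in the
NM log before recovery starts rebuilding the scheduled deletions in memory.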

> Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
> ------------------------------------------------------------------------------------------
>                 Key: YARN-5370
>                 URL: https://issues.apache.org/jira/browse/YARN-5370
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Manikandan R
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev cluster for some
> reasons, about 3-4 weeks ago. After setting this up, the NM crashes at times because of
> OOM. As a temporary fix, I kept increasing the heap from 512 MB to 6 GB gradually over
> the past few weeks as crashes occurred. Sometimes it won't start smoothly, and it begins
> functioning only after multiple tries. While analyzing the heap dump of the corresponding
> JVM, I found that DeletionService.java occupies almost 99% of the total allocated memory
> (-Xmx), something like this:
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor @ 0x6c1d09068 | 80 | 3,544,094,696 | 99.13%
> Basically, a huge number of the above-mentioned tasks are scheduled for deletion. Usually
> I see NM memory requirements of 2-4 GB for large clusters; in my case the cluster is very
> small, yet OOM still occurs.
> Is this expected behaviour? Or is there a limit we can impose on
> yarn.nodemanager.delete.debug-delay-sec to avoid this kind of issue?
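
To make the failure mode above concrete, the following standalone sketch (plain JDK, not
Hadoop code) shows how a ScheduledThreadPoolExecutor retains every delayed task in its work
queue until the delay expires; with a 100+ day delay, recovered deletion tasks can only
accumulate:

import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Standalone illustration: every task scheduled with a delay sits in
// the executor's internal queue until the delay expires, so a very
// large delay means effectively unbounded accumulation.
public class DelayedTaskPileUp {
  public static void main(String[] args) {
    ScheduledThreadPoolExecutor pool = new ScheduledThreadPoolExecutor(4);
    long hundredDaysSec = 100L * 24 * 60 * 60;
    for (int i = 0; i < 1_000_000; i++) {
      // Each no-op "deletion" task is retained for 100 days.
      pool.schedule(() -> { /* file deletion would go here */ },
          hundredDaysSec, TimeUnit.SECONDS);
    }
    // All one million tasks are now held in the queue; none can run
    // for 100 days, so their memory cannot be reclaimed.
    System.out.println("Queued tasks: " + pool.getQueue().size());
    pool.shutdownNow(); // drop the queued tasks and exit
  }
}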

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org
