Date: Wed, 13 Jul 2016 16:50:20 +0000 (UTC)
From: "Manikandan R (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5370) Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM

[ https://issues.apache.org/jira/browse/YARN-5370?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15375339#comment-15375339 ]

Manikandan R commented on YARN-5370:
------------------------------------

To work around this issue, we first set yarn.nodemanager.delete.debug-delay-sec to a very low value (zero seconds), assuming it might clear off the existing scheduled deletion tasks. It didn't: the new value applies only to tasks scheduled after the change, not to tasks that have already been scheduled.

We then found that the canRecover() method is called during service start. Recovery pulls the task info from the NM recovery directory (on the local filesystem) and rebuilds all of it in memory, which both delays service startup and consumes a large amount of heap.

Finally, we moved the contents of the NM recovery directory somewhere else. From that point onwards, the NM started smoothly and worked as expected.

I think logging a warning when this value is very high (for example, 100+ days), indicating that it can potentially crash the NM, would save a significant amount of troubleshooting time. A few sketches of the behaviour described above follow.
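To make the first point concrete, here is a minimal, self-contained sketch using plain java.util.concurrent (not the actual DeletionService code): the delay is captured once at schedule time, so lowering the configured value later never touches tasks already sitting in the scheduler's queue.

{code:java}
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Minimal sketch, NOT the real DeletionService: a scheduled task keeps the
// delay it was given at schedule time, regardless of any later change to the
// configuration value the delay was derived from.
public class DelayCaptureSketch {
    public static void main(String[] args) {
        ScheduledThreadPoolExecutor sched = new ScheduledThreadPoolExecutor(1);

        long debugDelaySec = 100L * 24 * 3600;            // config read once: 100 days
        sched.schedule(() -> System.out.println("deleting path"),
                       debugDelaySec, TimeUnit.SECONDS);  // delay is now baked into this task

        debugDelaySec = 0;  // "lowering the config" afterwards: the queued task is unaffected
        System.out.println("still queued: " + sched.getQueue().size()); // prints 1

        sched.shutdownNow(); // drains the queue without running the pending task
    }
}
{code}

Every such queued task is a live object on the heap, which is exactly what shows up under DelServiceSchedThreadPoolExecutor in the heap dump quoted below.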
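The recovery path follows the same pattern. The sketch below is only an illustration; StoredDeletionTask and the surrounding names are hypothetical stand-ins for whatever the NM state store persists, not real Hadoop classes:

{code:java}
import java.util.List;
import java.util.concurrent.ScheduledThreadPoolExecutor;
import java.util.concurrent.TimeUnit;

// Hedged sketch of the recovery pattern described above. StoredDeletionTask
// is a hypothetical stand-in for the persisted task record, not a real class.
public class RecoverySketch {
    static class StoredDeletionTask {
        String path;
        long deletionTimeMs; // absolute deletion time recorded when first scheduled
    }

    // Re-schedules every persisted task. Each one becomes a live object in the
    // executor's queue, so heap usage grows linearly with the backlog; with a
    // 100+ day delay, nothing ever leaves the queue.
    static void recover(List<StoredDeletionTask> pending, ScheduledThreadPoolExecutor sched) {
        long now = System.currentTimeMillis();
        for (StoredDeletionTask t : pending) {
            long delayMs = Math.max(0, t.deletionTimeMs - now);
            sched.schedule(() -> System.out.println("delete " + t.path),
                           delayMs, TimeUnit.MILLISECONDS);
        }
    }
}
{code}

This also explains the workaround: clearing the recovery directory removes the persisted backlog, so there is nothing left to rebuild at the next start.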
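Finally, the proposed warning could look roughly like the following. The 30-day threshold is an arbitrary illustrative choice, not an existing Hadoop constant; only the property name comes from the actual configuration:

{code:java}
import org.apache.hadoop.conf.Configuration;
import org.slf4j.Logger;
import org.slf4j.LoggerFactory;

// Sketch of the proposed startup warning. The threshold is an assumption
// made for illustration purposes.
public class DebugDelayWarningSketch {
    private static final Logger LOG = LoggerFactory.getLogger(DebugDelayWarningSketch.class);
    private static final long WARN_THRESHOLD_SEC = 30L * 24 * 3600; // 30 days, arbitrary

    static void warnIfDelayTooHigh(Configuration conf) {
        int delaySec = conf.getInt("yarn.nodemanager.delete.debug-delay-sec", 0);
        if (delaySec > WARN_THRESHOLD_SEC) {
            LOG.warn("yarn.nodemanager.delete.debug-delay-sec is {} seconds (~{} days); "
                    + "pending deletion tasks will pile up in the NM recovery store and heap "
                    + "and can crash the NodeManager with OOM",
                    delaySec, delaySec / (24 * 3600));
        }
    }
}
{code}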
> Setting yarn.nodemanager.delete.debug-delay-sec to high number crashes NM because of OOM
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-5370
>                 URL: https://issues.apache.org/jira/browse/YARN-5370
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Manikandan R
>
> I set yarn.nodemanager.delete.debug-delay-sec to 100+ days in my dev cluster about 3-4 weeks ago, for certain reasons. After setting this up, the NM crashes at times because of OOM, so as a temporary fix I kept increasing the heap gradually from 512 MB to 6 GB whenever a crash occurred.
> Sometimes the NM won't start smoothly and only comes up after multiple tries. While analyzing a heap dump of the corresponding JVM, I found that the DeletionService's scheduling executor occupies almost 99% of the total allocated memory (-Xmx):
>
> org.apache.hadoop.yarn.server.nodemanager.DeletionService$DelServiceSchedThreadPoolExecutor @ 0x6c1d09068 | 80 | 3,544,094,696 | 99.13%
>
> Basically, a huge number of the above-mentioned deletion tasks are scheduled. Usually I see NM memory requirements of 2-4 GB for large clusters; in my case the cluster is very small and OOM still occurs.
> Is this expected behaviour? Or is there any limit we could enforce on yarn.nodemanager.delete.debug-delay-sec to avoid this kind of issue?

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)