Return-Path: X-Original-To: archive-asf-public@cust-asf.ponee.io Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id A8B1A160C05 for ; Wed, 3 Jan 2018 20:33:04 +0100 (CET) Received: (qmail 79956 invoked by uid 500); 3 Jan 2018 19:33:03 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 79945 invoked by uid 99); 3 Jan 2018 19:33:03 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd2-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Wed, 03 Jan 2018 19:33:03 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd2-us-west.apache.org (ASF Mail Server at spamd2-us-west.apache.org) with ESMTP id 587C01A0758 for ; Wed, 3 Jan 2018 19:33:03 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd2-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -100.011 X-Spam-Level: X-Spam-Status: No, score=-100.011 tagged_above=-999 required=6.31 tests=[RCVD_IN_DNSWL_NONE=-0.0001, SPF_PASS=-0.001, T_RP_MATCHES_RCVD=-0.01, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd2-us-west.apache.org [10.40.0.9]) (amavisd-new, port 10024) with ESMTP id 1adbTspnB2MC for ; Wed, 3 Jan 2018 19:33:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 496405FAC9 for ; Wed, 3 Jan 2018 19:33:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id 6FC8CE065D for ; Wed, 3 Jan 2018 19:33:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 0D37A240DE for ; Wed, 3 Jan 2018 19:33:00 +0000 (UTC) Date: Wed, 3 Jan 2018 19:33:00 +0000 (UTC) From: "Xuan Gong (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (YARN-7697) NM goes down with OOM due to leak in log-aggregation MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Wed, 03 Jan 2018 19:33:05 -0000 [ https://issues.apache.org/jira/browse/YARN-7697?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16310156#comment-16310156 ] Xuan Gong commented on YARN-7697: --------------------------------- The issue happens after file truncate process. Looks like the truncate API return false instead of throw exception, so we still read the corrupted aggregated log. In the process of reading logs, we would allocate a byte array {code} byte[] array = new byte[offset]; // this line throws OOM fsDataIStream.seek( fileLength - offset - Integer.SIZE/ Byte.SIZE - UUID_LENGTH); {code} So, the offset is in-correct, and probably a invalid big value, we could get OOM in NM. > NM goes down with OOM due to leak in log-aggregation > ---------------------------------------------------- > > Key: YARN-7697 > URL: https://issues.apache.org/jira/browse/YARN-7697 > Project: Hadoop YARN > Issue Type: Bug > Reporter: Santhosh B Gowda > Assignee: Xuan Gong > > 2017-12-29 01:43:50,601 FATAL yarn.YarnUncaughtExceptionHandler (YarnUncaughtExceptionHandler.java:uncaughtException(51)) - Thread Thread[LogAggregationService #0,5,main] threw an Error. Shutting down now... > java.lang.OutOfMemoryError: Java heap space > at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.loadIndexedLogsMeta(LogAggregationIndexedFileController.java:823) > at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.loadIndexedLogsMeta(LogAggregationIndexedFileController.java:840) > at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriterInRolling(LogAggregationIndexedFileController.java:293) > at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.access$600(LogAggregationIndexedFileController.java:98) > at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController$1.run(LogAggregationIndexedFileController.java:216) > at java.security.AccessController.doPrivileged(Native Method) > at javax.security.auth.Subject.doAs(Subject.java:422) > at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1866) > at org.apache.hadoop.yarn.logaggregation.filecontroller.ifile.LogAggregationIndexedFileController.initializeWriter(LogAggregationIndexedFileController.java:197) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.uploadLogsForContainers(AppLogAggregatorImpl.java:205) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.doAppLogAggregation(AppLogAggregatorImpl.java:312) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.AppLogAggregatorImpl.run(AppLogAggregatorImpl.java:284) > at org.apache.hadoop.yarn.server.nodemanager.containermanager.logaggregation.LogAggregationService$1.run(LogAggregationService.java:262) > at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149) > at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624) > at java.lang.Thread.run(Thread.java:748) > 2017-12-29 01:43:50,601 INFO application.ApplicationImpl (ApplicationImpl.java:handle(464)) - Application ap -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: yarn-issues-help@hadoop.apache.org