Date: Tue, 5 Mar 2013 13:05:20 +0000 (UTC)
From: "Hudson (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Subject: [jira] [Commented] (MAPREDUCE-5043) Fetch failure processing can cause AM event queue to backup and eventually OOM

    [ https://issues.apache.org/jira/browse/MAPREDUCE-5043?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13593384#comment-13593384 ]

Hudson commented on MAPREDUCE-5043:
-----------------------------------

Integrated in Hadoop-Hdfs-trunk #1335 (See [https://builds.apache.org/job/Hadoop-Hdfs-trunk/1335/])
MAPREDUCE-5043. Fetch failure processing can cause AM event queue to backup and eventually OOM (Jason Lowe via bobby) (Revision 1452372)

     Result = SUCCESS
bobby : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1452372
Files :
* /hadoop/common/trunk/hadoop-mapreduce-project/CHANGES.txt
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/TaskAttempt.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/JobImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/main/java/org/apache/hadoop/mapreduce/v2/app/job/impl/TaskAttemptImpl.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/MockJobs.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-app/src/test/java/org/apache/hadoop/mapreduce/v2/app/TestRuntimeEstimators.java
* /hadoop/common/trunk/hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-hs/src/main/java/org/apache/hadoop/mapreduce/v2/hs/CompletedTaskAttempt.java

> Fetch failure processing can cause AM event queue to backup and eventually OOM
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-5043
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-5043
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mr-am
>    Affects Versions: 0.23.7, 2.0.4-beta
>            Reporter: Jason Lowe
>            Assignee: Jason Lowe
>            Priority: Blocker
>             Fix For: 3.0.0, 0.23.7, 2.0.4-beta
>
>         Attachments: MAPREDUCE-5043.patch
>
>
> Saw an MRAppMaster with a 3G heap OOM. While investigating another running instance, we saw the UI in an inconsistent state: the task table and task attempt tables on the job overview page did not agree. The AM log showed the AsyncDispatcher had hundreds of thousands of events in its queue, and jstacks showed it spending a lot of time in fetch failure processing. It turns out fetch failure processing is currently *very* expensive, with a triple {{for}} loop where the inner loop calls the quite-expensive {{TaskAttempt.getReport}}. That function ends up type-converting the entire task report, counters and all, and performing locale conversions among other things. It does this for every reduce task in the job, for every map task that failed. And when it is done building up the large task report, it pulls out a single field, the phase, then throws the report away.
> While the AM is busy processing fetch failures, task attempts continue to send events to the AM, including memory-expensive events such as status updates, which carry the counters. These back up in the AsyncDispatcher event queue, and eventually even an AM with a large heap will run out of memory and crash, or expire because it is thrashing in garbage collection.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators.
For more information on JIRA, see: http://www.atlassian.com/software/jira
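
To make the cost described in the report concrete, here is a minimal Java sketch; it is not the actual MAPREDUCE-5043 patch. It contrasts the per-attempt {{TaskAttempt.getReport().getPhase()}} pattern with a narrow phase accessor of the kind the changed-file list suggests was added to {{TaskAttempt}}. The class name, the helper method names, and {{TaskAttempt.getPhase()}} are assumptions for illustration; the {{Job}}, {{Task}}, and {{TaskAttempt}} interfaces are the MR AM's existing job model.

{code:java}
import org.apache.hadoop.mapreduce.v2.api.records.Phase;
import org.apache.hadoop.mapreduce.v2.api.records.TaskType;
import org.apache.hadoop.mapreduce.v2.app.job.Job;
import org.apache.hadoop.mapreduce.v2.app.job.Task;
import org.apache.hadoop.mapreduce.v2.app.job.TaskAttempt;

// Illustrative sketch only; not the actual MAPREDUCE-5043 patch.
class FetchFailureCostSketch {

  // Expensive pattern described above: for every reduce attempt, getReport()
  // type-converts the whole task attempt report (counters, locale-converted
  // strings, and more), and everything except the phase is then thrown away.
  // Fetch failure handling repeats a scan like this once per failed map.
  static int countShufflingReducersSlow(Job job) {
    int shuffling = 0;
    for (Task task : job.getTasks(TaskType.REDUCE).values()) {
      for (TaskAttempt attempt : task.getAttempts().values()) {
        if (attempt.getReport().getPhase() == Phase.SHUFFLE) { // builds the full report
          shuffling++;
        }
      }
    }
    return shuffling;
  }

  // Cheaper shape: expose the one field directly on TaskAttempt so the inner
  // loop builds no report at all. getPhase() is an assumed narrow accessor,
  // consistent with TaskAttempt.java appearing in the changed-file list.
  static int countShufflingReducersFast(Job job) {
    int shuffling = 0;
    for (Task task : job.getTasks(TaskType.REDUCE).values()) {
      for (TaskAttempt attempt : task.getAttempts().values()) {
        if (attempt.getPhase() == Phase.SHUFFLE) { // no report materialization
          shuffling++;
        }
      }
    }
    return shuffling;
  }
}
{code}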