Message-ID: <464440580.1227550304454.JavaMail.jira@brutus>
Date: Mon, 24 Nov 2008 10:11:44 -0800 (PST)
From: "Christian Kunz (JIRA)"
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4714) map tasks timing out during merge phase
In-Reply-To: <519168781.1227508784154.JavaMail.jira@brutus>

    [ https://issues.apache.org/jira/browse/HADOOP-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650260#action_12650260 ]

Christian Kunz commented on HADOOP-4714:
----------------------------------------

io.sort.mb=500
Avg size of record is 276 B. There are some bad outliers of up to 3 MB, but they are too infrequent to explain the failure to report progress.

I checked the full syslog of one of the tasks. The last merge started exactly 20 minutes (the configured timeout) before the time of failure, i.e. no progress was reported at all. I am not familiar with progress reporting, but does progress() in writeFile() just set a flag, possibly with no further effect?

When checking the log of a successful task I noticed that the final merge lasted longer than 20 minutes, i.e. this task did report progress. According to the TaskTracker log, however, no progress was reported for the first 18 minutes of the merge phase (before that, it was reported every few seconds), i.e. with the default timeout of 10 minutes this task attempt would have failed as well.

2008-11-24 08:39:13,142 INFO org.apache.hadoop.mapred.MapTask: Finished spill 12
2008-11-24 08:39:16,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:16,681 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:16,832 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:17,020 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:17,302 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:17,995 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,109 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,360 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,487 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,844 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,016 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,081 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,119 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,350 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:39:20,240 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 111126 bytes
2008-11-24 08:39:20,338 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 194236, 96533)
2008-11-24 08:39:20,989 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:39:21,343 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 115642 bytes
2008-11-24 08:39:21,381 INFO org.apache.hadoop.mapred.MapTask: Index: (96533, 199588, 100312)
2008-11-24 08:39:21,427 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:39:21,864 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 126500 bytes
...
2008-11-24 08:59:10,877 INFO org.apache.hadoop.mapred.MapTask: Index: (1318384976, 240135, 108120)
2008-11-24 08:59:10,899 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:59:11,057 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 109385 bytes
2008-11-24 08:59:11,798 WARN org.apache.hadoop.mapred.TaskRunner: Parent died.  Exiting attempt_200811221852_0001_m_099999_0

> map tasks timing out during merge phase
> ---------------------------------------
>
>                 Key: HADOOP-4714
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4714
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.1
>            Reporter: Christian Kunz
>
> With compression of transient data turned on, some parts of the merge phase seem not to report progress often enough.
> We see a lot of task failures during the merge phase, most of them timing out (even with a 20 min timeout)

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
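
The timing argument in the comment above can be checked directly from the quoted syslog timestamps. The following is a minimal sketch in Python, not part of Hadoop; the constant and function names are hypothetical, and the only inputs are the two timestamps and the timeout values stated in the comment (20 minutes configured, 10 minutes default, 18 minutes without reported progress in the successful task).

```python
from datetime import datetime

# Timestamps copied from the task syslog quoted in the comment.
FINISHED_LAST_SPILL = "2008-11-24 08:39:13,142"
PARENT_DIED = "2008-11-24 08:59:11,798"

# Values stated in the comment, converted to seconds.
CONFIGURED_TIMEOUT_S = 20 * 60   # timeout configured for this job
DEFAULT_TIMEOUT_S = 10 * 60      # default timeout named in the comment
QUIET_PERIOD_S = 18 * 60         # silent stretch seen in the successful task


def parse_log_time(stamp):
    """Parse a timestamp such as '2008-11-24 08:39:13,142'."""
    return datetime.strptime(stamp, "%Y-%m-%d %H:%M:%S,%f")


def seconds_between(start, end):
    """Elapsed seconds between two log timestamps."""
    return (parse_log_time(end) - parse_log_time(start)).total_seconds()


elapsed = seconds_between(FINISHED_LAST_SPILL, PARENT_DIED)
print(f"time from last spill to attempt exit: {elapsed:.3f} s")
print(f"configured timeout:                   {CONFIGURED_TIMEOUT_S} s")
print(f"silent period exceeds default timeout: {QUIET_PERIOD_S > DEFAULT_TIMEOUT_S}")
```

The elapsed time comes out within a couple of seconds of the configured 20 minute timeout, which is consistent with the observation that no progress was reported after the merge began. The last line restates the second observation: an 18 minute silent period would have exceeded a 10 minute timeout.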