Mailing-List: contact core-dev-help@hadoop.apache.org; run by ezmlm
Precedence: bulk
Reply-To: core-dev@hadoop.apache.org
Message-ID: <464440580.1227550304454.JavaMail.jira@brutus>
Date: Mon, 24 Nov 2008 10:11:44 -0800 (PST)
From: "Christian Kunz (JIRA)" <jira@apache.org>
To: core-dev@hadoop.apache.org
Subject: [jira] Commented: (HADOOP-4714) map tasks timing out during merge
 phase
In-Reply-To: <519168781.1227508784154.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


    [ https://issues.apache.org/jira/browse/HADOOP-4714?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12650260#action_12650260 ] 

Christian Kunz commented on HADOOP-4714:
----------------------------------------

io.sort.mb=500
Average record size is 276 B. There are some bad outliers of up to 3 MB, but they are too infrequent to explain the failure to report progress.

I checked the full syslog of one of the failed tasks. The last merge started exactly 20 minutes (the configured timeout) before the time of failure, i.e. no progress was reported at all during the merge. I am not familiar with progress reporting, but does progress() in writeFile() just set a flag, possibly with no further consequences?
In the log of a successful task I noticed that the final merge lasted longer than 20 minutes, so that task must have reported progress. However, according to the TaskTracker log, no progress was reported for the first 18 minutes of the merge phase (before the merge, progress was reported every few seconds). With the default timeout of 10 minutes, this task attempt would have failed as well.
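
To make the question about progress() concrete, here is a toy model of a flag-based reporter. It is only a sketch under an assumption: that progress() merely records that work happened, and that a separate poller forwards it to the parent. Whether writeFile() in 0.18.1 behaves this way is exactly the open question. The class FlagProgressReporter and its methods are hypothetical and are not Hadoop code.

```java
import java.util.concurrent.atomic.AtomicBoolean;

// Hypothetical model of flag-based progress reporting; not Hadoop's actual class.
class FlagProgressReporter {
    private final AtomicBoolean progressFlag = new AtomicBoolean(false);
    private long lastReportMillis;

    FlagProgressReporter(long startMillis) {
        this.lastReportMillis = startMillis;
    }

    // Called by the working thread. It only sets a flag; nothing is sent here.
    void progress() {
        progressFlag.set(true);
    }

    // Called periodically by a reporter thread. Progress reaches the parent
    // only if the flag was set since the previous poll.
    boolean poll(long nowMillis) {
        if (progressFlag.getAndSet(false)) {
            lastReportMillis = nowMillis;
            return true;
        }
        return false;
    }

    // Parent-side check: has the task been silent for at least the timeout?
    boolean timedOut(long nowMillis, long timeoutMillis) {
        return nowMillis - lastReportMillis >= timeoutMillis;
    }

    public static void main(String[] args) {
        long minute = 60_000L;

        // Task that first registers progress 18 minutes into the merge.
        FlagProgressReporter slow = new FlagProgressReporter(0L);
        System.out.println(slow.timedOut(10 * minute, 10 * minute)); // true: fails with a 10 min timeout
        System.out.println(slow.timedOut(10 * minute, 20 * minute)); // false: survives with a 20 min timeout
        slow.progress();
        slow.poll(18 * minute);
        System.out.println(slow.timedOut(21 * minute, 20 * minute)); // false: progress arrived in time

        // Task whose merge never sets the flag.
        FlagProgressReporter silent = new FlagProgressReporter(0L);
        System.out.println(silent.poll(5 * minute));                   // false: nothing to forward
        System.out.println(silent.timedOut(20 * minute, 20 * minute)); // true: killed at the timeout
    }
}
```

In this model a call to progress() has no effect unless the poller runs afterwards, which would match the behaviour seen here if the merge either never sets the flag or the poller is not serviced during the merge.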

2008-11-24 08:39:13,142 INFO org.apache.hadoop.mapred.MapTask: Finished spill 12
2008-11-24 08:39:16,383 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:16,681 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:16,832 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:17,020 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:17,302 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:17,995 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,109 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,360 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,487 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:18,844 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,016 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,081 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,119 INFO org.apache.hadoop.io.compress.CodecPool: Got brand-new decompressor
2008-11-24 08:39:19,350 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:39:20,240 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 111126 bytes
2008-11-24 08:39:20,338 INFO org.apache.hadoop.mapred.MapTask: Index: (0, 194236, 96533)
2008-11-24 08:39:20,989 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:39:21,343 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 115642 bytes
2008-11-24 08:39:21,381 INFO org.apache.hadoop.mapred.MapTask: Index: (96533, 199588, 100312)
2008-11-24 08:39:21,427 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:39:21,864 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 126500 bytes
...
2008-11-24 08:59:10,877 INFO org.apache.hadoop.mapred.MapTask: Index: (1318384976, 240135, 108120)
2008-11-24 08:59:10,899 INFO org.apache.hadoop.mapred.Merger: Merging 13 sorted segments
2008-11-24 08:59:11,057 INFO org.apache.hadoop.mapred.Merger: Down to the last merge-pass, with 13 segments left of total size: 109385 bytes
2008-11-24 08:59:11,798 WARN org.apache.hadoop.mapred.TaskRunner: Parent died.  Exiting attempt_200811221852_0001_m_099999_0
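
The gap between the start of the merge and the failure can be checked directly from the timestamps above. The following is a small sketch that parses the leading timestamp of two of the log lines and prints the elapsed time. The class name MergeGap and its helper methods are made up for illustration, and the code assumes every line starts with a 23-character timestamp in the format shown in the log.

```java
import java.time.Duration;
import java.time.LocalDateTime;
import java.time.format.DateTimeFormatter;

// Hypothetical helper for measuring the time between two task log lines.
class MergeGap {
    // Timestamp layout used in the log excerpt, e.g. "2008-11-24 08:39:19,350".
    private static final DateTimeFormatter TS =
            DateTimeFormatter.ofPattern("uuuu-MM-dd HH:mm:ss,SSS");

    // Parses the first 23 characters of a log line as its timestamp.
    static LocalDateTime timestampOf(String logLine) {
        return LocalDateTime.parse(logLine.substring(0, 23), TS);
    }

    // Elapsed time from the first log line to the second.
    static Duration between(String firstLine, String secondLine) {
        return Duration.between(timestampOf(firstLine), timestampOf(secondLine));
    }

    public static void main(String[] args) {
        String mergeStart = "2008-11-24 08:39:19,350 INFO org.apache.hadoop.mapred.Merger: "
                + "Merging 13 sorted segments";
        String failure = "2008-11-24 08:59:11,798 WARN org.apache.hadoop.mapred.TaskRunner: "
                + "Parent died.  Exiting attempt_200811221852_0001_m_099999_0";

        Duration gap = between(mergeStart, failure);
        long timeoutMillis = 20 * 60_000L; // the 20 minute timeout configured for this job

        System.out.println("gap: " + gap.toMillis() + " ms");
        System.out.println("short of timeout by: " + (timeoutMillis - gap.toMillis()) + " ms");
    }
}
```

For the two lines used here the gap is 1192448 ms (19 min 52.448 s), i.e. within a few seconds of the 20 minute timeout.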


> map tasks timing out during merge phase
> ---------------------------------------
>
>                 Key: HADOOP-4714
>                 URL: https://issues.apache.org/jira/browse/HADOOP-4714
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.18.1
>            Reporter: Christian Kunz
>
> With compression of transient data turned on, some parts of the merge phase do not seem to report progress often enough.
> We see a lot of task failures during the merge phase, most of them timing out (even with a 20 min timeout).
