hadoop-mapreduce-issues mailing list archives

From "Qi Liu (JIRA)" <j...@apache.org>
Subject [jira] Created: (MAPREDUCE-1202) Checksum error on a single reducer does not trigger too many fetch failures for mapper during shuffle
Date Tue, 10 Nov 2009 20:17:28 GMT
Checksum error on a single reducer does not trigger too many fetch failures for mapper during shuffle
-----------------------------------------------------------------------------------------------------

                 Key: MAPREDUCE-1202
                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1202
             Project: Hadoop Map/Reduce
          Issue Type: Bug
          Components: jobtracker
    Affects Versions: 0.20.1
            Reporter: Qi Liu
            Priority: Critical


During one run of a large map-reduce job, a single reducer kept throwing a ChecksumException
when trying to shuffle map output from one mapper. The data on the mapper node for that particular
reducer is believed to be corrupted, since that mapper node has disk issues. However, even after
hundreds of retries to fetch the shuffle data for that particular reducer, and numerous failure
reports to the JobTracker because of this issue, the JobTracker still never declares the mapper
as having too many fetch failures.
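To make the expected behavior concrete, here is a minimal sketch of the kind of per-mapper fetch-failure accounting a JobTracker performs: each reducer report increments a counter for the offending map attempt, and once the counter crosses a threshold the map output is declared lost so the map can be re-run. This is an illustrative sketch, not the actual Hadoop 0.20 source; the class name `FetchFailureTracker`, the method `reportFetchFailure`, and the constant `MAX_FAILED_FETCH_REPORTS` are all hypothetical. The bug reported here is that checksum failures from a single reducer apparently never push such a counter over the threshold.

```java
import java.util.HashMap;
import java.util.Map;

// Hypothetical sketch of JobTracker-side fetch-failure accounting.
// Names and the threshold value are illustrative, not Hadoop identifiers.
public class FetchFailureTracker {
    // Illustrative threshold; the real JobTracker uses its own internal limit.
    static final int MAX_FAILED_FETCH_REPORTS = 3;

    // Count of fetch-failure reports received per map attempt.
    private final Map<String, Integer> failuresPerMapper = new HashMap<>();

    /**
     * Records one fetch-failure report from a reducer.
     * Returns true when the map attempt should be declared failed
     * ("too many fetch failures") and its output regenerated.
     */
    public boolean reportFetchFailure(String mapAttemptId) {
        int n = failuresPerMapper.merge(mapAttemptId, 1, Integer::sum);
        return n >= MAX_FAILED_FETCH_REPORTS;
    }

    public static void main(String[] args) {
        FetchFailureTracker tracker = new FetchFailureTracker();
        String mapper = "attempt_200911010621_0023_m_039676_0";
        // The first two reports stay below the threshold...
        boolean first = tracker.reportFetchFailure(mapper);
        boolean second = tracker.reportFetchFailure(mapper);
        // ...the third crosses it and the mapper should be failed.
        boolean third = tracker.reportFetchFailure(mapper);
        System.out.println(first + " " + second + " " + third);
    }
}
```

Under this model, the log below (failed fetch #113 from the same map attempt) should have tripped the threshold long ago, which is what makes the observed behavior look like a JobTracker-side accounting bug rather than a reducer-side one.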

Here is the log:
2009-11-10 19:55:05,655 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0
Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200911010621_0023_m_039676_0,
compressed len: 449177, decompressed len: 776729
2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 776729 bytes (449177
raw bytes) into RAM from attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,737 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200911010621_0023_m_039676_0
org.apache.hadoop.fs.ChecksumException: Checksum Error
	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0
copy failed: attempt_200911010621_0023_m_039676_0 from xx.yy.com
2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: org.apache.hadoop.fs.ChecksumException:
Checksum Error
	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)

2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200911010621_0023_r_005396_0:
Failed fetch #113 from attempt_200911010621_0023_m_039676_0
2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output
from attempt_200911010621_0023_m_039676_0 even after MAX_FETCH_RETRIES_PER_MAP retries...
 or it is a read error,  reporting to the JobTracker


-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

