hadoop-mapreduce-issues mailing list archives

From "Jothi Padmanabhan (JIRA)" <j...@apache.org>
Subject [jira] Commented: (MAPREDUCE-1202) Checksum error on a single reducer does not trigger too many fetch failures for mapper during shuffle
Date Wed, 11 Nov 2009 08:10:39 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-1202?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12776358#action_12776358 ]

Jothi Padmanabhan commented on MAPREDUCE-1202:
----------------------------------------------

This looks puzzling. Could you give us a few more details:
# Number of maps/reducers in your job
# Were the other reducers able to fetch outputs from the map in question successfully?
# Is this reducer able to pull other map outputs successfully?

There are some built-in checks so that the framework does not kill maps too aggressively
(sketched below), but retrying hundreds of times suggests something is definitely amiss.
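
To make those checks concrete, here is a minimal sketch of the JobTracker-side guard, loosely modeled on JobInProgress.fetchFailureNotification() in 0.20; the names and thresholds here are assumptions, not the shipped code:

{code:java}
import java.util.HashMap;
import java.util.Map;

class FetchFailureTracker {
  // Assumed thresholds: enough notifications, from a large enough
  // fraction of the running reducers, before the map output is
  // declared lost and the map re-executed.
  private static final int MIN_FETCH_FAILURE_NOTIFICATIONS = 3;
  private static final float MAX_ALLOWED_FETCH_FAILURES_PERCENT = 0.5f;

  private final Map<String, Integer> failuresPerMap = new HashMap<>();

  /** Returns true if the given map attempt should be declared failed. */
  boolean noteFetchFailure(String mapAttemptId, int runningReduces) {
    int failures = failuresPerMap.merge(mapAttemptId, 1, Integer::sum);
    float failureRate = (float) failures / runningReduces;
    // Both guards must trip. A single reducer retrying hundreds of times
    // drives the notification count well past the minimum, but in a job
    // with thousands of reducers the failure *rate* stays far below the
    // cutoff, so the map is never re-executed -- one plausible reading
    // of the symptom reported here.
    return failures >= MIN_FETCH_FAILURE_NOTIFICATIONS
        && failureRate >= MAX_ALLOWED_FETCH_FAILURES_PERCENT;
  }
}
{code}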

> Checksum error on a single reducer does not trigger too many fetch failures for mapper during shuffle
> -----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-1202
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-1202
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: jobtracker
>    Affects Versions: 0.20.1
>            Reporter: Qi Liu
>            Priority: Critical
>
> During one run of a large map-reduce job, a single reducer kept throwing a checksum exception
> when trying to shuffle from one mapper. The data on the mapper node for that particular reducer
> is believed to be corrupted, since there are disk issues on the mapper node. However, even
> after hundreds of retries to fetch the shuffle data for that particular reducer, and numerous
> reports to the job tracker about this issue, the mapper is still not declared as having too
> many fetch failures by the job tracker.
> Here is the log:
> 2009-11-10 19:55:05,655 INFO org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0 Scheduled 1 outputs (0 slow hosts and 0 dup hosts)
> 2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: header: attempt_200911010621_0023_m_039676_0, compressed len: 449177, decompressed len: 776729
> 2009-11-10 19:55:21,928 INFO org.apache.hadoop.mapred.ReduceTask: Shuffling 776729 bytes (449177 raw bytes) into RAM from attempt_200911010621_0023_m_039676_0
> 2009-11-10 19:55:38,737 INFO org.apache.hadoop.mapred.ReduceTask: Failed to shuffle from attempt_200911010621_0023_m_039676_0
> org.apache.hadoop.fs.ChecksumException: Checksum Error
> 	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
> 	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
> 	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
> 	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
> 	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
> 2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: attempt_200911010621_0023_r_005396_0 copy failed: attempt_200911010621_0023_m_039676_0 from xx.yy.com
> 2009-11-10 19:55:38,737 WARN org.apache.hadoop.mapred.ReduceTask: org.apache.hadoop.fs.ChecksumException: Checksum Error
> 	at org.apache.hadoop.mapred.IFileInputStream.doRead(IFileInputStream.java:152)
> 	at org.apache.hadoop.mapred.IFileInputStream.read(IFileInputStream.java:101)
> 	at org.apache.hadoop.io.compress.BlockDecompressorStream.getCompressedData(BlockDecompressorStream.java:104)
> 	at org.apache.hadoop.io.compress.BlockDecompressorStream.decompress(BlockDecompressorStream.java:82)
> 	at org.apache.hadoop.io.compress.DecompressorStream.read(DecompressorStream.java:74)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.shuffleInMemory(ReduceTask.java:1554)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.getMapOutput(ReduceTask.java:1433)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.copyOutput(ReduceTask.java:1286)
> 	at org.apache.hadoop.mapred.ReduceTask$ReduceCopier$MapOutputCopier.run(ReduceTask.java:1217)
> 2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Task attempt_200911010621_0023_r_005396_0: Failed fetch #113 from attempt_200911010621_0023_m_039676_0
> 2009-11-10 19:55:38,738 INFO org.apache.hadoop.mapred.ReduceTask: Failed to fetch map-output from attempt_200911010621_0023_m_039676_0 even after MAX_FETCH_RETRIES_PER_MAP retries... or it is a read error, reporting to the JobTracker
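
The last two lines come from the reducer-side retry accounting. A minimal sketch of that flow, with assumed identifiers (the real logic lives in ReduceTask's shuffle copier):

{code:java}
class MapOutputFetcher {
  // Assumed local retry budget per map before escalating.
  private static final int MAX_FETCH_RETRIES_PER_MAP = 6;

  private int failedFetches = 0;

  /** Called after each failed copy of one map's output. */
  void onFetchFailure(String mapAttemptId, boolean readError) {
    failedFetches++; // the "Failed fetch #113" counter in the log above
    // The reducer only *notifies* the JobTracker once its local retry
    // budget is exhausted (or right away on a read error). The decision
    // to actually fail the map stays with the JobTracker, which is why
    // fetch failure #113 by itself still did not kill the map.
    if (failedFetches >= MAX_FETCH_RETRIES_PER_MAP || readError) {
      reportToJobTracker(mapAttemptId);
    }
  }

  private void reportToJobTracker(String mapAttemptId) {
    // placeholder for the TaskTracker/JobTracker notification RPC
  }
}
{code}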

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

