hadoop-mapreduce-issues mailing list archives

From "Hudson (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6166) Reducers do not validate checksum of map outputs when fetching directly to disk
Date Tue, 16 Dec 2014 10:44:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14248116#comment-14248116 ]

Hudson commented on MAPREDUCE-6166:
-----------------------------------

FAILURE: Integrated in Hadoop-Yarn-trunk #777 (See [https://builds.apache.org/job/Hadoop-Yarn-trunk/777/])
MAPREDUCE-6166. Reducers do not validate checksum of map outputs when fetching directly to
disk. (Eric Payne via gera) (gera: rev af006937e8ba82f98f468dc7375fe89c2e0a7912)
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/test/java/org/apache/hadoop/mapreduce/task/reduce/TestFetcher.java
* hadoop-mapreduce-project/CHANGES.txt
* hadoop-mapreduce-project/hadoop-mapreduce-client/hadoop-mapreduce-client-core/src/main/java/org/apache/hadoop/mapreduce/task/reduce/OnDiskMapOutput.java


> Reducers do not validate checksum of map outputs when fetching directly to disk
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6166
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>             Fix For: 2.7.0
>
>         Attachments: MAPREDUCE-6166.v1.201411221941.txt, MAPREDUCE-6166.v2.201411251627.txt,
>                      MAPREDUCE-6166.v3.txt, MAPREDUCE-6166.v4.txt, MAPREDUCE-6166.v5.txt
>
>
> In very large map/reduce jobs (50000 maps, 2500 reducers), the intermediate map partition
> output gets corrupted on disk on the map side. If this corrupted map output is too large to
> shuffle in memory, the reducer streams it to disk without validating the checksum. In jobs
> this large, it could take hours before the reducer finally tries to read the corrupted file
> and fails. Since retries of the failed reduce attempt will also take hours, this delay in
> discovering the failure is multiplied greatly.
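
The idea behind the issue (validate the checksum while the bytes are being streamed to disk, so corruption is caught at fetch time) can be sketched as follows. This is only an illustration under stated assumptions, not the committed patch and not Hadoop's API: the class ChecksummedCopy, its methods, and the segment layout (payload followed by a 4-byte big-endian CRC32 trailer) are all hypothetical.

```java
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.util.zip.CRC32;

// Hypothetical helper, not part of Hadoop. Assumed segment layout:
// [payload bytes][4-byte big-endian CRC32 of the payload].
final class ChecksummedCopy {
    static final int TRAILER_BYTES = 4;

    /** Builds a segment by appending the CRC32 of the payload as a trailer. */
    static byte[] withTrailer(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + TRAILER_BYTES)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    /**
     * Copies the payload of a segment from in to out, computing the CRC32 on
     * the fly, then compares it with the trailer. Throws IOException as soon
     * as the copy finishes if the checksum does not match.
     */
    static void copyAndVerify(InputStream in, long totalLength, OutputStream out)
            throws IOException {
        if (totalLength < TRAILER_BYTES) {
            throw new IOException("segment shorter than checksum trailer");
        }
        long remaining = totalLength - TRAILER_BYTES;
        CRC32 crc = new CRC32();
        byte[] buf = new byte[64 * 1024];
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                throw new EOFException("segment truncated inside payload");
            }
            crc.update(buf, 0, n);   // checksum exactly the bytes written
            out.write(buf, 0, n);
            remaining -= n;
        }
        // Read the stored checksum that follows the payload.
        byte[] trailer = new byte[TRAILER_BYTES];
        int off = 0;
        while (off < TRAILER_BYTES) {
            int n = in.read(trailer, off, TRAILER_BYTES - off);
            if (n < 0) {
                throw new EOFException("segment truncated inside trailer");
            }
            off += n;
        }
        int expected = ByteBuffer.wrap(trailer).getInt();
        if (expected != (int) crc.getValue()) {
            throw new IOException("checksum mismatch on fetched segment");
        }
    }
}
```

In this sketch the corrupted bytes have already been written by the time the mismatch is detected, so a caller would be expected to discard the partial output and treat the fetch as failed, rather than discover the problem hours later when the file is finally read.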



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
