hadoop-mapreduce-issues mailing list archives

From "Eric Payne (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6166) Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
Date Sat, 06 Dec 2014 15:00:20 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14236822#comment-14236822 ]

Eric Payne commented on MAPREDUCE-6166:
---------------------------------------

Thank you very much, [~jira.shegalov].
{quote}
I am uploading a modified patch based on my previous review, only with the intention of seeing
what tests, if any, would catch a missing checksum in on-disk shuffle.
{quote}
The last segment of the test I added ({{TestFetcher#testCorruptedIFile}}) will catch a
missing or incorrect checksum when it tries to read the IFile that was shuffled to
disk by {{OnDiskMapOutput#shuffle}}.

> Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk
> -----------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6166
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>         Attachments: MAPREDUCE-6166-gera-missing-cs-test.patch, MAPREDUCE-6166.v1.201411221941.txt, MAPREDUCE-6166.v2.201411251627.txt, MAPREDUCE-6166.v3.txt
>
>
> In very large map/reduce jobs (50000 maps, 2500 reducers), the intermediate map partition
> output gets corrupted on disk on the map side. If this corrupted map output is too large to
> shuffle in memory, the reducer streams it to disk without validating the checksum. In jobs
> this large, it could take hours before the reducer finally tries to read the corrupted file
> and fails. Since retries of the failed reduce attempt will also take hours, this delay in
> discovering the failure is multiplied greatly.
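
For illustration only, and not taken from the attached patches: the idea behind validating during the shuffle, rather than hours later at read time, can be sketched as a copy loop that updates a checksum while it streams bytes to disk and compares the result with a trailer before declaring the transfer good. The class {{ChecksummedCopy}}, its method names, and the 8-byte CRC32 trailer framing are all hypothetical; Hadoop's real IFile layout and checksum classes differ.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.EOFException;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.util.zip.CRC32;

/** Hypothetical helper for illustration; not part of Hadoop. */
final class ChecksummedCopy {

    /** Builds payload + 8-byte CRC32 trailer (assumed framing, not the IFile format). */
    static byte[] withTrailer(byte[] payload) throws IOException {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        ByteArrayOutputStream bos = new ByteArrayOutputStream();
        DataOutputStream out = new DataOutputStream(bos);
        out.write(payload);
        out.writeLong(crc.getValue());
        out.flush();
        return bos.toByteArray();
    }

    /**
     * Streams payloadLength bytes from in to out, updating a CRC32 as it goes,
     * then reads the trailer and fails immediately on a mismatch.
     */
    static void copyAndVerify(InputStream in, OutputStream out, long payloadLength)
            throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[64 * 1024];
        long remaining = payloadLength;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                throw new EOFException("stream ended before the payload was complete");
            }
            crc.update(buf, 0, n);
            out.write(buf, 0, n);
            remaining -= n;
        }
        // readLong throws EOFException if the trailer is missing or truncated.
        long expected = new DataInputStream(in).readLong();
        if (expected != crc.getValue()) {
            throw new IOException("checksum mismatch: expected " + expected
                    + ", computed " + crc.getValue());
        }
    }
}
```

In this sketch a corrupted transfer fails as soon as the copy finishes, so the caller could retry the fetch right away instead of discovering the problem when the file is eventually read.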



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
