From: "Eric Payne (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Thu, 11 Dec 2014 17:10:15 +0000 (UTC)
Subject: [jira] [Updated] (MAPREDUCE-6166) Reducers do not catch corrupted map output transfers during shuffle if data shuffled directly to disk

     [ https://issues.apache.org/jira/browse/MAPREDUCE-6166?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eric Payne updated MAPREDUCE-6166:
----------------------------------
    Summary: Reducers do not catch corrupted map output transfers during shuffle if data shuffled directly to disk  (was: Reducers do not catch bad map output transfers during shuffle if data shuffled directly to disk)

> Reducers do not catch corrupted map output transfers during shuffle if data shuffled directly to disk
> -----------------------------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-6166
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6166
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 2.6.0
>            Reporter: Eric Payne
>            Assignee: Eric Payne
>         Attachments: MAPREDUCE-6166.v1.201411221941.txt, MAPREDUCE-6166.v2.201411251627.txt, MAPREDUCE-6166.v3.txt, MAPREDUCE-6166.v4.txt, MAPREDUCE-6166.v5.txt
>
>
> In very large map/reduce jobs (50000 maps, 2500 reducers), the intermediate map partition output gets corrupted on disk on the map side. If this corrupted map output is too large to shuffle in memory, the reducer streams it to disk without validating the checksum. In jobs this large, it could take hours before the reducer finally tries to read the corrupted file and fails. Since retries of the failed reduce attempt will also take hours, this delay in discovering the failure is multiplied greatly.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
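
To illustrate the failure mode described in the issue, here is a minimal sketch of verifying a checksum while a map output segment is streamed to disk, so that corruption shows up at copy time instead of hours later when the reducer first reads the file. Everything in it is an assumption made for illustration: the class ChecksummedShuffleCopy, its methods, and the segment layout (payload followed by a 4-byte big-endian CRC32 trailer) are hypothetical. It is not Hadoop's shuffle code or on-disk format, and it is not taken from the attached patches.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.InputStream;
import java.io.OutputStream;
import java.nio.ByteBuffer;
import java.nio.charset.StandardCharsets;
import java.util.zip.CRC32;

/** Hypothetical sketch; not Hadoop's shuffle implementation. */
class ChecksummedShuffleCopy {

    /** Assumed layout: payload bytes, then a 4-byte big-endian CRC32 trailer. */
    static final int TRAILER_BYTES = 4;

    /**
     * Streams payloadLength bytes from in to out, updating a CRC32 as the
     * bytes pass through, then reads the trailer and compares the two values.
     * Throws IOException on truncation or checksum mismatch.
     */
    static void copyAndVerify(InputStream in, OutputStream out, long payloadLength)
            throws IOException {
        CRC32 crc = new CRC32();
        byte[] buf = new byte[64 * 1024];
        long remaining = payloadLength;
        while (remaining > 0) {
            int n = in.read(buf, 0, (int) Math.min(buf.length, remaining));
            if (n < 0) {
                throw new IOException("Truncated transfer: " + remaining + " payload bytes missing");
            }
            crc.update(buf, 0, n);
            out.write(buf, 0, n);
            remaining -= n;
        }

        // Read the trailer fully; a single read() may return fewer bytes.
        byte[] trailer = new byte[TRAILER_BYTES];
        int off = 0;
        while (off < TRAILER_BYTES) {
            int n = in.read(trailer, off, TRAILER_BYTES - off);
            if (n < 0) {
                throw new IOException("Truncated transfer: checksum trailer missing");
            }
            off += n;
        }

        long expected = ByteBuffer.wrap(trailer).getInt() & 0xFFFFFFFFL;
        if (expected != crc.getValue()) {
            throw new IOException("Checksum mismatch: expected " + expected
                    + ", computed " + crc.getValue());
        }
    }

    /** Builds a payload-plus-trailer segment in the assumed layout (demo helper). */
    static byte[] withTrailer(byte[] payload) {
        CRC32 crc = new CRC32();
        crc.update(payload, 0, payload.length);
        return ByteBuffer.allocate(payload.length + TRAILER_BYTES)
                .put(payload)
                .putInt((int) crc.getValue())
                .array();
    }

    public static void main(String[] args) throws IOException {
        byte[] payload = "intermediate map output".getBytes(StandardCharsets.UTF_8);
        byte[] segment = withTrailer(payload);

        // Clean transfer: copy completes without an exception.
        copyAndVerify(new ByteArrayInputStream(segment), new ByteArrayOutputStream(), payload.length);

        // Flip one payload bit to simulate on-disk corruption on the map side.
        segment[0] ^= 0x01;
        try {
            copyAndVerify(new ByteArrayInputStream(segment), new ByteArrayOutputStream(), payload.length);
        } catch (IOException e) {
            System.out.println("Corruption detected during copy: " + e.getMessage());
        }
    }
}
```

In this sketch the bytes are still written to the output before the mismatch is known, so a caller would need to discard the partial file and refetch the segment when the exception is thrown. The point is only that the check happens at the end of the transfer, not at first read.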