Date: Wed, 7 Nov 2012 19:04:13 +0000 (UTC)
From: "Robert Joseph Evans (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Message-ID: <1116741034.82254.1352315053312.JavaMail.jiratomcat@arcas>
In-Reply-To: <1002803755.70242.1352146812492.JavaMail.jiratomcat@arcas>
Subject: [jira] [Commented] (MAPREDUCE-4772) Fetch failures can take way too long for a map to be restarted

    [ https://issues.apache.org/jira/browse/MAPREDUCE-4772?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13492595#comment-13492595 ]

Robert Joseph Evans commented on MAPREDUCE-4772:
------------------------------------------------

Oh, the only difference between 0.23 and trunk is that 0.23 has one extra include in JobImpl that was not needed by trunk.

> Fetch failures can take way too long for a map to be restarted
> --------------------------------------------------------------
>
>                 Key: MAPREDUCE-4772
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-4772
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>          Components: mrv2
>    Affects Versions: 0.23.4
>            Reporter: Robert Joseph Evans
>            Assignee: Robert Joseph Evans
>            Priority: Critical
>         Attachments: MR-4772-0.23.txt, MR-4772-trunk.txt
>
>
> In one particular case we saw an NM go down at just the right time, so that most of the reducers got the output of the map tasks, but not all of them.
> The ones that failed to get the output reported to the AM rather quickly that they could not fetch from the NM. But because the other reducers were still running, the AM would not relaunch the map task: no more than 50% of the running reducers had reported fetch failures. Then, because of the exponential back-off for fetches on the reducers, it took 1 hour 45 minutes for the reduce tasks to hit another 10 fetch failures and report in again. At that point the other reducers had finished and the job relaunched the map task. If the reducers had still been running at 1:45, I have no idea how long it would have taken for each of the tasks to get to 30 fetch failures.
> We need to trigger the map relaunch based on the percentage of reducers shuffling, not the percentage of reducers running. We also need a maximum limit on the back-off, so that a reducer never waits for days to try to fetch map output.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators
For more information on JIRA, see: http://www.atlassian.com/software/jira
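
The two changes proposed in the description can be sketched roughly as follows. This is only an illustration under stated assumptions, not the attached patch: the class name, the method names, the 10-second base delay, and the 5-minute cap are all hypothetical. Only the 50% threshold comes from the description above.

```java
// Hypothetical sketch; none of these names come from the Hadoop code base.
final class FetchFailureSketch {
    // The 50% threshold is taken from the issue description.
    static final double MAX_FAILURE_FRACTION = 0.5;
    // Assumed values for illustration only.
    static final long BASE_DELAY_MS = 10_000L;
    static final long MAX_DELAY_MS = 5 * 60_000L;

    /**
     * Decide whether to relaunch the map, measuring failures against the
     * reducers that are still shuffling rather than all running reducers.
     */
    public static boolean shouldRelaunchMap(int reducersReportingFailure,
                                            int shufflingReducers) {
        if (shufflingReducers <= 0) {
            return false; // nobody is waiting on this map output
        }
        double fraction = (double) reducersReportingFailure / shufflingReducers;
        return fraction > MAX_FAILURE_FRACTION;
    }

    /** Exponential back-off that is never allowed to exceed MAX_DELAY_MS. */
    public static long backoffMillis(int failedAttempts) {
        if (failedAttempts <= 0) {
            return 0L;
        }
        // Bound the shift so the intermediate value cannot overflow a long.
        int exponent = Math.min(failedAttempts - 1, 30);
        long delay = BASE_DELAY_MS << exponent;
        return Math.min(delay, MAX_DELAY_MS);
    }

    public static void main(String[] args) {
        // Made-up numbers: 100 reducers running, 10 still shuffling, 8 failing.
        System.out.println(shouldRelaunchMap(8, 100)); // measured against running reducers
        System.out.println(shouldRelaunchMap(8, 10));  // measured against shuffling reducers
        System.out.println(backoffMillis(50));         // capped delay
    }
}
```

With the made-up numbers in main, 8 failing reducers are only 8% of the 100 running reducers but 80% of the 10 that are still shuffling, so only the second check crosses the threshold. The cap keeps the retry delay bounded no matter how many fetch attempts have failed.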