From: "Todd Lipcon (Assigned) (JIRA)"
To: mapreduce-issues@hadoop.apache.org
Reply-To: mapreduce-issues@hadoop.apache.org
Date: Tue, 18 Oct 2011 00:24:13 +0000 (UTC)
Message-ID: <592350695.3117.1318897453715.JavaMail.tomcat@hel.zones.apache.org>
In-Reply-To: <597880026.12090.1318550531773.JavaMail.tomcat@hel.zones.apache.org>
Subject: [jira] [Assigned] (MAPREDUCE-3184) Improve handling of fetch failures when a tasktracker is not responding on HTTP

     [ 
https://issues.apache.org/jira/browse/MAPREDUCE-3184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Todd Lipcon reassigned MAPREDUCE-3184:
--------------------------------------

    Assignee: Todd Lipcon

> Improve handling of fetch failures when a tasktracker is not responding on HTTP
> -------------------------------------------------------------------------------
>
>                 Key: MAPREDUCE-3184
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-3184
>             Project: Hadoop Map/Reduce
>          Issue Type: Improvement
>          Components: jobtracker
>    Affects Versions: 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Todd Lipcon
>         Attachments: mr-3184.txt
>
>
> On a 100-node cluster, we had an issue where one of the TaskTrackers was hit by MAPREDUCE-2386 and stopped responding to fetches. The behavior observed was the following:
> - Every reducer would try to fetch the same map task, and fail after ~13 minutes.
> - At that point, all reducers would report this failed fetch to the JT for the same task, and the task would be re-run.
> - Meanwhile, the reducers would move on to the next map task that ran on the TT, and hang for another 13 minutes.
> The job essentially made no progress for hours, as each map task that ran on the bad node was serially marked failed.
> To combat this issue, we should introduce a second type of failed fetch notification, used when the TT does not respond at all (i.e., SocketTimeoutException, etc.). These fetch failure notifications should count against the TT at large, rather than a single task. If more than half of the reducers report such an issue for a given TT, then all of the tasks from that TT should be re-run.
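
The proposed rule (count "no response" fetch failures against the TaskTracker as a whole, and re-run all of its tasks once more than half of the reducers have reported it) could be sketched roughly as follows. This is only an illustration under stated assumptions, not the patch in mr-3184.txt and not actual JobTracker code: the class name TrackerFetchFailureTally, the method reportUnresponsive, and the use of plain strings for tracker and reducer identifiers are all hypothetical.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical tally of "tracker did not respond" reports; not Hadoop code. */
final class TrackerFetchFailureTally {
    private final int totalReducers;

    // Tracker name -> distinct reducers that reported a connection-level
    // failure (for example a SocketTimeoutException) against that tracker.
    private final Map<String, Set<String>> reportersByTracker = new HashMap<>();

    TrackerFetchFailureTally(int totalReducers) {
        if (totalReducers <= 0) {
            throw new IllegalArgumentException("totalReducers must be positive");
        }
        this.totalReducers = totalReducers;
    }

    /**
     * Records one report from a reducer against a tracker.
     * Returns true once strictly more than half of the job's reducers have
     * reported this tracker, i.e. when all of its map outputs should be re-run.
     */
    boolean reportUnresponsive(String tracker, String reducerId) {
        Set<String> reporters =
            reportersByTracker.computeIfAbsent(tracker, k -> new HashSet<>());
        reporters.add(reducerId); // repeated reports from one reducer count once
        return reporters.size() * 2 > totalReducers;
    }
}
```

Keeping a set of reducer IDs per tracker, rather than a bare counter, means a single reducer that retries repeatedly cannot push a tracker over the threshold on its own.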