From: "Aaron Kimball (JIRA)"
To: core-dev@hadoop.apache.org
Date: Mon, 8 Jun 2009 13:44:07 -0700 (PDT)
Subject: [jira] Commented: (HADOOP-5985) A single slow (but not dead) map TaskTracker impedes MapReduce progress
Message-ID: <1536919889.1244493847817.JavaMail.jira@brutus>
In-Reply-To: <901992913.1244247967376.JavaMail.jira@brutus>

[ https://issues.apache.org/jira/browse/HADOOP-5985?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717401#action_12717401 ]

Aaron Kimball commented on HADOOP-5985:
---------------------------------------

I was under the impression that if a map task died before delivering output to all of the reducers, then after the map task is re-executed elsewhere, all reducers roll back and re-pull from the newer version of the mapper. I could be mistaken, though.

If that's the case, then we just need to run another round of speculative execution during the final shuffle, even if the map tasks themselves were marked as "complete," and modify the reducers to try to pull from a list of eligible mappers instead of just a single node.

> A single slow (but not dead) map TaskTracker impedes MapReduce progress
> -----------------------------------------------------------------------
>
>                 Key: HADOOP-5985
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5985
>             Project: Hadoop Core
>          Issue Type: Bug
>    Affects Versions: 0.18.3
>            Reporter: Aaron Kimball
>
> We see cases where there may be a large number of mapper nodes running many tasks (e.g., a thousand). The reducers will pull 980 of the map task intermediate files down, but will be unable to retrieve the final intermediate shards from the last node. The TaskTracker on that node returns data to reducers either slowly or not at all, but its heartbeat messages make it back to the JobTracker -- so the JobTracker doesn't mark the tasks as failed. Manually stopping the offending TaskTracker migrates the tasks to other nodes, where the shuffling process finishes very quickly. Left on its own, the job can take hours to unjam itself.
> We need a mechanism for reducers to provide feedback to the JobTracker that one of the mapper nodes should be regarded as lost.
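
The two ideas discussed above (reducers reporting an unresponsive mapper node back to the JobTracker, and reducers pulling from a list of eligible mappers rather than a single node) can be sketched roughly as follows. This is only an illustration under stated assumptions: FetchFailureTracker, MapOutputFetcher, the reducer-count threshold, and the host names are all hypothetical and are not classes or settings in Hadoop 0.18.3. The actual fetch is stubbed out as a predicate.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Optional;
import java.util.Set;
import java.util.function.Predicate;

/** Hypothetical JobTracker-side bookkeeping; not an actual Hadoop class. */
class FetchFailureTracker {
    // Assumed tunable: how many distinct reducers must complain about a node.
    private final int reducerThreshold;
    // TaskTracker host -> ids of reducers that failed to fetch from it.
    private final Map<String, Set<String>> complaints = new HashMap<>();

    FetchFailureTracker(int reducerThreshold) {
        this.reducerThreshold = reducerThreshold;
    }

    /** Records one failed fetch; returns true once the node should be regarded as lost. */
    synchronized boolean reportFailure(String trackerHost, String reducerId) {
        Set<String> reducers = complaints.computeIfAbsent(trackerHost, k -> new HashSet<>());
        reducers.add(reducerId); // a set, so one noisy reducer counts only once
        return reducers.size() >= reducerThreshold;
    }

    synchronized boolean isLost(String trackerHost) {
        Set<String> reducers = complaints.get(trackerHost);
        return reducers != null && reducers.size() >= reducerThreshold;
    }
}

/** Hypothetical reducer-side helper: try each eligible source of a map output in order. */
class MapOutputFetcher {
    static Optional<String> fetchFromAny(List<String> eligibleHosts,
                                         Predicate<String> tryFetch,
                                         FetchFailureTracker tracker,
                                         String reducerId) {
        for (String host : eligibleHosts) {
            if (tracker.isLost(host)) {
                continue; // enough reducers have already given up on this node
            }
            if (tryFetch.test(host)) {
                return Optional.of(host); // fetched successfully from this host
            }
            tracker.reportFailure(host, reducerId); // feedback toward the JobTracker
        }
        return Optional.empty(); // no eligible host could serve the output
    }
}

class SlowTrackerDemo {
    public static void main(String[] args) {
        FetchFailureTracker tracker = new FetchFailureTracker(2);
        List<String> hosts = Arrays.asList("tt-slow", "tt-backup");
        // Stub: the slow node never serves data; the re-executed copy does.
        Predicate<String> tryFetch = host -> !host.equals("tt-slow");

        System.out.println(MapOutputFetcher.fetchFromAny(hosts, tryFetch, tracker, "reduce-0"));
        System.out.println(MapOutputFetcher.fetchFromAny(hosts, tryFetch, tracker, "reduce-1"));
        System.out.println("tt-slow regarded as lost: " + tracker.isLost("tt-slow"));
    }
}
```

Counting distinct reducers rather than raw failures is a design choice in this sketch: it keeps a single reducer with its own network trouble from getting a healthy node marked as lost. The sketch also assumes a second copy of the map output already exists on another node, which in practice would require the extra round of speculative execution proposed in the comment.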