Message-ID: <12553685.1190714990657.JavaMail.jira@brutus>
Date: Tue, 25 Sep 2007 03:09:50 -0700 (PDT)
From: "Arun C Murthy (JIRA)"
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-1930) Too many fetch-failures issue
In-Reply-To: <1210130.1190320130985.JavaMail.jira@brutus>

     [ https://issues.apache.org/jira/browse/HADOOP-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1930:
----------------------------------

    Fix Version/s: 0.15.0
           Status: Patch Available  (was: Open)

> Too many fetch-failures issue
> -----------------------------
>
>                 Key: HADOOP-1930
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1930
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1930_1_20070922.patch, HADOOP-1930_2_20070925.patch
>
>
> A job with 4000 maps on a 1400 node cluster (3 tasks per node allowed) had a lot (150) of 'Too many fetch-failures' map failures.
> From the jobtracker log it looks as if it got confused about which tasktracker actually ran the task:
> (In the following log output, I replaced the corresponding tasktracker nodes with ***node_assigned*** and ***node_fetch_attempt***, and they are different)
> grep task_200709170247_0018_m_000009_0 hadoop-xxx-jobtracker-node.log.2007-09-19:
> 2007-09-19 15:52:26,907 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200709170247_0018_m_000009_0' to tip tip_200709170247_0018_m_000009, for tracker 'tracker_***node_assigned_***:/127.0.0.1:54523'
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200709170247_0018_m_000009_0' to hdfs://location
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200709170247_0018_m_000009_0' has completed tip_200709170247_0018_m_000009 successfully.
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200709170247_0018_m_000009_0' has completed succesfully
> 2007-09-19 16:21:07,825 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:23:23,483 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200709170247_0018_m_000009_0 ... killing it
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200709170247_0018_m_000009_0: Too many fetch-failures
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200709170247_0018_m_000009_0' has been lost.
> 2007-09-19 16:25:07,184 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200709170247_0018_m_000009_0' from 'tracker_***node_fetch_attempt***:/127.0.0.1:48818'
> 2007-09-19 21:40:00,235 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200709170247_0018_m_000009_0' from 'tracker_***node_fetch_attempt***:/127.0.0.1:48818'

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
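
The tracker mismatch described in the report (task added on one node, later removed from another) could be checked across many tasks with a small script. This is only a sketch: the regexes assume the JobTracker log wording quoted above, and the function names (`tracker_host`, `hosts_for_task`, `is_mismatched`) are illustrative and not part of Hadoop.

```python
import re

# Regexes assume the JobTracker log wording quoted in the report above.
ADDED = re.compile(r"Adding task '([^']+)' to tip \S+ for tracker '([^']+)'")
REMOVED = re.compile(r"Removed completed task '([^']+)' from '([^']+)'")


def tracker_host(tracker):
    """Return the node part of a name like 'tracker_host:/127.0.0.1:54523'."""
    name = tracker.split(":", 1)[0]
    return name[len("tracker_"):] if name.startswith("tracker_") else name


def hosts_for_task(lines, task_id):
    """Collect the hosts a task was assigned to and the hosts it was removed from."""
    assigned, removed = set(), set()
    for line in lines:
        for pattern, hosts in ((ADDED, assigned), (REMOVED, removed)):
            match = pattern.search(line)
            if match and match.group(1) == task_id:
                hosts.add(tracker_host(match.group(2)))
    return assigned, removed


def is_mismatched(lines, task_id):
    """True if the task was removed from a host it was never assigned to."""
    assigned, removed = hosts_for_task(lines, task_id)
    return bool(removed - assigned)
```

Passing the grep output for a single task ID to `is_mismatched` would flag the situation in the log above, where the "Adding task" and "Removed completed task" lines name different nodes.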