Mailing-List: contact hadoop-dev-help@lucene.apache.org; run by ezmlm
Precedence: bulk
Reply-To: hadoop-dev@lucene.apache.org
Message-ID: <12553685.1190714990657.JavaMail.jira@brutus>
Date: Tue, 25 Sep 2007 03:09:50 -0700 (PDT)
From: "Arun C Murthy (JIRA)" <jira@apache.org>
To: hadoop-dev@lucene.apache.org
Subject: [jira] Updated: (HADOOP-1930) Too many fetch-failures issue
In-Reply-To: <1210130.1190320130985.JavaMail.jira@brutus>
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit


     [ https://issues.apache.org/jira/browse/HADOOP-1930?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Arun C Murthy updated HADOOP-1930:
----------------------------------

    Fix Version/s: 0.15.0
           Status: Patch Available  (was: Open)

> Too many fetch-failures issue
> -----------------------------
>
>                 Key: HADOOP-1930
>                 URL: https://issues.apache.org/jira/browse/HADOOP-1930
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.15.0
>            Reporter: Christian Kunz
>            Assignee: Arun C Murthy
>            Priority: Blocker
>             Fix For: 0.15.0
>
>         Attachments: HADOOP-1930_1_20070922.patch, HADOOP-1930_2_20070925.patch
>
>
> A job with 4000 maps on a 1400-node cluster (3 tasks allowed per node) had many (150) map failures reported as 'Too many fetch-failures'.
> From the jobtracker log, it looks as if the jobtracker got confused about which tasktracker actually ran the task.
> (In the following log output, I replaced the corresponding tasktracker node names with ***node_assigned*** and ***node_fetch_attempt***; the two are different.)
> grep task_200709170247_0018_m_000009_0 hadoop-xxx-jobtracker-node.log.2007-09-19:
> 2007-09-19 15:52:26,907 INFO org.apache.hadoop.mapred.JobTracker: Adding task 'task_200709170247_0018_m_000009_0' to tip tip_200709170247_0018_m_000009, for tracker 'tracker_***node_assigned***:/127.0.0.1:54523'
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.TaskRunner: Saved output of task 'task_200709170247_0018_m_000009_0' to hdfs://location
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.JobInProgress: Task 'task_200709170247_0018_m_000009_0' has completed tip_200709170247_0018_m_000009 successfully.
> 2007-09-19 15:58:03,111 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200709170247_0018_m_000009_0' has completed succesfully
> 2007-09-19 16:21:07,825 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #1 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:23:23,483 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #2 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.JobInProgress: Failed fetch notification #3 for task task_200709170247_0018_m_000009_0
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.JobInProgress: Too many fetch-failures for output of task: task_200709170247_0018_m_000009_0 ... killing it
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.TaskInProgress: Error from task_200709170247_0018_m_000009_0: Too many fetch-failures
> 2007-09-19 16:25:07,182 INFO org.apache.hadoop.mapred.TaskInProgress: Task 'task_200709170247_0018_m_000009_0' has been lost.
> 2007-09-19 16:25:07,184 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200709170247_0018_m_000009_0' from 'tracker_***node_fetch_attempt***:/127.0.0.1:48818'
> 2007-09-19 21:40:00,235 INFO org.apache.hadoop.mapred.JobTracker: Removed completed task 'task_200709170247_0018_m_000009_0' from 'tracker_***node_fetch_attempt***:/127.0.0.1:48818'
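
The manual grep above can be repeated across a whole jobtracker log to see how many tasks show the same pattern. The following is only a rough sketch, not part of the attached patches: it assumes the log contains the two JobTracker message shapes quoted above ("Adding task ... for tracker ..." and "Removed completed task ... from ..."), and the function name find_tracker_mismatches is made up for illustration. It compares the full tracker strings, port included, and reports tasks that were removed from a tracker other than the one they were assigned to.

```python
import re
from collections import defaultdict

# Patterns modelled on the two JobTracker messages quoted in the report.
# Other Hadoop versions may word these lines differently (assumption).
ADDED = re.compile(
    r"Adding task '(?P<task>[^']+)' to tip \S+ for tracker '(?P<tracker>[^']+)'"
)
REMOVED = re.compile(
    r"Removed completed task '(?P<task>[^']+)' from '(?P<tracker>[^']+)'"
)


def find_tracker_mismatches(lines):
    """Return {task: (assigned_tracker, [other_trackers])} for every task
    removed from a tracker different from the one it was assigned to."""
    assigned = {}                 # task id -> tracker named in "Adding task"
    removed = defaultdict(set)    # task id -> trackers named in "Removed completed task"

    for line in lines:
        match = ADDED.search(line)
        if match:
            assigned[match.group("task")] = match.group("tracker")
            continue
        match = REMOVED.search(line)
        if match:
            removed[match.group("task")].add(match.group("tracker"))

    mismatches = {}
    for task, tracker in assigned.items():
        others = removed.get(task, set()) - {tracker}
        if others:
            mismatches[task] = (tracker, sorted(others))
    return mismatches
```

Any iterable of lines works as input, so an open log file can be passed directly, for example find_tracker_mismatches(open(path)) with path pointing at the jobtracker log. A task that was reassigned legitimately would also be flagged, so the result is a list of candidates to inspect rather than proof of the bug.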
