From: "Arun C Murthy (JIRA)"
To: hadoop-dev@lucene.apache.org
Date: Mon, 5 Feb 2007 15:05:07 -0800 (PST)
Subject: [jira] Commented: (HADOOP-979) speculative task failure can kill jobs
Message-ID: <16426113.1170716707402.JavaMail.jira@brutus>
In-Reply-To: <29855646.1170712265497.JavaMail.jira@brutus>
Reply-To: hadoop-dev@lucene.apache.org

    [ https://issues.apache.org/jira/browse/HADOOP-979?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#action_12470392 ]

Arun C Murthy commented on HADOOP-979:
--------------------------------------

Another point to ponder is that we might
be getting too aggressive in launching speculative attempts; one way around that might be to have a larger backoff for each successive attempt, ensuring we don't launch too many of them too quickly. To illustrate:

task_1_0 is launched
task_1_1 is launched when task_1_0's progress falls behind the other tasks by x%
task_1_2 is launched when task_1_0's progress falls behind the other tasks by (x + x/8)%
task_1_3 is launched when task_1_0's progress falls behind the other tasks by (x + x/4)%
task_1_4 is launched when task_1_0's progress falls behind the other tasks by (x + x/2)%

Thoughts?

> speculative task failure can kill jobs
> --------------------------------------
>
>                 Key: HADOOP-979
>                 URL: https://issues.apache.org/jira/browse/HADOOP-979
>             Project: Hadoop
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.11.0
>            Reporter: Owen O'Malley
>             Fix For: 0.12.0
>
>
> We had a case where the random writer example was killed by speculative execution. It happened like:
> task_0001_m_000123_0 -> starts
> task_0001_m_000123_1 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_2 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_3 -> starts and fails because attempt 0 is creating the file
> task_0001_m_000123_4 -> starts and fails because attempt 0 is creating the file
> job_0001 is killed because map_000123 failed 4 times.
>
> From this experience, I think we should change the scheduling so that:
> 1. Tasks are only allowed 1 speculative attempt.
> 2. TIPs don't kill jobs until they have 4 failures AND the last task under that TIP fails.
> Thoughts?

--
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.
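The widening backoff schedule in Arun's comment can be sketched as follows. This is a minimal illustration only; the class and method names are hypothetical and are not actual Hadoop APIs:

```java
// Sketch of the widening speculative-launch backoff proposed above.
// Hypothetical helper, not actual JobTracker code: given a base lag
// threshold x (in percentage points of progress), returns how far
// behind its siblings a task must fall before the n-th speculative
// attempt is launched, following the schedule in the comment:
//   n = 1 -> x,  n = 2 -> x + x/8,  n = 3 -> x + x/4,  n = 4 -> x + x/2
// Each successive increment doubles, so later attempts are launched
// ever more reluctantly.
public class SpeculativeBackoff {
    static double lagThreshold(int attempt, double x) {
        if (attempt <= 1) {
            return x;
        }
        // Divisor is 8 for attempt 2, 4 for attempt 3, 2 for attempt 4.
        return x + x / (1 << (5 - attempt));
    }

    public static void main(String[] args) {
        double x = 8.0; // example base threshold: 8 percentage points
        for (int n = 1; n <= 4; n++) {
            System.out.printf("attempt %d launches at a lag of %.1f%%%n",
                n, lagThreshold(n, x));
        }
    }
}
```

The schedule is only defined for the four attempts shown; any cap on the total number of speculative attempts would sit alongside this threshold check.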
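Owen's proposed change to the kill rule in the quoted issue could likewise be expressed as a predicate. Again a hypothetical sketch under the stated assumptions, not the actual TaskInProgress implementation:

```java
// Hypothetical sketch of the proposed TIP kill rule: a TIP should only
// fail the job once it has accumulated at least 4 failed attempts AND
// no attempt under it is still running -- so a live attempt (e.g. the
// original that holds the output file) keeps the job alive even while
// its speculative siblings fail.
public class TipKillRule {
    static boolean shouldKillJob(int failedAttempts, int runningAttempts) {
        return failedAttempts >= 4 && runningAttempts == 0;
    }
}
```

Under this rule the random-writer scenario above survives: attempts 1 through 4 fail, but attempt 0 is still running, so the job is not killed.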