hadoop-common-dev mailing list archives

From "Gautam Kowshik (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-5949) JobTracker should give preference to failed tasks over virgin tasks so as to terminate the job ASAP if it is eventually going to fail.
Date Mon, 01 Jun 2009 12:58:07 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-5949?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12715076#action_12715076 ]

Gautam Kowshik commented on HADOOP-5949:
----------------------------------------

Would it make sense to provide a way to force a kill from within the job? Once
the user's MapReduce job detects that it has reached a state from which it cannot recover,
it could hint to (or force) the JT to end the job, an emergency button of sorts. This would
let user implementations get out of bad jobs ASAP and improve cluster utilization.


> JobTracker should give preference to failed tasks over virgin tasks so as to terminate
the job ASAP if it is eventually going to fail. 
> ---------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: HADOOP-5949
>                 URL: https://issues.apache.org/jira/browse/HADOOP-5949
>             Project: Hadoop Core
>          Issue Type: Improvement
>          Components: mapred
>            Reporter: Gautam Kowshik
>            Assignee: Devaraj Das
>
> Case in point: I have 1585 maps and 160 slots (40 nodes). The job is such that all
maps fail within 2-3 minutes. The job takes forever to realise that it is bad. It took
2526 failures for a single task to reach 4 failed attempts. 
> As I understand it, the JT currently prefers a failed task if and only if a task tracker
with a split replica for that map comes asking for a task. In fact, there may not be a single
TT in the mapred cluster that has a replica for the splits used in this job (pre-0.20).
This delays the job failure considerably and hence degrades overall cluster utilization. If
I'm on a shared cluster with many jobs waiting for this one to fail, that's bad. 
> The JT should prefer a failed task much earlier, rather than waiting for a data-local TT to
come around asking. 
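
To illustrate why the ordering matters, here is a rough toy model in Python. It is not
JobTracker code and does not reproduce the 2526 figure above. It assumes every attempt
fails, that tasks run in waves of `slots` attempts, and that the job is declared failed
at the end of the first wave in which some task reaches `max_attempts` failures. The
function name and the "least-attempted first" stand-in for preferring virgin tasks are
assumptions made for this sketch only.

```python
def failures_until_job_fails(num_maps, slots, max_attempts, prefer_failed):
    """Count failed attempts before a doomed job is declared failed (toy model)."""
    failures = [0] * num_maps  # failed attempts so far, per map task
    total = 0
    while True:
        if prefer_failed:
            # Retry the tasks that have already failed the most.
            order = sorted(range(num_maps), key=lambda t: -failures[t])
        else:
            # Stand-in for "virgin tasks first": least-attempted tasks go first.
            order = sorted(range(num_maps), key=lambda t: failures[t])
        # One wave: fill every slot, and every attempt fails (assumption).
        for task in order[:slots]:
            failures[task] += 1
            total += 1
        if max(failures) >= max_attempts:
            return total


if __name__ == "__main__":
    # Numbers from the report: 1585 maps, 160 slots, 4 attempts per task.
    print(failures_until_job_fails(1585, 160, 4, prefer_failed=True))
    print(failures_until_job_fails(1585, 160, 4, prefer_failed=False))
```

Under these assumptions the failed-first ordering ends the job after 4 waves, while the
least-attempted-first ordering has to cycle through almost every map three times before
any task reaches its fourth failure.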

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

