hadoop-common-dev mailing list archives

From "Amareshwari Sriramadasu (JIRA)" <j...@apache.org>
Subject [jira] Commented: (HADOOP-3462) reduce task failures during shuffling should not count against number of retry attempts
Date Mon, 28 Jul 2008 09:23:31 GMT

    [ https://issues.apache.org/jira/browse/HADOOP-3462?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12617393#action_12617393 ]

Amareshwari Sriramadasu commented on HADOOP-3462:

There could be some problems with the suggested approach. For example, a faulty task could
keep writing to scratch space until the disk runs out of space. Currently such TIPs (tasks
in progress) fail after four attempts, which kills the job; this is the intended behavior.
Marking those failures FAILED_INTERNAL, however, would not kill the job; it would just get
all the task trackers blacklisted.

Similarly, if map tasks generate very large map output files, or reduce tasks generate very
large merge files, the job should be killed instead of the map or reduce being tried on
every tasktracker in turn.
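The concern in the two paragraphs above can be made concrete with a small, simplified model. This is a hypothetical sketch, not Hadoop's JobTracker or TaskInProgress code: the class and method names are invented. It assumes that every attempt of the faulty TIP fills the disk and fails, that each retry lands on a different tasktracker, and it ignores per-tracker failure thresholds.

```java
import java.util.HashSet;
import java.util.Set;

/**
 * Hypothetical model (not Hadoop code) of how a TIP whose attempts always
 * run out of disk space is treated under two accounting policies.
 * Requires Java 16+ for the record type.
 */
public class FailureAccountingSketch {

    /** Result of scheduling one always-failing TIP. */
    record Outcome(boolean tipFailed, int attemptsRun, int trackersBlacklisted) {}

    /**
     * @param countAgainstRetries true  = current behaviour: each FAILED attempt
     *                                    counts toward maxAttempts;
     *                            false = proposed FAILED_INTERNAL: the attempt is
     *                                    not charged to the TIP, but the tracker
     *                                    that ran it is blacklisted
     */
    static Outcome run(boolean countAgainstRetries, int maxAttempts, int numTrackers) {
        Set<Integer> blacklisted = new HashSet<>();
        int countedFailures = 0;
        int attempts = 0;
        // Simplification: each retry is scheduled on the next, fresh tracker.
        for (int tracker = 0; tracker < numTrackers; tracker++) {
            attempts++; // this attempt fills the local disk and fails
            if (countAgainstRetries) {
                countedFailures++;
                if (countedFailures >= maxAttempts) {
                    // TIP is failed, which in turn fails the job.
                    return new Outcome(true, attempts, blacklisted.size());
                }
            } else {
                // FAILED_INTERNAL: no retry budget is consumed; only the tracker suffers.
                blacklisted.add(tracker);
            }
        }
        // Every tracker was tried and the TIP was never failed.
        return new Outcome(false, attempts, blacklisted.size());
    }

    public static void main(String[] args) {
        System.out.println("FAILED:          " + run(true, 4, 10));
        System.out.println("FAILED_INTERNAL: " + run(false, 4, 10));
    }
}
```

With `maxAttempts = 4` and ten trackers, the first policy fails the TIP after four attempts. The FAILED_INTERNAL policy instead runs the task on all ten trackers and blacklists every one of them without ever failing the TIP, which is the outcome the comment warns about.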

To address this:
1. One solution could be a configuration property, _mapred.map/reduce.max.internal.failures_.
A TIP would be failed once the number of internal failures of its attempts exceeds
_mapred.map/reduce.max.internal.failures_. We would then have to decide on a default value
for it, and this approach could take longer to kill the job.
2. Another solution could be to limit the disk space available to a task (something along
the lines of a per-process ulimit?) and fail the task if it exceeds its allotted space. Here,
however, it would be difficult to keep track of the disk space used by the task.


> reduce task failures during shuffling should not count against number of retry attempts
> ---------------------------------------------------------------------------------------
>                 Key: HADOOP-3462
>                 URL: https://issues.apache.org/jira/browse/HADOOP-3462
>             Project: Hadoop Core
>          Issue Type: Bug
>          Components: mapred
>    Affects Versions: 0.16.3
>            Reporter: Christian Kunz
>            Assignee: Amareshwari Sriramadasu
>             Fix For: 0.19.0

This message is automatically generated by JIRA.
You can reply to this email to add a comment to the issue online.