hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (Commented) (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere
Date Wed, 23 Nov 2011 18:13:42 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13156015#comment-13156015 ]

Robert Joseph Evans commented on MAPREDUCE-2324:

I believe the same issue could happen on the mapper side, although I have never seen it
actually happen.  We did see it happen on the reducer side in a few instances, which
is why I put in the patch.  We could do something similar on the mapper side, but it was not
as urgent.

Yes, in theory, if getEstimatedReduceInputSize worked correctly and there was only one
node with low disk space, then that failure could have been avoided.  I originally looked at
implementing something that would try to assign the task to several different nodes before
giving up.  But at what point do we say that we have tried enough?  Answering that question,
along with the memory required to track this for every single task attempt, resulted in a
very complicated solution.  This fix was selected because it is very simple
compared to the other ones (less risk of breaking something).  Also, getEstimatedReduceInputSize
has some issues of its own; Arun can describe them better than I can.  He wanted to push for
us to address those issues as the ultimate fix for this.
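
To make the bookkeeping cost of the "try several nodes before giving up" idea concrete, here is a minimal sketch.  It is not the code from any of the attached patches and not Hadoop's JobTracker code; the class RejectionTracker, its methods, and the "give up once every tracker has said no" rule are all assumptions made up for illustration.

```java
import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

/** Hypothetical bookkeeping for illustration only; not part of Hadoop. */
public class RejectionTracker {
    private final int clusterSize;

    // One set of tracker names per pending task. This map is the memory
    // cost mentioned above: it grows with (pending tasks x trackers tried).
    private final Map<String, Set<String>> rejectedBy = new HashMap<>();

    public RejectionTracker(int clusterSize) {
        this.clusterSize = clusterSize;
    }

    /**
     * Records that a tracker could not fit the task.
     * Returns true once every tracker in the cluster has rejected it
     * (assumed stopping rule: "tried enough" means "tried all").
     */
    public boolean recordRejection(String taskId, String trackerName) {
        Set<String> trackers =
            rejectedBy.computeIfAbsent(taskId, k -> new HashSet<>());
        trackers.add(trackerName);
        return trackers.size() >= clusterSize;
    }

    /** Drops the bookkeeping for a task once it has been placed. */
    public void taskScheduled(String taskId) {
        rejectedBy.remove(taskId);
    }

    /** Number of tasks currently being tracked. */
    public int trackedTasks() {
        return rejectedBy.size();
    }
}
```

Even this simplified version has to decide what happens when trackers join or leave the cluster while a task is pending, which is part of why the simpler fix looked attractive.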

I hope that helps.
> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>                 Key: MAPREDUCE-2324
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2,
>            Reporter: Todd Lipcon
>            Assignee: Robert Joseph Evans
>             Fix For:
>         Attachments: MR-2324-disable-check-v2.patch, MR-2324-security-v1.txt, MR-2324-security-v2.txt,
MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
> If there's a reduce task that needs more disk space than is available on any mapred.local.dir
in the cluster, that task will stay pending forever. For example, we produced this in a QA
cluster by accidentally running terasort with one reducer - since no mapred.local.dir had
1T free, the job remained in pending state for several days. The reason for the "stuck" task
wasn't clear from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs and finds
that there isn't enough space.
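
The check proposed in the description can be sketched roughly as follows.  This is only an illustration under stated assumptions: ReduceSpaceCheck and canScheduleAnywhere are hypothetical names, and the free-space figures are passed in as a plain list rather than read from real TaskTracker status reports.

```java
import java.util.List;

/** Hypothetical helper for illustration only; not part of Hadoop. */
public class ReduceSpaceCheck {

    /**
     * Returns true if at least one tracker reports enough free local
     * disk for the estimated reduce input. A false result is the
     * condition under which the job would be failed instead of
     * leaving the task pending forever.
     */
    public static boolean canScheduleAnywhere(long estimatedReduceInputBytes,
                                              List<Long> freeBytesPerTracker) {
        for (long free : freeBytesPerTracker) {
            if (free >= estimatedReduceInputBytes) {
                return true;
            }
        }
        return false;
    }
}
```

The result is only as good as the size estimate it is given, so an inaccurate estimate could fail a job that would actually have fit.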


