hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere
Date Thu, 14 Jul 2011 20:01:04 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13065505#comment-13065505 ]

Robert Joseph Evans commented on MAPREDUCE-2324:

I don't believe that the fix I submitted is incomplete; the issue is that MRv2 does things
so very differently that we need to tackle the problem in a different way.  I am sure the patch
is not perfect, and I am very happy to see any better ideas/patches.  Also, I am getting noise
from my customers about this, so I would like to see a fix in a sustaining release.  It is
not a lot of noise, but I do have to at least try to get a fix in.

I do agree that having different configuration values is an issue that I would like to avoid,
but currently 0.23 has dropped mapreduce.reduce.input.limit altogether, along with who knows
what other configuration values.  I do not see any way to maintain mapreduce.reduce.input.limit
in MRv2.

I have started looking at the scheduler code in YARN, and this is just preliminary, but it looks
like what we want to do is to extend Resource to include disk space, not just RAM.  The NodeManager
can then also report back the amount of disk space that it has free, just like the TaskTracker
does.  Then, for reduce tasks, the MR Application Master can request the container based
on the estimated reduce input size. We can also put in a more generic resource-starvation
detection mechanism that would work for both RAM and disk.
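To make the idea concrete, here is a rough sketch of what a disk-aware resource and the starvation check could look like. All names here (DiskAwareResource, canFit, satisfiableAnywhere) are hypothetical illustrations, not the actual YARN Resource API:

```java
// Hypothetical sketch only -- class and method names are illustrative,
// not the real org.apache.hadoop.yarn.api.records.Resource API.
public class DiskAwareResource {
    private final long memoryMb;
    private final long diskMb;

    public DiskAwareResource(long memoryMb, long diskMb) {
        this.memoryMb = memoryMb;
        this.diskMb = diskMb;
    }

    /** True if this node's free resources can satisfy the request. */
    public boolean canFit(DiskAwareResource request) {
        return memoryMb >= request.memoryMb && diskMb >= request.diskMb;
    }

    /**
     * Generic starvation check: if no node in the cluster can ever
     * satisfy the request, the scheduler can fail the job fast instead
     * of leaving the task pending forever.
     */
    public static boolean satisfiableAnywhere(Iterable<DiskAwareResource> nodes,
                                              DiskAwareResource request) {
        for (DiskAwareResource node : nodes) {
            if (node.canFit(request)) {
                return true;
            }
        }
        return false;
    }
}
```

With something like this, a reduce container request sized from the estimated reduce input could be rejected up front when no NodeManager reports enough free disk, which is exactly the case the original bug describes.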

> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>                 Key: MAPREDUCE-2324
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2
>            Reporter: Todd Lipcon
>            Assignee: Robert Joseph Evans
>         Attachments: MR-2324-security-v1.txt
> If there's a reduce task that needs more disk space than is available on any mapred.local.dir
> in the cluster, that task will stay pending forever. For example, we produced this in a QA
> cluster by accidentally running terasort with one reducer - since no mapred.local.dir had
> 1T free, the job remained in pending state for several days. The reason for the "stuck" task
> wasn't clear from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs and finds
> that there isn't enough space.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

