hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere
Date Mon, 01 Aug 2011 19:31:10 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073677#comment-13073677 ]

Robert Joseph Evans commented on MAPREDUCE-2324:
------------------------------------------------

I have been able to run gridmix on a 10 node cluster, and everything looks stable.  I have not been able to run it on anything larger because the processes here are really not set up to do that very easily.  In the past the process has been to run gridmix at scale after the branch is in QA, not before, so the tools are not set up to deploy from a dev branch.  Plus, I have to get approval from lots of people to make that happen.  I am still trying to see if I can do it, but I am not very hopeful that it will happen any time soon.

> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>
>                 Key: MAPREDUCE-2324
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2, 0.20.205.0
>            Reporter: Todd Lipcon
>            Assignee: Robert Joseph Evans
>             Fix For: 0.20.205.0
>
>         Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, MR-2324-security-v3.patch, MR-2324-secutiry-just-log-v1.patch
>
>
> If there's a reduce task that needs more disk space than is available on any mapred.local.dir in the cluster, that task will stay pending forever. For example, we produced this in a QA cluster by accidentally running terasort with one reducer - since no mapred.local.dir had 1T free, the job remained in pending state for several days. The reason for the "stuck" task wasn't clear from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs and finds that there isn't enough space.
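
For illustration, here is a minimal sketch of the check the description proposes. The names (TrackerDiskInfo, schedulableAnywhere) are hypothetical, not the actual JobTracker API: if no TaskTracker's mapred.local.dir could ever hold the reduce's input, fail the job instead of leaving the task pending.

    // Hypothetical sketch only; TrackerDiskInfo and schedulableAnywhere are
    // illustrative names, not real JobTracker classes or methods.
    import java.util.Collection;

    public class ReduceSchedulabilityCheck {

        /** Free bytes a TaskTracker reports for its largest mapred.local.dir. */
        public interface TrackerDiskInfo {
            long getAvailableLocalDirSpace();
        }

        /**
         * Returns true if at least one tracker could ever host a reduce that
         * needs requiredBytes of local disk. If this is still false after every
         * tracker has been consulted, the job would be failed rather than left
         * in pending state indefinitely.
         */
        public static boolean schedulableAnywhere(
                Collection<? extends TrackerDiskInfo> trackers, long requiredBytes) {
            for (TrackerDiskInfo tracker : trackers) {
                if (tracker.getAvailableLocalDirSpace() >= requiredBytes) {
                    return true;
                }
            }
            return false; // no mapred.local.dir in the cluster is big enough
        }
    }

In the terasort example above, requiredBytes would be on the order of 1T, so the check would return false for every tracker and the job could fail quickly with a clear diagnostic instead of sitting in pending state for days.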

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira
