hadoop-mapreduce-issues mailing list archives

From "Robert Joseph Evans (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-2324) Job should fail if a reduce task can't be scheduled anywhere
Date Mon, 01 Aug 2011 14:18:09 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-2324?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13073550#comment-13073550 ]

Robert Joseph Evans commented on MAPREDUCE-2324:

I did initially look at trying to fix reduce.input.limit. Currently it is a value that someone
has to guess manually. What is more, this value is likely to need to change as the cluster
fills up with data, and as data is deleted from the cluster. If it is wrong, then either too
many jobs fail that would have succeeded, or some jobs, probably a very small number, will
starve and never finish.
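
For illustration only, here is a minimal sketch of what that manual guess looks like in client
code. It assumes the 0.20-era property name "mapreduce.reduce.input.limit"; the exact key and
default vary by version.

    import org.apache.hadoop.conf.Configuration;

    public class ReduceInputLimitGuess {
      public static void main(String[] args) {
        Configuration conf = new Configuration();
        // An admin's hand-picked cap on total reduce input, in bytes.
        // 100 GB here is an arbitrary guess, and it goes stale as the
        // cluster fills up or as data is deleted.
        conf.setLong("mapreduce.reduce.input.limit", 100L * 1024 * 1024 * 1024);
      }
    }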

To fix it, Hadoop would have to set reduce.input.limit dynamically, and the only way I can
think of to do that would be to gather statistics about all the nodes in the cluster and try
to predict how likely it is that this particular reduce will ever find the space it needs on
a node. I believe that we can compute the mean and an X% confidence interval for free disk
space on the cluster without too much difficulty, but I have my doubts that this will apply
to a small cluster. From what I have read, anything under 40 samples tends to be suspect, so
it might not work for a cluster under 40 nodes. I am also not sure how the statistics would
apply to this particular situation. Would we want to compute this based on a recent history
of the cluster, or just a snapshot of its current state? If there is history, how far back
would we want to go, and how would we handle some nodes heart-beating in more regularly than
others? I am not a statistician, and I could not find one to look over my work, so instead I
decided to take a bit more of a brute-force approach that I know would work.
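
To make the statistical idea concrete, here is a rough sketch of the kind of estimate meant
above. It is an illustration, not the attached patch: it assumes one free-space sample per
node, treats the samples as roughly normal, and the class and method names are made up.

    /**
     * Given one free-disk-space sample (in bytes) per node, decide whether
     * a reduce needing `required` bytes is ever likely to find a node that
     * can host it.
     */
    public class DiskSpaceEstimate {
      static boolean likelySchedulable(long[] freeBytes, long required, double z) {
        int n = freeBytes.length;
        if (n < 40) {
          // Under ~40 samples the normal approximation is suspect, so just
          // compare against the best node visible right now.
          long max = Long.MIN_VALUE;
          for (long f : freeBytes) max = Math.max(max, f);
          return required <= max;
        }
        double mean = 0.0;
        for (long f : freeBytes) mean += f;
        mean /= n;
        double var = 0.0;
        for (long f : freeBytes) var += (f - mean) * (f - mean);
        var /= (n - 1); // sample variance
        // Upper bound on plausible per-node free space; z = 1.96 gives
        // roughly 95% coverage under the normality assumption.
        double upper = mean + z * Math.sqrt(var);
        return required <= upper;
      }

      public static void main(String[] args) {
        // Toy data echoing the terasort report below: no node has 1 TB free.
        long[] samples = { 500L << 30, 120L << 30, 80L << 30 };
        System.out.println(likelySchedulable(samples, 1L << 40, 1.96)); // false
      }
    }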

If you know a statistician who could provide a robust solution to this problem, or at least
tell me what, if anything, I am doing wrong, then I would be very happy to implement it.

> Job should fail if a reduce task can't be scheduled anywhere
> ------------------------------------------------------------
>                 Key: MAPREDUCE-2324
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-2324
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 0.20.2,
>            Reporter: Todd Lipcon
>            Assignee: Robert Joseph Evans
>             Fix For:
>         Attachments: MR-2324-security-v1.txt, MR-2324-security-v2.txt, MR-2324-security-v3.patch,
> If there's a reduce task that needs more disk space than is available on any mapred.local.dir
in the cluster, that task will stay pending forever. For example, we produced this in a QA
cluster by accidentally running terasort with one reducer - since no mapred.local.dir had
1T free, the job remained in pending state for several days. The reason for the "stuck" task
wasn't clear from a user perspective until we looked at the JT logs.
> Probably better to just fail the job if a reduce task goes through all TTs and finds
that there isn't enough space.

This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

