hadoop-mapreduce-issues mailing list archives

From "Wangda Tan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6689) MapReduce job can infinitely increase number of reducer resource requests
Date Thu, 05 May 2016 18:18:12 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6689?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272774#comment-15272774 ]

Wangda Tan commented on MAPREDUCE-6689:

Thanks [~haibochen] for pointing out MAPREDUCE-6514.

MAPREDUCE-6514 is one of the causes of this problem, but it causes big trouble after MAPREDUCE-6302.

I discussed this offline with [~varun_saxena]: I will rebase and upload a patch to MAPREDUCE-6514
later. In this JIRA, I will fix the "cancel all, then add all" handling of reducer requests.

Application log is available at: https://www.dropbox.com/s/ckx1z993lt4ymh2/app.log.zip?dl=0.
(It is too large to be uploaded to JIRA)

> MapReduce job can infinitely increase number of reducer resource requests
> -------------------------------------------------------------------------
>                 Key: MAPREDUCE-6689
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6689
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>            Reporter: Wangda Tan
>            Assignee: Wangda Tan
>            Priority: Blocker
> We have seen this issue on one of our clusters: when running a terasort MapReduce job,
> some mappers failed after the reducers had started, and the MR AM then tried to preempt
> reducers to schedule these failed mappers.
> After that, the MR AM enters an infinite loop. On every RMContainerAllocator#heartbeat run:
> - In {{preemptReducesIfNeeded}}, it cancels all scheduled reducer requests (total scheduled
>   reducers = 1024).
> - Then, in {{scheduleReduces}}, it ramps up all reducers (total = 1024).
> As a result, the total number of requested containers increases by 1024 on every MR AM-RM
> heartbeat (1 sec per heartbeat). The AM was hung for 18+ hours, so we get 18 * 3600 * 1024
> ~ 66M+ requested containers on the RM side.
> This bug also triggered YARN-4844, which makes the RM stop scheduling anything.
> Thanks to [~sidharta-s] for helping with analysis. 
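
The growth described above can be checked with a small toy model. This is only a sketch of the arithmetic in the report, not Hadoop code: the function name simulate_rm_requested is hypothetical, and it assumes that the cancellations never reduce the RM-side count, so each heartbeat adds the full 1024 requests, as the description observes.

```python
HEARTBEAT_INTERVAL_SEC = 1   # from the report: 1 sec per heartbeat
SCHEDULED_REDUCERS = 1024    # from the report: total scheduled reducers


def simulate_rm_requested(heartbeats, reducers=SCHEDULED_REDUCERS):
    """Toy model: RM-side requested-container count after N heartbeats."""
    am_scheduled = reducers  # reducer requests the AM currently holds
    rm_requested = 0         # cumulative requests as counted on the RM side
    for _ in range(heartbeats):
        # Step 1 (preemptReducesIfNeeded): cancel all scheduled reducer requests.
        cancelled = am_scheduled
        am_scheduled = 0
        # Step 2 (scheduleReduces): ramp all reducers back up.
        am_scheduled += cancelled
        # Assumption: the cancel does not net out on the RM side, so the
        # re-added requests accumulate on every heartbeat.
        rm_requested += cancelled
    return rm_requested


hours_hung = 18
heartbeats = hours_hung * 3600 // HEARTBEAT_INTERVAL_SEC
total = simulate_rm_requested(heartbeats)
print(total)  # 66355200, i.e. the "~ 66M+" figure in the description
```

Under this model the AM-side count stays at 1024 while the RM-side count grows linearly with the number of heartbeats, which is why the hang duration directly determines the size of the backlog.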
