hadoop-mapreduce-issues mailing list archives

From "Anubhav Dhoot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6302) Incorrect headroom can lead to a deadlock between map and reduce allocations
Date Thu, 01 Oct 2015 21:35:28 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14940453#comment-14940453 ]

Anubhav Dhoot commented on MAPREDUCE-6302:

In the old code we do not preempt if either the headroom or the already-assigned maps are enough
to run a mapper, so the early out is consistent with the old preemption. But the new preemption
does not have to have the same conditions.
Since we are using it as a way to break out of deadlocks, I would think preempting irrespective
of how many mappers are running is
(a) safer and simpler to reason about, since it is purely time based - we do not have to second-guess
whether we are missing some other cause of deadlock besides incorrect headroom.
(b) better in terms of overall throughput for the cases Jason mentioned.
A large timeout is the safety lever for controlling how aggressive the preemption is.
Factoring in slow start in a subsequent jira seems like a good idea to me. I can think of
reasons not to factor it in and to leave slow start only as a heuristic for starting reducers.
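The purely time-based escape hatch argued for above, preempting on timeout alone and ignoring both headroom and the number of running mappers, could be sketched roughly like this (class and method names are hypothetical illustrations, not the actual MAPREDUCE-6302 patch):

```java
// Hypothetical sketch of purely time-based reducer preemption as
// discussed in the comment above; names and structure are illustrative.
public class TimeBasedPreemption {

    /**
     * Decide whether to preempt a reducer. Unlike the old logic, this
     * ignores headroom (which may be wrong) and the number of running
     * maps entirely: if a map request has been starved longer than the
     * (large) timeout, preempt.
     *
     * @param mapStarvedSinceMs   wall-clock time the map request began
     *                            waiting, or -1 if no map is waiting
     * @param nowMs               current wall-clock time
     * @param preemptionTimeoutMs the safety-lever timeout
     */
    public static boolean shouldPreemptReducer(long mapStarvedSinceMs,
                                               long nowMs,
                                               long preemptionTimeoutMs) {
        if (mapStarvedSinceMs < 0) {
            return false; // no map is waiting, nothing to unblock
        }
        // Purely time based: no second-guessing whether headroom is
        // correct or how many mappers happen to be running.
        return nowMs - mapStarvedSinceMs >= preemptionTimeoutMs;
    }

    public static void main(String[] args) {
        // Map starved for 50s against a 30s timeout: preempt.
        System.out.println(shouldPreemptReducer(10_000L, 60_000L, 30_000L));
        // No map waiting: never preempt.
        System.out.println(shouldPreemptReducer(-1L, 60_000L, 30_000L));
    }
}
```

The single timeout parameter is the only knob: making it large keeps the preemption conservative, exactly as the comment suggests.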

> Incorrect headroom can lead to a deadlock between map and reduce allocations 
> -----------------------------------------------------------------------------
>                 Key: MAPREDUCE-6302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: AM_log_head100000.txt.gz, AM_log_tail100000.txt.gz, log.txt, mr-6302-1.patch,
mr-6302-2.patch, mr-6302-3.patch, mr-6302-4.patch, mr-6302-prelim.patch, queue_with_max163cores.png,
queue_with_max263cores.png, queue_with_max333cores.png
> I submit a big job, which has 500 maps and 350 reduces, to a queue (fair scheduler) with
a 300-core maximum. When the big MapReduce job has finished running 100% of its maps, the
300 reduces occupy all 300 cores in the queue. Then a map fails and is retried, waiting for
a core, while the 300 reduces are waiting for the failed map to finish. So a deadlock occurs.
As a result, the job is blocked, and later jobs in the queue cannot run because no cores are
available in the queue.
> I think there is a similar issue for the memory of a queue.
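The scenario in the description reduces to simple resource arithmetic. A minimal model of how the queue ends up wedged (illustrative names only, not actual Hadoop classes):

```java
// Minimal model of the deadlock described above: a 300-core queue,
// every core held by a reducer, one retried map waiting forever.
// All names are illustrative, not real Hadoop scheduler classes.
public class QueueDeadlockModel {

    static boolean isDeadlocked(int queueMaxCores,
                                int coresHeldByReducers,
                                int pendingMapRequests) {
        int headroom = queueMaxCores - coresHeldByReducers;
        // A pending map that cannot be placed while every core is held
        // by reducers that in turn wait on that map: no one can progress.
        return pendingMapRequests > 0 && headroom <= 0;
    }

    public static void main(String[] args) {
        // 300 reduces occupy all 300 cores; one failed map retries: wedged.
        System.out.println(isDeadlocked(300, 300, 1));
        // If one reducer were preempted, the retried map could be placed.
        System.out.println(isDeadlocked(300, 299, 1));
    }
}
```

This is why time-based reducer preemption breaks the cycle: freeing even one reducer's core restores positive headroom for the retried map.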

This message was sent by Atlassian JIRA
