hadoop-mapreduce-issues mailing list archives

From "Benjamin Tortorelli (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (MAPREDUCE-6302) deadlock in a job between map and reduce cores allocation
Date Tue, 28 Apr 2015 18:52:09 GMT

    [ https://issues.apache.org/jira/browse/MAPREDUCE-6302?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14517651#comment-14517651
] 

Benjamin Tortorelli commented on MAPREDUCE-6302:
------------------------------------------------

We're seeing this issue as well, although our job is map-only. Some runs seem to hang and
have to be killed; others complete but take a very long time. This occurs with varying
numbers of workers and amounts of memory. The YARN logs for the job always show one worker
with an extremely large log file compared to the other workers (50 MB vs. 500 KB).

> deadlock in a job between map and reduce cores allocation 
> ----------------------------------------------------------
>
>                 Key: MAPREDUCE-6302
>                 URL: https://issues.apache.org/jira/browse/MAPREDUCE-6302
>             Project: Hadoop Map/Reduce
>          Issue Type: Bug
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>            Assignee: Karthik Kambatla
>            Priority: Critical
>         Attachments: AM_log_head100000.txt.gz, AM_log_tail100000.txt.gz, log.txt, mr-6302-prelim.patch,
>                      queue_with_max163cores.png, queue_with_max263cores.png, queue_with_max333cores.png
>
>
> I submitted a big job, with 500 maps and 350 reduces, to a queue (FairScheduler) with
> 300 max cores. When the job's maps reach 100%, 300 reduces have occupied all
> 300 max cores in the queue. Then a map fails and is retried, waiting for a core, while the
> 300 reduces are waiting for the failed map to finish. So a deadlock occurs. As a result, the job
> is blocked, and later jobs in the queue cannot run because no cores are available in the queue.
> I think there is a similar issue for the memory of a queue.
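
The scenario in the quoted description can be sketched as a small toy model. This is
only an illustration under stated assumptions, not the scheduler's or the ApplicationMaster's
actual logic: `QueueState` and `is_deadlocked` are hypothetical names, each task is assumed
to hold exactly one core, and reduces are assumed never to be preempted.

```python
from dataclasses import dataclass


@dataclass
class QueueState:
    """Hypothetical snapshot of one job in a queue with a core cap."""
    max_cores: int        # queue limit, e.g. 300 in the report
    running_reduces: int  # reduces, each assumed to hold one core
    running_maps: int     # maps, each assumed to hold one core
    pending_maps: int     # maps (e.g. a retried failure) waiting for a core

    @property
    def free_cores(self) -> int:
        return self.max_cores - self.running_reduces - self.running_maps


def is_deadlocked(state: QueueState) -> bool:
    """True when pending maps can never get a core unless a reduce is preempted.

    Reduces cannot finish until every map has finished, so they never release
    their cores on their own. If no map is running (so nothing will free a core)
    and no core is free, the pending maps wait forever.
    """
    return (
        state.pending_maps > 0
        and state.running_maps == 0
        and state.free_cores == 0
        and state.running_reduces > 0
    )


# Scenario from the report: 300 reduces fill the queue, then one map fails and is retried.
stuck = QueueState(max_cores=300, running_reduces=300, running_maps=0, pending_maps=1)
# Same job with one core kept back from the reduces (an assumed variation, for contrast).
ok = QueueState(max_cores=300, running_reduces=299, running_maps=0, pending_maps=1)

print(is_deadlocked(stuck))
print(is_deadlocked(ok))
```

In this model the first state is stuck because every core is held by a reduce that is itself
waiting on the retried map; the second state is not, since one core remains for the map.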



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
