hadoop-yarn-issues mailing list archives

From "Rohith (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3416) deadlock in a job between map and reduce cores allocation
Date Wed, 01 Apr 2015 05:12:53 GMT

    [ https://issues.apache.org/jira/browse/YARN-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14389999#comment-14389999 ]

Rohith commented on YARN-3416:

bq. there are only 4 NodeManagers in cluster, so it is possible all 4 NodeManagers are in
the blacklist
In YARN-1680, not all the NMs were in the blacklist; only one NM was blacklisted. This scenario
can also happen in larger clusters. I have observed a similar issue in a 25-node cluster.
     The suspected cause is the same as in YARN-1680: in your cluster, 300 reducers are
running and occupy all 300 cores, which means there is no room left to run mappers. If at this
point any reducer fails to fetch a mapper's output (for any reason), that map is marked as failed
and its node is blacklisted. The blacklisted nodes still have resources that could run some of
the containers. In MR, reducer preemption is decided by several factors, one of which
is headroom. But the RM computes the headroom it sends including blacklisted nodes, which causes MR
not to trigger reducer preemption. This is only a suspicion; there could also be a real hidden bug. If
you provide the full AM logs, I can help you analyze whether it is the same as YARN-1680 or not.
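The headroom reasoning above can be sketched as follows. This is a hypothetical simplification, not the actual Hadoop code: `shouldPreemptReducer`, its parameters, and the numbers are illustrative assumptions standing in for the MR AM's preemption check.

```java
// Hypothetical sketch (NOT the real MR AM code): why the AM may skip reducer
// preemption when the RM's reported headroom still counts blacklisted nodes.
public class HeadroomSketch {
    /** Decide whether to preempt a reducer so a pending map can run. */
    static boolean shouldPreemptReducer(int pendingMaps, long headroomMb, long mapMb) {
        // Simplified rule: only preempt when the reported headroom
        // cannot fit a single map container.
        return pendingMaps > 0 && headroomMb < mapMb;
    }

    public static void main(String[] args) {
        long mapMb = 1024;
        // Usable free capacity is 0 MB (reducers hold every core), but the
        // RM computes headroom over ALL nodes, including a blacklisted node
        // with 2048 MB free that this AM cannot actually use.
        System.out.println(shouldPreemptReducer(1, 2048, mapMb)); // false: no preemption, job hangs
        // If blacklisted capacity were excluded, preemption would fire:
        System.out.println(shouldPreemptReducer(1, 0, mapMb));    // true
    }
}
```

Under this sketch, the inflated headroom makes the AM believe a map can still be scheduled, so it never preempts a reducer, matching the hang described above.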

> deadlock in a job between map and reduce cores allocation 
> ----------------------------------------------------------
>                 Key: YARN-3416
>                 URL: https://issues.apache.org/jira/browse/YARN-3416
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>            Priority: Critical
> I submit a big job, which has 500 maps and 350 reduces, to a queue (fairscheduler) with
> a cap of 300 cores. When the job has run 100% of its maps, the 300 running reduces have
> occupied all 300 cores in the queue. Then a map fails and retries, waiting for a core, while
> the 300 reduces are waiting for the failed map to finish, so a deadlock occurs. As a result,
> the job is blocked, and later jobs in the queue cannot run because no cores are available
> in the queue.
> I think there is a similar issue for the memory of a queue.
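The reported deadlock is a simple counting argument, sketched below. The class and method names are hypothetical; the numbers (300-core queue cap, 300 reducers each holding one core) are taken from the report.

```java
// Hypothetical sketch of the reported hang: all 300 queue cores are held by
// reducers, so a retried map can never get a core, while the reducers can
// never finish without that map's output.
public class QueueDeadlockSketch {
    /** A core-starved map retry plus reducers blocked on its output = deadlock. */
    static boolean isDeadlocked(int queueMaxCores, int coresHeldByReducers, int pendingMapRetries) {
        int coresFree = queueMaxCores - coresHeldByReducers;
        boolean mapRetryCanRun = coresFree >= 1;
        // Reducers are waiting on the failed map's output, so they release nothing.
        return pendingMapRetries > 0 && !mapRetryCanRun;
    }

    public static void main(String[] args) {
        System.out.println(isDeadlocked(300, 300, 1)); // true: the reported scenario
        System.out.println(isDeadlocked(300, 299, 1)); // false: one free core lets the retry run
    }
}
```

Reserving even one core of the queue for map attempts (or preempting a reducer, as discussed above) breaks the cycle.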

This message was sent by Atlassian JIRA
