Date: Mon, 30 Mar 2015 06:03:53 +0000 (UTC)
From: "mai shurong (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-3416) deadlock in a job between map and reduce cores allocation

    [ https://issues.apache.org/jira/browse/YARN-3416?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14386244#comment-14386244 ]

mai shurong commented on YARN-3416:
-----------------------------------

In YARN-1680, there are only 4 NodeManagers in the cluster, so it is possible that all 4 NodeManagers are in the blacklist. But in my case, there are more than 50 NodeManagers and over 1000 vcores in the cluster, so it is very unlikely that all NodeManagers in the cluster are in the blacklist.

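To make the scenario in the quoted issue description concrete, here is a minimal toy model of it in Python. It is only a sketch under stated assumptions, not YARN or FairScheduler code: `QueueState`, `is_stuck`, and `release_one_reduce` are hypothetical names, every task is assumed to hold exactly one vcore, reduces are assumed unable to finish until every map has finished, and nothing is preempted unless `release_one_reduce` is called explicitly.

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class QueueState:
    """Toy snapshot of one queue (hypothetical model, not a YARN class)."""
    max_vcores: int       # queue cap, 300 in the report
    running_reduces: int  # reduces currently holding one vcore each
    running_maps: int     # maps currently holding one vcore each
    pending_maps: int     # maps waiting for a vcore, e.g. a retried map

    def free_vcores(self) -> int:
        return self.max_vcores - self.running_reduces - self.running_maps


def is_stuck(state: QueueState) -> bool:
    """Return True when no task in the toy model can ever make progress."""
    if state.running_maps + state.pending_maps == 0:
        return False  # all maps are done, so the reduces can finish
    if state.running_maps > 0:
        return False  # a running map will finish and free its vcore
    # Only pending maps remain: they can start only if a vcore is free.
    return state.free_vcores() == 0


def release_one_reduce(state: QueueState) -> QueueState:
    """Model one reduce giving up its vcore (an assumed remedy, for illustration)."""
    return replace(state, running_reduces=state.running_reduces - 1)


# The situation from the report: 300 reduces fill the 300-vcore queue
# while one retried map waits for a vcore.
reported = QueueState(max_vcores=300, running_reduces=300,
                      running_maps=0, pending_maps=1)
print(is_stuck(reported))
print(is_stuck(release_one_reduce(reported)))
```

In this model the reported state is stuck because the only task that could unblock the reduces is the one task that cannot obtain a vcore; freeing a single reduce's vcore is enough to let the retried map run.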
> deadlock in a job between map and reduce cores allocation
> ----------------------------------------------------------
>
>                 Key: YARN-3416
>                 URL: https://issues.apache.org/jira/browse/YARN-3416
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.6.0
>            Reporter: mai shurong
>
> I submit a big job, which has 500 maps and 350 reduces, to a queue (fairscheduler) with 300 max cores. When the big mapreduce job reaches 100% maps, the 300 reduces have occupied all 300 max cores in the queue. Then a map fails and is retried, waiting for a core, while the 300 reduces are waiting for the failed map to finish. So a deadlock occurs. As a result, the job is blocked, and later jobs in the queue cannot run because there are no available cores in the queue.
> I think there is a similar issue for the memory of a queue.

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)