Date: Sat, 6 Jun 2015 18:53:01 +0000 (UTC)
From: "zhihai xu (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Updated] (YARN-3655) FairScheduler: potential livelock due to maxAMShare limitation and container reservation

     [ https://issues.apache.org/jira/browse/YARN-3655?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

zhihai xu updated YARN-3655:
----------------------------
    Attachment: YARN-3655.004.patch

> FairScheduler: potential livelock due to maxAMShare limitation and container reservation
> -----------------------------------------------------------------------------------------
>
>                 Key: YARN-3655
>                 URL: https://issues.apache.org/jira/browse/YARN-3655
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: fairscheduler
>    Affects Versions: 2.7.0
>            Reporter: zhihai xu
>            Assignee: zhihai xu
>         Attachments:
>                      YARN-3655.000.patch, YARN-3655.001.patch, YARN-3655.002.patch, YARN-3655.003.patch, YARN-3655.004.patch
>
>
> FairScheduler: potential livelock due to maxAMShare limitation and container reservation.
> If a node is reserved by an application, no other application can be assigned a new container on that node until the reserving application either is assigned a new container on the node or releases its reserved container.
> The problem is that if an application calls assignReservedContainer and fails to get a new container because of the maxAMShare limitation, it blocks all other applications from using the nodes it has reserved. If none of the other running applications can release their AM containers because they are blocked by these reserved containers, a livelock can occur.
> The following code in FSAppAttempt#assignContainer can cause this potential livelock:
> {code}
>     // Check the AM resource usage for the leaf queue
>     if (!isAmRunning() && !getUnmanagedAM()) {
>       List<ResourceRequest> ask = appSchedulingInfo.getAllResourceRequests();
>       if (ask.isEmpty() || !getQueue().canRunAppAM(
>           ask.get(0).getCapability())) {
>         if (LOG.isDebugEnabled()) {
>           LOG.debug("Skipping allocation because maxAMShare limit would " +
>               "be exceeded");
>         }
>         return Resources.none();
>       }
>     }
> {code}
> To fix this issue, we can unreserve the node if the AM container cannot be allocated on it because of the maxAMShare limitation and the node is reserved by the application.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
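
The proposed fix (unreserve the node when the AM container cannot be allocated because of the maxAMShare limit) can be illustrated with a toy model. The sketch below is not the FairScheduler code and not the attached patch. The classes Node, Queue and App, the integer resource sizes, and the unreserveOnAmLimit flag are all hypothetical and exist only to show how the reservation stops blocking other applications once it is dropped.

```java
/** Toy model of the reservation problem; none of these types are Hadoop classes. */
final class MaxAmShareSketch {

    /** A node that can be reserved by at most one application. */
    static final class Node {
        App reservedBy; // null when the node is not reserved
    }

    /** A leaf queue that caps the total resource used by AM containers. */
    static final class Queue {
        final int maxAmResource;
        int amResourceUsage;

        Queue(int maxAmResource, int amResourceUsage) {
            this.maxAmResource = maxAmResource;
            this.amResourceUsage = amResourceUsage;
        }

        boolean canRunAppAM(int amResource) {
            return amResourceUsage + amResource <= maxAmResource;
        }
    }

    /** An application; while its AM is not running, its next request is the AM container. */
    static final class App {
        final Queue queue;
        final int requestSize;
        boolean amRunning;

        App(Queue queue, int requestSize, boolean amRunning) {
            this.queue = queue;
            this.requestSize = requestSize;
            this.amRunning = amRunning;
        }
    }

    /** Returns the amount of resource assigned on the node, or 0 if nothing was assigned. */
    static int assignContainer(App app, Node node, boolean unreserveOnAmLimit) {
        if (node.reservedBy != null && node.reservedBy != app) {
            return 0; // node is reserved by another application
        }
        if (!app.amRunning && !app.queue.canRunAppAM(app.requestSize)) {
            // maxAMShare would be exceeded. With the (hypothetical) flag set,
            // drop our own reservation so that other applications can use the node.
            if (unreserveOnAmLimit && node.reservedBy == app) {
                node.reservedBy = null;
            }
            return 0;
        }
        node.reservedBy = null; // any reservation held by this app is fulfilled
        if (!app.amRunning) {
            app.amRunning = true;
            app.queue.amResourceUsage += app.requestSize;
        }
        return app.requestSize;
    }

    /** One scheduling pass: the reserving app tries first, then a running app tries. */
    static boolean otherAppGetsNode(boolean unreserveOnAmLimit) {
        Queue queue = new Queue(4, 4);            // AM share already exhausted
        App pending = new App(queue, 2, false);   // AM not started, holds the reservation
        App running = new App(queue, 2, true);    // AM running, needs a task container
        Node node = new Node();
        node.reservedBy = pending;

        assignContainer(pending, node, unreserveOnAmLimit);
        return assignContainer(running, node, unreserveOnAmLimit) > 0;
    }

    public static void main(String[] args) {
        System.out.println("without unreserve: " + otherAppGetsNode(false)); // false
        System.out.println("with unreserve:    " + otherAppGetsNode(true));  // true
    }
}
```

In this model the running application is starved only while the pending application keeps a reservation it cannot use. Whether the real scheduler should unreserve at exactly this point is the subject of the attached patches.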