From: "Manikandan R (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Thu, 29 Mar 2018 11:19:00 +0000 (UTC)
Subject: [jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

    [ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16418782#comment-16418782 ]

Manikandan R commented on YARN-4606:
------------------------------------

[~eepayne] Thanks for your detailed explanation. Sorry for the delay.

{quote}In this scenario, User2 wants to start App2 but User1 is consuming all resources in the queue with App1. When App1 releases a resource, however, it is not given to App2. The resource is given back to App1, which brings its Pending value down to 19. This is incorrect behavior since Queue1 has room for 2 AMs.{quote}

I was trying to understand this behaviour in the current code (without my patch) and found that the AM container is allocated to App2 only after App1 completes, when the cluster is running at full capacity. In my single-node pseudo-distributed setup, the total cluster resource is 8192 MB and 8 vcores, with a single queue (default) at 100% capacity; the max AM resource is 2048 MB and 2 vcores, since max-am-resource-percent is 0.2. I submitted an app (say App1) through Distributed Shell with num_containers set to 20. While App1 was running and had around 15 pending containers, I submitted a second app (say App2) with num_containers set to 10. The AM container for App2 was allocated only after App1 completed, which is not in line with your earlier comments. Am I missing anything here?
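To sanity-check my reading of the numbers above, here is a rough standalone sketch of the AM-limit arithmetic as I understand it. The minimum-allocation size and the assumption that each AM container is normalized to one minimum allocation are mine, not taken from the setup, so please correct me if they are off:

{code:java}
// Simplified sketch of the capacity scheduler's AM-resource-limit arithmetic;
// this is NOT the actual LeafQueue code, just the math as I understand it.
public class AmLimitSketch {
    public static void main(String[] args) {
        int clusterMemMb = 8192;     // total cluster memory in my pseudo setup
        double maxAmPercent = 0.2;   // yarn.scheduler.capacity.maximum-am-resource-percent
        int minAllocMb = 1024;       // assumed yarn.scheduler.minimum-allocation-mb

        // Raw limit: 8192 * 0.2 = 1638.4 MB. The scheduler normalizes this up
        // to a multiple of the minimum allocation, which would explain the
        // 2048 MB "Max AM Resources" I see in the UI.
        double raw = clusterMemMb * maxAmPercent;
        int amLimitMb = (int) (Math.ceil(raw / minAllocMb) * minAllocMb);
        System.out.println("AM resource limit = " + amLimitMb + " MB");

        // If each AM is normalized to 1024 MB, two AMs fit under the limit,
        // so App2's AM should at least be *activated*. Whether its AM
        // container is actually *allocated* still needs a node with free
        // capacity, which App1 keeps consuming while the cluster is full.
        int amContainerMb = 1024;    // assumed per-AM container size
        System.out.println("AMs that fit under the limit = " + (amLimitMb / amContainerMb));
    }
}
{code}

If that activation-vs-allocation distinction is right, the AM limit itself may not be what blocks App2's AM in my test; the lack of free capacity while App1 holds the cluster would be.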
{quote}However, I'm not sure of the best way to get the values for a queue's Used AM Resources and Max AM Resources from this context. Those may be capacity scheduler-specific values.{quote}

Yes. But I do see some equivalents available in {{FSQueueMetrics}}.


> CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Wangda Tan
>            Priority: Critical
>         Attachments: YARN-4606.1.poc.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are pending (caused by max-am-percent, etc.), ActiveUsersManager still considers that user an active user. This could lead to starvation of active applications, for example:
> - App1 (belongs to user1) and App2 (belongs to user2) are active; App3 (belongs to user3) and App4 (belongs to user4) are pending.
> - ActiveUsersManager returns #active-users=4.
> - However, only two users (user1/user2) are able to allocate new resources, so the computed user-limit-resource could be lower than expected (see the sketch below).
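To make the starvation effect in the description concrete, here is a simplified sketch of the user-limit division. The even split is a simplification on my part; the real computation also involves minimum-user-limit-percent and user-limit-factor:

{code:java}
// Simplified sketch (not the real ActiveUsersManager/LeafQueue code) of how
// counting pending-only users as active shrinks the per-user limit.
public class UserLimitSketch {
    // Simplification: divide the queue's resource evenly among active users.
    static int userLimitMb(int queueResourceMb, int activeUsers) {
        return queueResourceMb / activeUsers;
    }

    public static void main(String[] args) {
        int queueMb = 8192;  // example queue resource

        // ActiveUsersManager counts all four users, even though user3/user4
        // only have pending apps (blocked by max-am-percent) and cannot
        // consume anything.
        System.out.println("limit with 4 counted users: "
                + userLimitMb(queueMb, 4) + " MB");   // 2048 MB each

        // Only user1/user2 can actually allocate, so the intended limit is:
        System.out.println("limit with 2 truly active users: "
                + userLimitMb(queueMb, 2) + " MB");   // 4096 MB each
    }
}
{code}

If I read the description right, user1 and user2 end up capped well below what they could otherwise use, even though their applications still have pending requests.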