Date: Tue, 17 Jul 2018 22:22:00 +0000 (UTC)
From: "Eric Payne (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-4606) CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps

    [ https://issues.apache.org/jira/browse/YARN-4606?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16547161#comment-16547161 ]

Eric Payne commented on YARN-4606:
----------------------------------

Thank you, [~manirajv06@gmail.com], for the latest patch. The code changes look good. However, I have a couple of points about the tests.

- I have a general concern that these tests do not test the fix for the starvation problem outlined in the description of this JIRA. I'm trying to determine whether there is a clean way to unit test that use case.
- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}}: I am concerned about new tests that take longer than necessary, because the unit tests keep taking longer and longer to run. I think the following changes would reduce this test's run time (in my build environment) from 1min 17sec to 24sec.
-- In the following code, the sleep(5000) outside of the for loop is not necessary.
-- In the following code, the sleep(5000) inside of the for loop could be cut down to sleep(500).
{code:title=TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps}
    Thread.sleep(5000);

    //Triggering this event so that user limit computation can
    //happen again
    for (int i = 0; i < 10; i++) {
      cs.handle(new NodeUpdateSchedulerEvent(rmNode1));
      Thread.sleep(5000);
    }
{code}
- {{TestCapacityScheduler#testMoveAppWithActiveUsersWithOnlyPendingApps1}}: I don't think this test is necessary. It takes more than 1min 20sec to run in my build environment, and as far as I can tell, it verifies that the active users count is never updated after a move if node heartbeats are not received. However, in a running YARN installation, node heartbeats are received every second (by default). Unless I'm missing something, this isn't a use case that one would encounter in a running Hadoop system.
- {{TestCapacityScheduler#setupQueueConfigurationForActiveUsersChecks}}: The parameters to {{conf.setUserLimitFactor(...)}} don't need to be 100.0f. User limit factor can be thought of as the multiplier for the amount of a queue that one user can consume. So, if the user limit factor is 1.0f, one user can use up to the full capacity of the queue. If it is 2.0f, one user can use up to twice the capacity of the queue, and so forth. Since these queues have a capacity of 50%, I would set this to 2.0f.
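As a rough illustration of the user limit factor arithmetic above, here is a minimal sketch. It is not CapacityScheduler code: the class and method names are hypothetical, it assumes the queue's maximum capacity is 100% of the cluster, and it ignores the other inputs to the real user limit computation (such as the number of active users).

```java
public class UserLimitFactorSketch {

    /**
     * Simplified ceiling on what a single user may consume, expressed as a
     * percentage of the cluster. Hypothetical helper for illustration only;
     * the real scheduler computation involves more inputs.
     */
    static double maxUserSharePercent(double queueCapacityPercent,
                                      double userLimitFactor,
                                      double queueMaxCapacityPercent) {
        // The factor multiplies the queue's configured capacity, but a user
        // can never exceed the queue's maximum capacity.
        return Math.min(queueCapacityPercent * userLimitFactor,
                        queueMaxCapacityPercent);
    }

    public static void main(String[] args) {
        double queueCapacity = 50.0;  // the test queues have 50% capacity
        double maxCapacity = 100.0;   // assumption: queue may grow to 100%

        for (double factor : new double[] {1.0, 2.0, 100.0}) {
            System.out.println("userLimitFactor=" + factor + " -> ceiling "
                + maxUserSharePercent(queueCapacity, factor, maxCapacity)
                + "% of cluster");
        }
    }
}
```

Under these assumptions, 2.0f and 100.0f give the same ceiling for a 50% queue, which is why the smaller value is enough for the test setup.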

> CapacityScheduler: applications could get starved because computation of #activeUsers considers pending apps
> -------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-4606
>                 URL: https://issues.apache.org/jira/browse/YARN-4606
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, capacityscheduler
>    Affects Versions: 2.8.0, 2.7.1
>            Reporter: Karam Singh
>            Assignee: Manikandan R
>            Priority: Critical
>         Attachments: YARN-4606.001.patch, YARN-4606.002.patch, YARN-4606.003.patch, YARN-4606.004.patch, YARN-4606.005.patch, YARN-4606.006.patch, YARN-4606.1.poc.patch, YARN-4606.POC.2.patch, YARN-4606.POC.3.patch, YARN-4606.POC.patch
>
>
> Currently, if all applications belonging to the same user in a LeafQueue are pending (caused by max-am-percent, etc.), ActiveUsersManager still considers that user an active user. This could lead to starvation of active applications, for example:
> - App1 (belongs to user1) and app2 (belongs to user2) are active; app3 (belongs to user3) and app4 (belongs to user4) are pending
> - ActiveUsersManager returns #active-users=4
> - However, only two users (user1/user2) are able to allocate new resources, so the computed user-limit-resource could be lower than expected.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org