Date: Tue, 25 Oct 2016 04:28:58 +0000 (UTC)
From: "Sunil G (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-5773) RM recovery too slow due to LeafQueue#activateApplication()

    [ https://issues.apache.org/jira/browse/YARN-5773?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15604144#comment-15604144 ]

Sunil G commented on YARN-5773:
-------------------------------

*Issues in Recovery of apps:*
1. activateApplications works under a write lock.
2. If one application is found to exceed the AM resource limit, we do not break out of the loop; we continue and go through all the remaining apps in pendingOrderingPolicy. We may need to iterate over all apps because apps can belong to different partitions, and pendingOrderingPolicy does not order apps by partition.
3. As mentioned by [~bibinchundatt], each time an app fails to be activated because it hits the upper bound of the resource limit, one INFO log is emitted. During recovery, this is costly.

[~leftnoteasy] and [~rohithsharma]
bq. If a given app's AM resource amount > AM headroom, should we skip the AM and activate the following app whose AM resource amount <= AM headroom?
bq. But one point to be considered is that headroom changes with each node registration. So, user headroom changes as new nodes register. This needs to be taken care of.

Currently activateApplications is invoked whenever there is a change in cluster resource. So any change in cluster resource will ensure a call to activateApplications, and we can recalculate this headroom.

I am not very sure about the suggested map. Will this check come before the existing AM resource percentage check for queue/partition (not user based)? Or are we replacing these checks?

> RM recovery too slow due to LeafQueue#activateApplication()
> -----------------------------------------------------------
>
>                 Key: YARN-5773
>                 URL: https://issues.apache.org/jira/browse/YARN-5773
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>            Priority: Critical
>         Attachments: YARN-5773.0001.patch, YARN-5773.0002.patch
>
>
> # Submit 10K applications to the default queue.
> # All applications are in accepted state.
> # Now restart the resourcemanager.
> For each application recovery {{LeafQueue#activateApplications()}} is invoked, resulting in the AM limit check being done even before the NodeManagers have registered.
> The total number of iterations for N applications is about {{N(N+1)/2}}; for {{10K}} applications that is about {{50000000}} iterations, causing the RM to take more than 10 min to become active.
> Since NM resources are not yet added during recovery, we should skip {{activateApplications()}}.
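
The {{N(N+1)/2}} estimate in the quoted description can be checked with a small stand-alone sketch. It assumes a simplified model: apps are recovered one at a time, each recovery triggers one full scan of the pending list, and no app is activated during the scan because no NodeManager has registered yet. The class and method names (RecoveryIterationSketch, countIterations, closedForm) are hypothetical and are not part of LeafQueue.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical model of the recovery scan cost; not the real LeafQueue code. */
final class RecoveryIterationSketch {

    /** Counts pending-app visits when every recovered app triggers a full scan. */
    static long countIterations(int apps) {
        List<Integer> pending = new ArrayList<>();
        long visited = 0;
        for (int app = 0; app < apps; app++) {
            // The recovered app joins the pending list.
            pending.add(app);
            // Stand-in for activateApplications(): assumed to activate nothing
            // (no NM resources yet), so the scan visits every pending app.
            for (int ignored : pending) {
                visited++;
            }
        }
        return visited;
    }

    /** Closed form of 1 + 2 + ... + N. */
    static long closedForm(int apps) {
        return (long) apps * (apps + 1) / 2;
    }

    public static void main(String[] args) {
        int apps = 10_000;
        System.out.println("simulated = " + countIterations(apps));
        System.out.println("N(N+1)/2  = " + closedForm(apps));
    }
}
```

For 10K apps both values come out at 50005000, which is consistent with the "about 50000000" figure in the description. Under the same model, skipping the scan while recovery is in progress would leave a single pass over the N pending apps once resources are available.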