From yarn-issues-return-156622-apmail-hadoop-yarn-issues-archive=hadoop.apache.org@hadoop.apache.org Tue Oct 30 07:57:02 2018
Date: Tue, 30 Oct 2018 07:57:00 +0000 (UTC)
From: "Tao Yang (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Commented] (YARN-8958) Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8

    [ https://issues.apache.org/jira/browse/YARN-8958?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16668305#comment-16668305 ]

Tao Yang commented on YARN-8958:
--------------------------------

Attached v1 patch for review.
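For reviewers, here is a minimal standalone sketch of the failure mode described in the quoted issue below. It is plain Java over a TreeSet, not the actual YARN classes or the attached patch, and "app1" is a hypothetical stand-in for a schedulable entity: an unconditional remove-then-add reorder re-inserts an entity even when it was already removed, while a guarded reorder of the shape proposed in the patch re-inserts only entities that were actually present.

{code:java}
import java.util.TreeSet;

public class ReorderLeakSketch {
  public static void main(String[] args) {
    TreeSet<String> schedulableEntities = new TreeSet<>();
    schedulableEntities.add("app1");    // app attempt added
    schedulableEntities.remove("app1"); // app attempt removed

    // Buggy reorder: remove() returns false because app1 is absent,
    // but the unconditional add() resurrects it anyway -> the leak.
    schedulableEntities.remove("app1");
    schedulableEntities.add("app1");
    System.out.println(schedulableEntities.contains("app1")); // true (leaked)

    // Guarded reorder (the shape of the fix): re-add only if the
    // entity was actually present before the reorder.
    schedulableEntities.clear();
    boolean exists = schedulableEntities.remove("app1");
    if (exists) {
      schedulableEntities.add("app1");
    }
    System.out.println(schedulableEntities.contains("app1")); // false (no leak)
  }
}
{code}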
> Schedulable entities leak in fair ordering policy when recovering containers between remove app attempt and remove app
> ----------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-8958
>                 URL: https://issues.apache.org/jira/browse/YARN-8958
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacityscheduler
>    Affects Versions: 3.2.1
>            Reporter: Tao Yang
>            Assignee: Tao Yang
>            Priority: Major
>         Attachments: YARN-8958.001.patch
>
>
> We found an NPE in ClientRMService#getApplications when querying apps with a specified queue. The cause is an app that can no longer be found via RMContextImpl#getRMApps (it has finished and been swapped out of memory) but can still be queried from the fair ordering policy.
> To reproduce the schedulable entities leak in fair ordering policy:
> (1) Create app1 and launch container1 on node1.
> (2) Restart the RM.
> (3) Remove the app1 attempt; app1 is removed from the schedulable entities.
> (4) Recover container1: the state of container1 changes to COMPLETED, app1 is brought back into entitiesToReorder after the container is released, and app1 is then added back into the schedulable entities when the scheduler calls FairOrderingPolicy#getAssignmentIterator.
> (5) Remove app1.
> To solve this problem, we should make sure schedulableEntities can only be changed by adding or removing an app attempt; the reordering process should never add a new entity into schedulableEntities.
> {code:java}
> protected void reorderSchedulableEntity(S schedulableEntity) {
>   //remove, update comparable data, and reinsert to update position in order
>   schedulableEntities.remove(schedulableEntity);
>   updateSchedulingResourceUsage(
>       schedulableEntity.getSchedulingResourceUsage());
>   schedulableEntities.add(schedulableEntity);
> }
> {code}
> The code above can be improved as follows, so that only an entity that already exists can be re-added into schedulableEntities:
> {code:java}
> protected void reorderSchedulableEntity(S schedulableEntity) {
>   //remove, update comparable data, and reinsert to update position in order
>   boolean exists = schedulableEntities.remove(schedulableEntity);
>   updateSchedulingResourceUsage(
>       schedulableEntity.getSchedulingResourceUsage());
>   if (exists) {
>     schedulableEntities.add(schedulableEntity);
>   } else {
>     LOG.info("Skip reordering non-existent schedulable entity: "
>         + schedulableEntity.getId());
>   }
> }
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org