Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 6D089200CC8 for ; Fri, 14 Jul 2017 18:11:06 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 6B74A16DD62; Fri, 14 Jul 2017 16:11:06 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id B140116DD5F for ; Fri, 14 Jul 2017 18:11:05 +0200 (CEST) Received: (qmail 85808 invoked by uid 500); 14 Jul 2017 16:11:04 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 85797 invoked by uid 99); 14 Jul 2017 16:11:04 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd3-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 14 Jul 2017 16:11:04 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd3-us-west.apache.org (ASF Mail Server at spamd3-us-west.apache.org) with ESMTP id 5E47A1806E1 for ; Fri, 14 Jul 2017 16:11:04 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd3-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-eu.apache.org ([10.40.0.8]) by localhost (spamd3-us-west.apache.org [10.40.0.10]) (amavisd-new, port 10024) with ESMTP id kQ2xP2GszA1O for ; Fri, 14 Jul 2017 16:11:02 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-eu.apache.org (ASF Mail Server at mx1-lw-eu.apache.org) with ESMTP id 80CF25FCB0 for ; Fri, 14 Jul 2017 16:11:01 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id ABB89E0D3C for ; Fri, 14 Jul 2017 16:11:00 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id 2B00924748 for ; Fri, 14 Jul 2017 16:11:00 +0000 (UTC) Date: Fri, 14 Jul 2017 16:11:00 +0000 (UTC) From: "Tao Yang (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (YARN-6714) IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Fri, 14 Jul 2017 16:11:06 -0000 [ https://issues.apache.org/jira/browse/YARN-6714?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Tao Yang updated YARN-6714: --------------------------- Attachment: YARN-6714.branch-2.006.patch > IllegalStateException while handling APP_ATTEMPT_REMOVED event when async-scheduling enabled in CapacityScheduler > ----------------------------------------------------------------------------------------------------------------- > > Key: YARN-6714 > URL: https://issues.apache.org/jira/browse/YARN-6714 > Project: Hadoop YARN > Issue Type: Bug > Affects Versions: 2.9.0, 3.0.0-alpha3 > Reporter: Tao Yang > Assignee: Tao Yang > Attachments: YARN-6714.001.patch, YARN-6714.002.patch, YARN-6714.003.patch, YARN-6714.branch-2.003.patch, YARN-6714.branch-2.004.patch, YARN-6714.branch-2.005.patch, YARN-6714.branch-2.006.patch > > > Currently in async-scheduling mode of CapacityScheduler, after AM failover and unreserve all reserved containers, it still have chance to get and commit the outdated reserve proposal of the failed app attempt. This problem happened on an app in our cluster, when this app stopped, it unreserved all reserved containers and compared these appAttemptId with current appAttemptId, if not match it will throw IllegalStateException and make RM crashed. > Error log: > {noformat} > 2017-06-08 11:02:24,339 FATAL [ResourceManager Event Processor] org.apache.hadoop.yarn.server.resourcemanager.ResourceManager: Error in handling event type APP_ATTEMPT_REMOVED to the scheduler > java.lang.IllegalStateException: Trying to unreserve for application appattempt_1495188831758_0121_000002 when currently reserved for application application_1495188831758_0121 on node host: node1:45454 #containers=2 available=... used=... > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerNode.unreserveResource(FiCaSchedulerNode.java:123) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp.unreserve(FiCaSchedulerApp.java:845) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue.completedContainer(LeafQueue.java:1787) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.completedContainerInternal(CapacityScheduler.java:1957) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler.completedContainer(AbstractYarnScheduler.java:586) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.doneApplicationAttempt(CapacityScheduler.java:966) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:1740) > at org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler.handle(CapacityScheduler.java:152) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:822) > at java.lang.Thread.run(Thread.java:834) > {noformat} > When async-scheduling enabled, CapacityScheduler#doneApplicationAttempt and CapacityScheduler#tryCommit both need to get write_lock before executing, so we can check the app attempt state in commit process to avoid committing outdated proposals. -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: yarn-issues-help@hadoop.apache.org