Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id D4E25200CED for ; Fri, 18 Aug 2017 19:56:08 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id D362C16D246; Fri, 18 Aug 2017 17:56:08 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id 0722C16D240 for ; Fri, 18 Aug 2017 19:56:07 +0200 (CEST) Received: (qmail 4692 invoked by uid 500); 18 Aug 2017 17:56:05 -0000 Mailing-List: contact yarn-issues-help@hadoop.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Delivered-To: mailing list yarn-issues@hadoop.apache.org Received: (qmail 4681 invoked by uid 99); 18 Aug 2017 17:56:05 -0000 Received: from pnap-us-west-generic-nat.apache.org (HELO spamd1-us-west.apache.org) (209.188.14.142) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 18 Aug 2017 17:56:05 +0000 Received: from localhost (localhost [127.0.0.1]) by spamd1-us-west.apache.org (ASF Mail Server at spamd1-us-west.apache.org) with ESMTP id 7D014C12E5 for ; Fri, 18 Aug 2017 17:56:05 +0000 (UTC) X-Virus-Scanned: Debian amavisd-new at spamd1-us-west.apache.org X-Spam-Flag: NO X-Spam-Score: -99.202 X-Spam-Level: X-Spam-Status: No, score=-99.202 tagged_above=-999 required=6.31 tests=[KAM_ASCII_DIVIDERS=0.8, RP_MATCHES_RCVD=-0.001, SPF_PASS=-0.001, USER_IN_WHITELIST=-100] autolearn=disabled Received: from mx1-lw-us.apache.org ([10.40.0.8]) by localhost (spamd1-us-west.apache.org [10.40.0.7]) (amavisd-new, port 10024) with ESMTP id nPSus0LSfUro for ; Fri, 18 Aug 2017 17:56:04 +0000 (UTC) Received: from mailrelay1-us-west.apache.org (mailrelay1-us-west.apache.org [209.188.14.139]) by mx1-lw-us.apache.org (ASF Mail Server at mx1-lw-us.apache.org) with ESMTP id A50D55FE16 for ; Fri, 18 Aug 2017 17:56:03 +0000 (UTC) Received: from jira-lw-us.apache.org (unknown [207.244.88.139]) by mailrelay1-us-west.apache.org (ASF Mail Server at mailrelay1-us-west.apache.org) with ESMTP id D04D2E0A32 for ; Fri, 18 Aug 2017 17:56:02 +0000 (UTC) Received: from jira-lw-us.apache.org (localhost [127.0.0.1]) by jira-lw-us.apache.org (ASF Mail Server at jira-lw-us.apache.org) with ESMTP id BAE4C2538F for ; Fri, 18 Aug 2017 17:56:00 +0000 (UTC) Date: Fri, 18 Aug 2017 17:56:00 +0000 (UTC) From: "Jason Lowe (JIRA)" To: yarn-issues@hadoop.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Updated] (YARN-2416) InvalidStateTransitonException in ResourceManager if AMLauncher does not receive response for startContainers() call in time MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Fri, 18 Aug 2017 17:56:09 -0000 [ https://issues.apache.org/jira/browse/YARN-2416?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ] Jason Lowe updated YARN-2416: ----------------------------- Priority: Critical (was: Major) Bumping the priority of this since easily leads to infinite job hangs: # AM races ahead of launch event and tries to register. # ApplicationMasterService ends up completing the registration from its point of view and fires event to app state machine # App state machine still thinks it hasn't launched yet, so it throws the stacktrace listed above and remains in the ALLOCATED state. # Eventually launch even arrives and we move to the LAUNCHED state. # We can never move to the RUNNING state since the app and ApplicationMasterService both think we've registered but the app state machine does not. # App tries to unregister. Since ApplicationMasterService thinks it has registered, it tickles the expiration watchdog and sends the event to the app state machine. App is told unregister was false because it hasn't been saved in a terminal state # App state machine barfs and ignores unregister event since app is not RUNNING (and never will be) # Rinse, repeat unregistration forever until app is manually killed. The RM will never expire the app because ApplicationMasterService keeps tickling the watchdog on each unregister attempt. bq. Now AM.register and Event.Launched is asynchronous. If need to change it to be synchronous? This is not easy to do. There could be a network cut between the RM and the NM or a large GC event just as the NM tries to send the RPC response to the RM, but meanwhile the AM proceeds to launch and ends up registering to the RM just as the NM RPC response arrives. The only way to enforce ordering there is to temporarily reject AM registrations until the state machine has seen the launch event. I'd much prefer to just fix the state machine to expect these two events to race inherently as originally suggested. > InvalidStateTransitonException in ResourceManager if AMLauncher does not receive response for startContainers() call in time > ---------------------------------------------------------------------------------------------------------------------------- > > Key: YARN-2416 > URL: https://issues.apache.org/jira/browse/YARN-2416 > Project: Hadoop YARN > Issue Type: Bug > Components: resourcemanager > Affects Versions: 2.4.0 > Reporter: Jian Fang > Assignee: Jonathan Eagles > Priority: Critical > > AMLauncher calls startContainers(allRequests) to launch a container for application master. Normally, the call comes back immediately so that the RMAppAttempt changes its state from ALLOCATED to LAUNCHED. > However, we do observed that in some cases, the RPC call came back very late but the AM container was already started. Because the RMAppAttempt stuck in ALLOCATED state, once resource manager received the REGISTERED event from the application master, it threw InvalidStateTransitonException as follows. > 2014-07-05 08:59:05,021 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: REGISTERED at ALLOCATED > at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652) > at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) > at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > For subsequent STATUS_UPDATE and CONTAINER_ALLOCATED events for this job, resource manager kept throwing InvalidStateTransitonException. > 2014-07-05 08:59:06,152 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: STATUS_UPDATE at ALLOCATED > at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652) > at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) > at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) > 2014-07-05 08:59:07,779 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_1404549222428_0001_02_000002 Container Transitioned from NEW to > ALLOCATED > 2014-07-05 08:59:07,779 ERROR org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Can't handle this event at current state > org.apache.hadoop.yarn.state.InvalidStateTransitonException: Invalid event: CONTAINER_ALLOCATED at ALLOCATED > at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:305) > at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46) > at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448) > at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:652) > at org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl.handle(RMAppAttemptImpl.java:106) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:752) > at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$ApplicationAttemptEventDispatcher.handle(ResourceManager.java:733) > at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173) > at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106) > at java.lang.Thread.run(Thread.java:744) -- This message was sent by Atlassian JIRA (v6.4.14#64029) --------------------------------------------------------------------- To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org For additional commands, e-mail: yarn-issues-help@hadoop.apache.org