hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (YARN-3742) YARN RM will shut down if ZKClient creation times out
Date Sat, 04 Mar 2017 18:57:46 GMT

     [ https://issues.apache.org/jira/browse/YARN-3742?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel

Daniel Templeton updated YARN-3742:
    Attachment: YARN-3742.001.patch

Hmmm...  This is a little more complicated than I thought.  Looking at YARN-2759 and YARN-2814,
it looks like the original plan was to remove the {{RMFatalEventDispatcher}} entirely, in
favor of distributing the job of transitioning the RM or shutting it down to all the the various
places where that could be triggered.  I disagree, though.  I think it makes a lot more sense
to have the RM be in control of its own destiny and force other services to trigger the RM's
state change through the dispatcher.

The original issue that YARN-2759 was trying to solve was a deadlock caused by the {{RMFatalEventDispatcher}}
trying to transition to standby.  The solution was to move the transition into a thread *and*
move that thread into the state store.  Just moving the transition into a thread would have
been enough to solve the problem.  I've attached a patch that restores the {{RMFatalEventDispatcher}}
to its original role in a way that doesn't cause a deadlock.  All tests from YARN-2759 and
subsequent patches pass, so I think we're safe.  As an added bonus, the RM will now not die
spontaneously just because something unexpected happened.

[~kasha], as the author of YARN-2814, I would love to hear your feedback.

> YARN RM  will shut down if ZKClient creation times out 
> -------------------------------------------------------
>                 Key: YARN-3742
>                 URL: https://issues.apache.org/jira/browse/YARN-3742
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>    Affects Versions: 2.7.0
>            Reporter: Wilfred Spiegelenburg
>            Assignee: Daniel Templeton
>         Attachments: YARN-3742.001.patch
> The RM goes down showing the following stacktrace if the ZK client connection fails to
be created. We should not exit but transition to StandBy and stop doing things and let the
other RM take over.
> {code}
> 2015-04-19 01:22:20,513  FATAL org.apache.hadoop.yarn.server.resourcemanager.ResourceManager:
Received a org.apache.hadoop.yarn.server.resourcemanager.RMFatalEvent of type STATE_STORE_OP_FAILED.
> java.io.IOException: Wait for ZKClient creation timed out
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithCheck(ZKRMStateStore.java:1066)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore$ZKAction.runWithRetries(ZKRMStateStore.java:1090)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.existsWithRetries(ZKRMStateStore.java:996)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.ZKRMStateStore.updateApplicationStateInternal(ZKRMStateStore.java:643)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:162)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$UpdateAppTransition.transition(RMStateStore.java:147)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$SingleInternalArc.doTransition(StateMachineFactory.java:362)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.doTransition(StateMachineFactory.java:302)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory.access$300(StateMachineFactory.java:46)
> 	at org.apache.hadoop.yarn.state.StateMachineFactory$InternalStateMachine.doTransition(StateMachineFactory.java:448)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore.handleStoreEvent(RMStateStore.java:806)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:879)
> 	at org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore$ForwardingEventHandler.handle(RMStateStore.java:874)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
> 	at org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
> 	at java.lang.Thread.run(Thread.java:745)
> {code}

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

View raw message