hadoop-yarn-issues mailing list archives

From "Weiwei Yang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-5937) stop-yarn.sh is not able to gracefully stop node managers
Date Mon, 28 Nov 2016 16:46:58 GMT
Weiwei Yang created YARN-5937:
---------------------------------

             Summary: stop-yarn.sh is not able to gracefully stop node managers
                 Key: YARN-5937
                 URL: https://issues.apache.org/jira/browse/YARN-5937
             Project: Hadoop YARN
          Issue Type: Bug
            Reporter: Weiwei Yang
            Assignee: Weiwei Yang


stop-yarn.sh always gives the following output:

{code}
./sbin/stop-yarn.sh
Stopping resourcemanager
Stopping nodemanagers
<NM_HOST>: WARNING: nodemanager did not stop gracefully after 5 seconds: Trying to kill with kill -9
oracle1.fyre.ibm.com: ERROR: Unable to kill 18097
{code}

This is because the resource manager is stopped before the node managers. When the shutdown
hook manager tries to gracefully stop the NM services, the NM needs to unregister with the RM,
and this times out because the NM cannot connect to the RM (already stopped). See the log
below (stop the RM, then run kill <nm_pid>):

{code}
16/11/28 08:26:43 ERROR nodemanager.NodeManager: RECEIVED SIGNAL 15: SIGTERM
...
16/11/28 08:26:53 WARN util.ShutdownHookManager: ShutdownHook 'CompositeServiceShutdownHook' timeout, java.util.concurrent.TimeoutException
java.util.concurrent.TimeoutException
	at java.util.concurrent.FutureTask.get(FutureTask.java:205)
	at org.apache.hadoop.util.ShutdownHookManager$1.run(ShutdownHookManager.java:67)
...
	at org.apache.hadoop.yarn.server.nodemanager.NodeStatusUpdaterImpl.unRegisterNM(NodeStatusUpdaterImpl.java:291)
...
16/11/28 08:27:13 ERROR util.ShutdownHookManager: ShutdownHookManger shutdown forcefully.
{code}

The shutdown hook manager has a default timeout of 10s, so if the RM is stopped before the
NMs, they always take more than 10s to stop (in the Java code). However, stop-yarn.sh only
allows a 5s timeout, so the NM is always killed instead of stopped gracefully.
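
To make the timing concrete, here is a minimal sketch of a stop helper that behaves the way the
output above suggests: send SIGTERM, wait a fixed number of seconds, then fall back to kill -9.
The name graceful_stop and its return codes are made up for illustration; this is not the code
in the Hadoop scripts.

```shell
# Hypothetical helper (illustration only, not the Hadoop implementation).
# Usage: graceful_stop <pid> <timeout_seconds>
# Returns 0 if the process exited after SIGTERM within the timeout,
#         1 if it had to be killed with SIGKILL,
#         2 if it is still alive even after SIGKILL.
graceful_stop() {
  local pid=$1 timeout=$2 waited=0

  # Ask the process to shut down; if it cannot be signalled, treat it as gone.
  kill -TERM "$pid" 2>/dev/null || return 0

  # Poll once per second until the process disappears or the timeout expires.
  while kill -0 "$pid" 2>/dev/null; do
    if [ "$waited" -ge "$timeout" ]; then
      echo "WARNING: $pid did not stop after ${timeout}s, sending SIGKILL" >&2
      kill -KILL "$pid" 2>/dev/null
      sleep 1
      if kill -0 "$pid" 2>/dev/null; then
        echo "ERROR: unable to kill $pid" >&2
        return 2
      fi
      return 1
    fi
    sleep 1
    waited=$((waited + 1))
  done
  return 0
}
```

With a timeout of 5 and a process whose shutdown hook needs 10s or more, a helper like this
always ends up in the SIGKILL branch, which matches what is reported here.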

It would make sense for this script to stop the NMs before the RM, so that they can shut down gracefully.
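
The proposed ordering could look roughly like the following. This is a sketch, not a patch:
stop_nodemanagers and stop_resourcemanager are hypothetical stand-ins for whatever commands
the script actually runs to stop each daemon.

```shell
# Sketch only: stop_nodemanagers and stop_resourcemanager are hypothetical
# placeholders, to be defined by the caller, for the real stop commands.
stop_yarn_ordered() {
  # Stop the NMs first, while the RM is still up, so each NM can
  # unregister with the RM during its shutdown hook.
  stop_nodemanagers || echo "WARNING: some nodemanagers did not stop cleanly" >&2

  # Only then take down the RM.
  stop_resourcemanager
}
```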



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)


