spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-13642) Properly handle signal kill of ApplicationMaster
Date Mon, 14 Mar 2016 08:53:33 GMT

    [ https://issues.apache.org/jira/browse/SPARK-13642?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15192928#comment-15192928 ]

Apache Spark commented on SPARK-13642:
--------------------------------------

User 'jerryshao' has created a pull request for this issue:
https://github.com/apache/spark/pull/11693

> Properly handle signal kill of ApplicationMaster
> ------------------------------------------------
>
>                 Key: SPARK-13642
>                 URL: https://issues.apache.org/jira/browse/SPARK-13642
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.6.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>
> Currently, when running Spark on YARN in yarn-cluster mode, the default application final state is "SUCCEEDED"; if an exception occurs, the final state is changed to "FAILED" and a reattempt is triggered if possible.
> This is fine in the normal case, but there is a race condition when the AM receives a signal (SIGTERM) and no exception occurs: the shutdown hook is invoked, marks the application as finished successfully, and no further attempt is made.
> In that situation the application has actually failed from Spark's point of view and needs another attempt, but from YARN's point of view it finished successfully.
> This can happen when a NodeManager fails: the NM failure sends SIGTERM to the AM, and the AM should mark the attempt as failed and rerun, rather than invoke unregister.
> To increase the chance of hitting this race condition, here is code to reproduce it:
> {code}
> val sc = ...                         // SparkContext created as usual
> Thread.sleep(30000L)                 // window in which the AM can be SIGTERMed
> sc.parallelize(1 to 100).collect()   // never reached if the AM is killed during the sleep
> {code}
> If the AM is killed while sleeping, no exception is thrown, so from YARN's point of view the application finished successfully, but from Spark's point of view it should be reattempted.
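> As a minimal standalone illustration (not Spark code; the object name is made up), a SIGTERM delivered while the main thread sleeps only runs the JVM shutdown hooks; the sleeping thread is not interrupted and sees no exception:
> {code}
> object SigtermSleepDemo {
>   def main(args: Array[String]): Unit = {
>     val mainThread = Thread.currentThread()
>     // On SIGTERM only shutdown hooks run; the sleeping thread is left alone.
>     Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
>       def run(): Unit = println("shutdown hook: main thread is " + mainThread.getState)
>     }))
>     Thread.sleep(30000L)  // send `kill -TERM <pid>` during this sleep
>     println("only reached if the process was NOT killed while sleeping")
>   }
> }
> {code}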
> The log typically looks like this:
> {noformat}
> 16/03/03 12:44:19 INFO ContainerManagementProtocolProxy: Opening proxy : 192.168.0.105:45454
> 16/03/03 12:44:21 INFO YarnClusterSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (192.168.0.105:57461) with ID 2
> 16/03/03 12:44:21 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.105:57462 with 511.1 MB RAM, BlockManagerId(2, 192.168.0.105, 57462)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: Registered executor NettyRpcEndpointRef(null) (192.168.0.105:57467) with ID 1
> 16/03/03 12:44:23 INFO BlockManagerMasterEndpoint: Registering block manager 192.168.0.105:57468 with 511.1 MB RAM, BlockManagerId(1, 192.168.0.105, 57468)
> 16/03/03 12:44:23 INFO YarnClusterSchedulerBackend: SchedulerBackend is ready for scheduling beginning after reached minRegisteredResourcesRatio: 0.8
> 16/03/03 12:44:23 INFO YarnClusterScheduler: YarnClusterScheduler.postStartHook done
> 16/03/03 12:44:39 ERROR ApplicationMaster: RECEIVED SIGNAL 15: SIGTERM
> 16/03/03 12:44:39 INFO SparkContext: Invoking stop() from shutdown hook
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/metrics/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/kill,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/api,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/static,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/threadDump,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/executors,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/environment,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/rdd,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/storage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/pool,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/stage,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/stages,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/job,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs/json,null}
> 16/03/03 12:44:39 INFO ContextHandler: stopped o.e.j.s.ServletContextHandler{/jobs,null}
> 16/03/03 12:44:39 INFO SparkUI: Stopped Spark web UI at http://192.168.0.105:57452
> 16/03/03 12:44:39 INFO YarnAllocator: Driver requested a total number of 0 executor(s).
> 16/03/03 12:44:39 INFO YarnAllocator: Canceling requests for 0 executor containers
> 16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Shutting down all executors
> 16/03/03 12:44:39 WARN YarnAllocator: Expected to find pending requests, but found none.
> 16/03/03 12:44:39 INFO YarnClusterSchedulerBackend: Asking each executor to shut down
> 16/03/03 12:44:39 INFO SchedulerExtensionServices: Stopping SchedulerExtensionServices
> (serviceOption=None,
>  services=List(),
>  started=false)
> 16/03/03 12:44:39 INFO MapOutputTrackerMasterEndpoint: MapOutputTrackerMasterEndpoint stopped!
> 16/03/03 12:44:39 INFO MemoryStore: MemoryStore cleared
> 16/03/03 12:44:39 INFO BlockManager: BlockManager stopped
> 16/03/03 12:44:39 INFO BlockManagerMaster: BlockManagerMaster stopped
> 16/03/03 12:44:39 INFO OutputCommitCoordinator$OutputCommitCoordinatorEndpoint: OutputCommitCoordinator stopped!
> 16/03/03 12:44:39 INFO SparkContext: Successfully stopped SparkContext
> 16/03/03 12:44:39 INFO ApplicationMaster: Final app status: SUCCEEDED, exitCode: 0, (reason: Shutdown hook called before final status was reported.)
> 16/03/03 12:44:39 INFO ApplicationMaster: Unregistering ApplicationMaster with SUCCEEDED (diag message: Shutdown hook called before final status was reported.)
> 16/03/03 12:44:39 INFO AMRMClientImpl: Waiting for application to be successfully unregistered.
> 16/03/03 12:44:39 INFO ApplicationMaster: Deleting staging directory .sparkStaging/application_1456975940304_0004
> 16/03/03 12:44:40 INFO ShutdownHookManager: Shutdown hook called
> {noformat}
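> To confirm YARN's view of the run, here is a hedged sketch using the YARN client API (this assumes a Hadoop 2.4+ client; the application id is copied from the log above):
> {code}
> import org.apache.hadoop.yarn.client.api.YarnClient
> import org.apache.hadoop.yarn.conf.YarnConfiguration
> import org.apache.hadoop.yarn.util.ConverterUtils
>
> object CheckFinalStatus {
>   def main(args: Array[String]): Unit = {
>     val client = YarnClient.createYarnClient()
>     client.init(new YarnConfiguration())
>     client.start()
>     val appId = ConverterUtils.toApplicationId("application_1456975940304_0004")
>     // Prints SUCCEEDED even though the AM was killed by SIGTERM...
>     println(client.getApplicationReport(appId).getFinalApplicationStatus)
>     // ...and only one attempt is listed, i.e. YARN never retried.
>     println(client.getApplicationAttempts(appId).size())
>     client.stop()
>   }
> }
> {code}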
> So basically, I think we should mark the application as "SUCCEEDED" only after the user class has finished; otherwise, especially in the signal-stop scenario, it would be better to mark the attempt as failed and try again (except when YARN explicitly issues a KILL command).
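> A minimal sketch of that direction (hypothetical names, not the actual ApplicationMaster code): track whether the user class has finished, and have the shutdown hook report FAILED and skip unregister when it has not, so YARN can schedule another attempt:
> {code}
> object AmFinalStatusSketch {
>   // Hypothetical stand-in for the AM's bookkeeping of the user class.
>   @volatile private var userClassFinished = false
>
>   def main(args: Array[String]): Unit = {
>     Runtime.getRuntime.addShutdownHook(new Thread(new Runnable {
>       def run(): Unit = {
>         if (userClassFinished) {
>           println("unregister(SUCCEEDED)")      // user class really completed
>         } else {
>           // Killed mid-run: report failure and do NOT unregister,
>           // so YARN treats this attempt as failed and may retry.
>           println("final status FAILED, unregister skipped")
>         }
>       }
>     }))
>
>     runUserClass()              // the Spark user class (driver) would run here
>     userClassFinished = true
>   }
>
>   private def runUserClass(): Unit = Thread.sleep(30000L)  // placeholder
> }
> {code}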



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

