spark-issues mailing list archives

From "David McWhorter (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-4069) [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering itself from Yarn RM
Date Wed, 17 Dec 2014 23:07:13 GMT

    [ https://issues.apache.org/jira/browse/SPARK-4069?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14250762#comment-14250762
] 

David McWhorter commented on SPARK-4069:
----------------------------------------

Seeing the same behavior: a Spark application fails and the FinalStatus gets set to FAILED,
but the State hangs in FINISHING.  The application does not release its resources.  It continues
processing for some time in this state, then eventually finishes; the State transitions to
FINISHED and the resources are released.  But there is no way to kill the application in this
state and force it to release its executors.
Using Hadoop 2.2.0 and Spark 1.0.1.

> [SPARK-YARN] ApplicationMaster should release all executors' containers before unregistering
itself from Yarn RM
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-4069
>                 URL: https://issues.apache.org/jira/browse/SPARK-4069
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.1.0
>            Reporter: Min Zhou
>
> Currently, the ApplicationMaster in YARN mode simply unregisters itself from the YARN master,
a.k.a. the ResourceManager.  It never releases the executors' containers before that.  The YARN
master then decides to kill all the executors' containers when it faces this scenario, so the
ResourceManager log looks like the one below:
> {noformat}
> 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Processing event for appattempt_1414003182949_0004_000001 of type UNREGISTERED
> 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1414003182949_0004_000001 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
Updating application application_1414003182949_0004 with final state: FINISHING
> 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1414003182949_0004 State change from RUNNING to FINAL_SAVING
> 2014-10-22 23:39:09,903 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Processing event for appattempt_1414003182949_0004_000001 of type ATTEMPT_UPDATE_SAVED
> 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.recovery.RMStateStore:
Storing info for app: application_1414003182949_0004
> 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1414003182949_0004_000001 State change from FINAL_SAVING to FINISHING
> 2014-10-22 23:39:09,903 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1414003182949_0004 State change from FINAL_SAVING to FINISHING
> 2014-10-22 23:39:10,485 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
Processing event for appattempt_1414003182949_0004_000001 of type CONTAINER_FINISHED
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1414003182949_0004_01_000001 Container Transitioned from RUNNING to COMPLETED
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ApplicationMasterService:
Unregistering app attempt : appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerApp:
Completed container: container_1414003182949_0004_01_000001 in state: COMPLETED event:FINISHED
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
Finish information of container container_1414003182949_0004_01_000001 is written
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl:
appattempt_1414003182949_0004_000001 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=akim	OPERATION=AM Released Container	TARGET=SchedulerApp	RESULT=SUCCESS	APPID=application_1414003182949_0004
CONTAINERID=container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Stored the finish data of container container_1414003182949_0004_01_000001
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode:
Released container container_1414003182949_0004_01_000001 of capacity <memory:3072, vCores:1>
on host host1, which currently has 0 containers, <memory:0, vCores:0> used and <memory:241901,
vCores:32> available, release resources=true
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl:
application_1414003182949_0004 State change from FINISHING to FINISHED
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
Finish information of application attempt appattempt_1414003182949_0004_000001 is written
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=akim	OPERATION=Application Finished - Succeeded	TARGET=RMAppManager	RESULT=SUCCESS	APPID=application_1414003182949_0004
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Application attempt appattempt_1414003182949_0004_000001 released container container_1414003182949_0004_01_000001
on node: host: host2:8041 #containers=0 available=<memory:241901, vCores:32> used=<memory:0,
vCores:0> with event: FINISHED
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Stored the finish data of application attempt appattempt_1414003182949_0004_000001
> 2014-10-22 23:39:10,485 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler:
Application appattempt_1414003182949_0004_000001 is done. finalState=FINISHED
> 2014-10-22 23:39:10,486 INFO org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore:
Finish information of application application_1414003182949_0004 is written
> 2014-10-22 23:39:10,486 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_1414003182949_0004_01_000019 Container Transitioned from RUNNING to KILLED
> {noformat}
> Although this won't affect the job's final success status, the log will confuse users.
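
To make the ordering in the log excerpt above easier to check, here is a minimal Python sketch that extracts state transitions and lists containers that were KILLED only after the application had already reached FINISHED. The two regular expressions are taken from the message formats in the excerpt; the function names are made up for illustration, and the sketch assumes each log record is on a single line (the archive wraps them across two).

```python
import re

# Patterns based on the message formats in the ResourceManager log excerpt.
STATE_CHANGE = re.compile(r"(\S+) State change from (\w+) to (\w+)")
CONTAINER_CHANGE = re.compile(r"(\S+) Container Transitioned from (\w+) to (\w+)")


def transitions(lines):
    """Yield (entity_id, old_state, new_state) for each transition record."""
    for line in lines:
        match = STATE_CHANGE.search(line) or CONTAINER_CHANGE.search(line)
        if match:
            yield match.groups()


def containers_killed_after_finish(lines, app_id):
    """Return containers that moved to KILLED after app_id reached FINISHED."""
    app_finished = False
    killed = []
    for entity, _old, new in transitions(lines):
        if entity == app_id and new == "FINISHED":
            app_finished = True
        elif entity.startswith("container_") and new == "KILLED" and app_finished:
            killed.append(entity)
    return killed


# Hypothetical, shortened records in the same shape as the excerpt.
sample = [
    "... RMAppImpl: application_1414003182949_0004 State change from FINISHING to FINISHED",
    "... RMContainerImpl: container_1414003182949_0004_01_000019 Container Transitioned from RUNNING to KILLED",
]
print(containers_killed_after_finish(sample, "application_1414003182949_0004"))
```

On the two sample records this reports container_1414003182949_0004_01_000019, the executor container that the excerpt shows being killed after the application was marked FINISHED.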

> If we run a Spark job on YARN 2.4.1 with the timeline server enabled, we get errors
in the ResourceManager's log:
> {noformat}
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000019
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000017
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000009
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000010
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000012
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000003
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000005
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000004
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000015
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000018
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000013
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000008
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000014
> 2014-10-22 23:39:10,637 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000007
> 2014-10-22 23:39:10,638 ERROR org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter:
Error when storing the finish data of container container_1414003182949_0004_01_000002
> {noformat}
> This is because the application finishes before its containers are terminated.  Once
the executors' containers are killed, the ResourceManager tries to log the containers'
finish events, but can't find a writer because the application finished before that.
> {noformat}
> java.io.IOException: History file of application application_1414003182949_0003 is not
opened
>     org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.getHistoryFileWriter(FileSystemApplicationHistoryStore.java:643)
>     org.apache.hadoop.yarn.server.applicationhistoryservice.FileSystemApplicationHistoryStore.containerFinished(FileSystemApplicationHistoryStore.java:532)
>     org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter.handleWritingApplicationHistoryEvent(RMApplicationHistoryWriter.java:203)
>     org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:297)
>     org.apache.hadoop.yarn.server.resourcemanager.ahs.RMApplicationHistoryWriter$ForwardingEventHandler.handle(RMApplicationHistoryWriter.java:292)
>     org.apache.hadoop.yarn.event.AsyncDispatcher.dispatch(AsyncDispatcher.java:173)
>     org.apache.hadoop.yarn.event.AsyncDispatcher$1.run(AsyncDispatcher.java:106)
>     java.lang.Thread.run(Thread.java:745)
> {noformat}
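
The race described above can be illustrated with a small toy model. This is only a sketch under stated assumptions, not the YARN or Spark code: `HistoryStore` and `shut_down` are hypothetical names, and the store simply records an error whenever a container-finished event arrives after the application's writer has been closed. It compares the reported order (unregister first, executors killed afterwards) with the order the issue title asks for (release executors, then unregister).

```python
class HistoryStore:
    """Toy stand-in for a per-application history writer (not the YARN class)."""

    def __init__(self):
        self.open_apps = set()
        self.errors = []

    def application_started(self, app_id):
        self.open_apps.add(app_id)

    def application_finished(self, app_id):
        # Closing the writer: later container events for this app cannot be stored.
        self.open_apps.discard(app_id)

    def container_finished(self, app_id, container_id):
        if app_id not in self.open_apps:
            self.errors.append(
                f"History file of application {app_id} is not opened ({container_id})"
            )


def shut_down(store, app_id, executors, release_first):
    """Simulate ApplicationMaster shutdown; return the number of write errors."""
    store.application_started(app_id)
    if release_first:
        # Requested order: release the executors' containers, then unregister.
        for container_id in executors:
            store.container_finished(app_id, container_id)
        store.application_finished(app_id)
    else:
        # Reported order: unregister first; executors are killed afterwards.
        store.application_finished(app_id)
        for container_id in executors:
            store.container_finished(app_id, container_id)
    return len(store.errors)


executors = ["container_02", "container_03"]  # hypothetical container ids
print(shut_down(HistoryStore(), "app_1", executors, release_first=False))
print(shut_down(HistoryStore(), "app_1", executors, release_first=True))
```

In this model the reported order produces one error per executor container, while releasing first produces none, which matches the shape of the ERROR lines and the IOException quoted above.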



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

