spark-issues mailing list archives

From "Weizhong (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-14527) Job can't finish when restart all nodemanages with using external shuffle services
Date Mon, 11 Apr 2016 07:47:25 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14527?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Weizhong updated SPARK-14527:
-----------------------------
    Summary: Job can't finish when restart all nodemanages with using external shuffle services
 (was: Job can't finish when restart all nodemanage when using external shuffle services)

> Job can't finish when restart all nodemanages with using external shuffle services
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14527
>                 URL: https://issues.apache.org/jira/browse/SPARK-14527
>             Project: Spark
>          Issue Type: Bug
>          Components: Shuffle, Spark Core, YARN
>            Reporter: Weizhong
>            Priority: Minor
>
> 1) Submit a wordcount app.
> 2) Stop all NodeManagers while the first stage is running.
> 3) After a few minutes, start all NodeManagers again.
> The job now fails at the ResultStage, retries the ShuffleMapStage, and then fails at the ResultStage again. It keeps running in this loop and never finishes.
> This happens because when all NMs are stopped, the containers stay alive, but the executor info stored on the NM (YarnShuffleService) is lost. So even after all the NMs recover, the ResultStage tasks fail when they fetch shuffle data.
> {noformat}
> 16/04/06 17:02:14 WARN TaskSetManager: Lost task 2.0 in stage 1.11 (TID 220, spark-1): FetchFailed(BlockManagerId(3, 192.168.42.175, 27337), shuffleId=0, mapId=4, reduceId=2, message=
> org.apache.spark.shuffle.FetchFailedException: java.lang.RuntimeException: Executor is not registered (appId=application_1459927459378_0005, execId=3)
> ...
> 16/04/06 17:02:14 INFO YarnScheduler: Removed TaskSet 1.11, whose tasks have all completed, from pool
> 16/04/06 17:02:14 INFO DAGScheduler: Resubmitting ShuffleMapStage 0 (map at wordcountWithSave.scala:21) and ResultStage 1 (saveAsTextFile at wordcountWithSave.scala:32) due to fetch failure
> {noformat}
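
The retry loop described above can be illustrated with a small toy model. This is only a sketch of the reported behaviour, not Spark or YARN code: ToyShuffleService, run_job and max_attempts are hypothetical names, and the model assumes that the service keeps executor registrations in memory only and that a surviving executor does not register again after the service restarts.

```python
class ToyShuffleService:
    """Toy stand-in for the per-node shuffle service (hypothetical, not Spark's class)."""

    def __init__(self):
        # (app_id, exec_id) pairs, held in memory only
        self.registered = set()

    def register(self, app_id, exec_id):
        self.registered.add((app_id, exec_id))

    def restart(self):
        # Assumption: nothing is persisted, so a restart empties the registry.
        self.registered.clear()

    def fetch(self, app_id, exec_id):
        if (app_id, exec_id) not in self.registered:
            raise RuntimeError(
                f"Executor is not registered (appId={app_id}, execId={exec_id})"
            )
        return "shuffle-block"


def run_job(service, app_id, exec_id, max_attempts):
    """Return the attempt number on which the result stage succeeded, or None."""
    for attempt in range(1, max_attempts + 1):
        # Map stage rerun: in this model the executor is still alive and
        # recomputes its output, but it never registers with the service again.
        try:
            service.fetch(app_id, exec_id)  # result stage fetches shuffle data
            return attempt
        except RuntimeError:
            continue  # fetch failure: both stages are resubmitted
    return None


service = ToyShuffleService()
service.register("app", "3")
print(run_job(service, "app", "3", max_attempts=5))  # succeeds on attempt 1

service.restart()  # all NMs stopped and started again
print(run_job(service, "app", "3", max_attempts=5))  # None: every attempt fails
```

In this model the job only recovers if something registers the executor with the service again; rerunning the map stage alone never changes the outcome, which matches the loop seen in the log.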



