spark-issues mailing list archives

From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Assigned] (SPARK-14485) Task finished causes fetch failure when its executor has already been removed by driver
Date Fri, 08 Apr 2016 08:01:25 GMT

     [ https://issues.apache.org/jira/browse/SPARK-14485?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Apache Spark reassigned SPARK-14485:
------------------------------------

    Assignee: Apache Spark

> Task finished causes fetch failure when its executor has already been removed by driver 
> ----------------------------------------------------------------------------------
>
>                 Key: SPARK-14485
>                 URL: https://issues.apache.org/jira/browse/SPARK-14485
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 1.3.1, 1.5.2
>            Reporter: iward
>            Assignee: Apache Spark
>
> Currently, when the driver removes an executor because its heartbeats have timed out, the driver re-queues the tasks running on that executor and sends a kill command to the cluster to terminate it.
> However, a task running on that executor may finish and return its result to the driver before the kill command takes effect. In that case, the driver accepts the task-finished event and ignores the speculative and re-queued copies of the task. But since the executor has already been removed by the driver, the result of the finished task cannot be saved on the driver side, because its *BlockManagerId* has also been removed from the *BlockManagerMaster*. As a result, the output of this stage is incomplete, which later causes a fetch failure.
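> The race can be sketched with a deliberately simplified model (the names `Driver`, `expireExecutor`, `taskFinished`, etc. are illustrative, not Spark's real internals): the heartbeat timeout drops the executor's map outputs, yet a late task-finished event still marks the partition successful.

```scala
// Hypothetical, minimal model of the race described above.
object HeartbeatRace {
  final case class MapStatus(executorId: String)

  class Driver {
    private val aliveExecutors = scala.collection.mutable.Set.empty[String]
    private val mapOutputs = scala.collection.mutable.Map.empty[Int, MapStatus]
    private val successfulPartitions = scala.collection.mutable.Set.empty[Int]

    def registerExecutor(id: String): Unit = aliveExecutors += id

    // Heartbeat timeout: remove the executor AND every map output it hosted,
    // mirroring what removing its BlockManagerId from BlockManagerMaster does.
    def expireExecutor(id: String): Unit = {
      aliveExecutors -= id
      val dead = mapOutputs.collect { case (p, st) if st.executorId == id => p }
      mapOutputs --= dead
    }

    // Current behavior: a late task-finished event is accepted even if the
    // executor was already removed, so the partition counts as successful
    // while its shuffle output location is gone.
    def taskFinished(partition: Int, executorId: String): Unit = {
      successfulPartitions += partition
      if (aliveExecutors.contains(executorId))
        mapOutputs(partition) = MapStatus(executorId)
      // else: the output location was already dropped -> silently missing
    }

    def stageLooksComplete(numPartitions: Int): Boolean =
      successfulPartitions.size == numPartitions

    def outputsComplete(numPartitions: Int): Boolean =
      mapOutputs.size == numPartitions
  }
}
```

> In this model, expiring executor 322 and then delivering the finished event for its task leaves the stage looking complete while a shuffle output has no location, which is exactly the state that produces the MetadataFetchFailedException below.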
> For example, the following is the task log:
> {noformat}
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN HeartbeatReceiver: Removing executor
322 with no recent heartbeats: 256015 ms exceeds timeout 250000 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 ERROR YarnScheduler: Lost executor 322 on
BJHC-HERA-16168.hadoop.jd.local: Executor heartbeat timed out after 256015 ms
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO TaskSetManager: Re-queueing tasks for
322 from TaskSet 107.0
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 WARN TaskSetManager: Lost task 229.0 in stage
107.0 (TID 10384, BJHC-HERA-16168.hadoop.jd.local): ExecutorLostFailure (executor 322 lost)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO DAGScheduler: Executor lost: 322 (epoch
11)
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMasterEndpoint: Trying to
remove executor 322 from BlockManagerMaster.
> 2015-12-31 04:38:50 INFO 15/12/31 04:38:50 INFO BlockManagerMaster: Removed 322 successfully
in removeExecutor
> {noformat}
> {noformat}
> 2015-12-31 04:38:52 INFO 15/12/31 04:38:52 INFO TaskSetManager: Finished task 229.0 in
stage 107.0 (TID 10384) in 272315 ms on BJHC-HERA-16168.hadoop.jd.local (579/700)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Ignoring task-finished
event for 229.1 in stage 107.0 because task 229 has already completed successfully
> {noformat}
> {noformat}
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO DAGScheduler: Submitting 3 missing tasks
from ShuffleMapStage 107 (MapPartitionsRDD[263] at mapPartitions at Exchange.scala:137)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO YarnScheduler: Adding task set 107.1
with 3 tasks
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 0.0 in
stage 107.1 (TID 10863, BJHC-HERA-18043.hadoop.jd.local, PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 1.0 in
stage 107.1 (TID 10864, BJHC-HERA-9291.hadoop.jd.local, PROCESS_LOCAL, 3745 bytes)
> 2015-12-31 04:40:12 INFO 15/12/31 04:40:12 INFO TaskSetManager: Starting task 2.0 in
stage 107.1 (TID 10865, BJHC-HERA-16047.hadoop.jd.local, PROCESS_LOCAL, 3745 bytes)
> {noformat}
> The driver detects that the stage's output is incomplete and resubmits the missing tasks. But by this time the next stage has already started running, because the previous stage was considered finished once all of its tasks had completed, even though its output was incomplete.
> {noformat}
> 2015-12-31 04:40:13 INFO 15/12/31 04:40:13 WARN TaskSetManager: Lost task 39.0 in stage
109.0 (TID 10905, BJHC-HERA-9357.hadoop.jd.local): FetchFailed(null, shuffleId=11, mapId=-1,
reduceId=39, message=
> 2015-12-31 04:40:13 INFO org.apache.spark.shuffle.MetadataFetchFailedException: Missing
an output location for shuffle 11
> 2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:385)
> 2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker$$anonfun$org$apache$spark$MapOutputTracker$$convertMapStatuses$1.apply(MapOutputTracker.scala:382)
> 2015-12-31 04:40:13 INFO at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at scala.collection.TraversableLike$$anonfun$map$1.apply(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at scala.collection.IndexedSeqOptimized$class.foreach(IndexedSeqOptimized.scala:33)
> 2015-12-31 04:40:13 INFO at scala.collection.mutable.ArrayOps$ofRef.foreach(ArrayOps.scala:108)
> 2015-12-31 04:40:13 INFO at scala.collection.TraversableLike$class.map(TraversableLike.scala:244)
> 2015-12-31 04:40:13 INFO at scala.collection.mutable.ArrayOps$ofRef.map(ArrayOps.scala:108)
> 2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker$.org$apache$spark$MapOutputTracker$$convertMapStatuses(MapOutputTracker.scala:381)
> 2015-12-31 04:40:13 INFO at org.apache.spark.MapOutputTracker.getServerStatuses(MapOutputTracker.scala:172)
> 2015-12-31 04:40:13 INFO at org.apache.spark.shuffle.sort.SortShuffleReader.computeShuffleBlocks(SortShuffleReader.scala:301)
> 2015-12-31 04:40:13 INFO at org.apache.spark.shuffle.sort.SortShuffleReader.read(SortShuffleReader.scala:111)
> 2015-12-31 04:40:13 INFO at org.apache.spark.shuffle.sort.MixedShuffleReader.read(MixedShuffleReader.scala:41)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.ShuffledRDD.compute(ShuffledRDD.scala:90)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.ZippedPartitionsRDD2.compute(ZippedPartitionsRDD.scala:88)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.MapPartitionsRDD.compute(MapPartitionsRDD.scala:35)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.computeOrReadCheckpoint(RDD.scala:277)
> 2015-12-31 04:40:13 INFO at org.apache.spark.rdd.RDD.iterator(RDD.scala:244)
> 2015-12-31 04:40:13 INFO at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:70)
> 2015-12-31 04:40:13 INFO at org.apache.spark.scheduler.ShuffleMapTask.runTask(ShuffleMapTask.scala:41)
> 2015-12-31 04:40:13 INFO at org.apache.spark.scheduler.Task.run(Task.scala:64)
> 2015-12-31 04:40:13 INFO at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:209)
> 2015-12-31 04:40:13 INFO at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1145)
> 2015-12-31 04:40:13 INFO at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:615)
> 2015-12-31 04:40:13 INFO at java.lang.Thread.run(Thread.java:745)
> 2015-12-31 04:40:13 INFO 
> {noformat}
> As the task log shows, this situation causes a FetchFailedException.
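> One possible direction, sketched here with hypothetical names rather than Spark's actual scheduler API, is to accept a task-finished event only when the reporting executor is still registered; otherwise the task stays pending and is rerun instead of the stage being marked complete with missing output.

```scala
// Hypothetical guard, not Spark's real API: only count a finished task if
// its executor is still alive; a result from a removed executor is lost.
object GuardedStatusUpdate {
  // Returns the updated pending-partition set and whether the result was
  // accepted.
  def handleTaskFinished(
      partition: Int,
      executorId: String,
      aliveExecutors: Set[String],
      pendingPartitions: Set[Int]): (Set[Int], Boolean) =
    if (aliveExecutors.contains(executorId))
      (pendingPartitions - partition, true)  // output location still valid
    else
      (pendingPartitions, false)             // output already dropped: rerun
}
```

> With such a check, the late event for task 229.0 from the expired executor 322 would not retire the partition, so stage 107 would rerun it rather than let stage 109 fail with a missing shuffle output.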



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

