spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Li Yuanjian (JIRA)" <j...@apache.org>
Subject [jira] [Created] (SPARK-22074) Task killed by other attempt task should not be resubmitted
Date Wed, 20 Sep 2017 05:28:00 GMT
Li Yuanjian created SPARK-22074:
-----------------------------------

             Summary: Task killed by other attempt task should not be resubmitted
                 Key: SPARK-22074
                 URL: https://issues.apache.org/jira/browse/SPARK-22074
             Project: Spark
          Issue Type: Bug
          Components: Spark Core
    Affects Versions: 2.2.0, 2.1.0
            Reporter: Li Yuanjian


When a task killed by other task attempt, the task still resubmitted while its executor lost.
There is a certain probability caused the stage hanging forever because of the unnecessary
resubmit(see the scenario description below). Although the patch https://issues.apache.org/jira/browse/SPARK-13931
can resolve the hanging problem, but the unnecessary resubmit should abandon.


Detail scenario description:
1. A ShuffleMapStage has many tasks, some of them finished successfully
2. An Executor Lost happened, this will trigger a new TaskSet resubmitted, includes all missing
partitions.
3. Before the resubmitted TaskSet completed, another executor which only include the task
killed by other attempt lost, trigger the Resubmitted Event, current stage's pendingPartitions
is not empty.
4. Resubmitted TaskSet end, shuffleMapStage.isAvailable == true, but pendingPartitions is
not empty, never step into submitWaitingChildStages.

Leave the key logs of this scenario below:
{noformat}
393332:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO DAGScheduler: Submitting 120 missing
tasks from ShuffleMapStage 1046 (MapPartitionsRDD[5321] at rdd at AFDEntry.scala:116)
393333:17/09/11 13:45:24 [dag-scheduler-event-loop] INFO YarnClusterScheduler: Adding task
set 1046.0 with 120 tasks
408766:17/09/11 13:46:25 [dispatcher-event-loop-5] INFO TaskSetManager: Starting task 66.0
in stage 1046.0 (TID 110761, hidden-baidu-host.baidu.com, executor 15, partition 66, PROCESS_LOCAL,
6237 bytes)

[1] Executor 15 lost, task 66.0 and 90.0 on it

410532:17/09/11 13:46:32 [dispatcher-event-loop-47] INFO YarnSchedulerBackend$YarnDriverEndpoint:
Disabling executor 15.
410900:17/09/11 13:46:33 [dispatcher-event-loop-34] INFO TaskSetManager: Starting task 66.1
in stage 1046.0 (TID 111400, hidden-baidu-host.baidu.com, executor 70, partition 66, PROCESS_LOCAL,
6237 bytes)

[2] Task 66.0 killed by 66.1

411315:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Killing attempt 0 for
task 66.0 in stage 1046.0 (TID 110761) on hidden-baidu-host.baidu.com as the attempt 1 succeeded
on hidden-baidu-host.baidu.com
411316:17/09/11 13:46:37 [task-result-getter-2] INFO TaskSetManager: Finished task 66.1 in
stage 1046.0 (TID 111400) in 3545 ms on hidden-baidu-host.baidu.com (executor 70) (115/120)

[3] Executor 7 lost, task 0.0 72.0 7.0 on it

411390:17/09/11 13:46:37 [dispatcher-event-loop-24] INFO YarnSchedulerBackend$YarnDriverEndpoint:
Disabling executor 7.
416014:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: ShuffleMapStage 1046
(rdd at AFDEntry.scala:116) finished in 94.577 s

[4] ShuffleMapStage 1046.0 finished, missing partition trigger resubmitted 1046.1

416019:17/09/1 13:46:59 [dag-scheduler-event- oop] INFO DAGScheduler: Resubmitting ShuffleMapStage
1046 (rdd at AFDEntry.scala:116) because some of its tasks had failed: 0, 72, 79
416020:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: Submitting ShuffleMapStage
1046 (MapPartitionsRDD[5321] at rdd at AFDEntry.scala:116), which has no missing parents
416030:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO DAGScheduler: Submitting 3 missing
tasks from ShuffleMapStage 1046 (MapPartitionsRDD[5321] at rdd at AFDEntry.scala:116)
416032:17/09/11 13:46:59 [dag-scheduler-event-loop] INFO YarnClusterScheduler: Adding task
set 1046.1 with 3 tasks
416034:17/09/11 13:46:59 [dispatcher-event-loop-21] INFO TaskSetManager: Starting task 0.0
in stage 1046.1 (TID 112788, hidden-baidu-host.baidu.com, executor 37, partition 0, PROCESS_LOCAL,
6237 bytes)
416037:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: Starting task 1.0
in stage 1046.1 (TID 112789, yq01-inf-nmg01-spark03-20160817113538.yq01.baidu.com, executor
69, partition 72, PROCESS_LOCAL, 6237 bytes)
416039:17/09/11 13:46:59 [dispatcher-event-loop-23] INFO TaskSetManager: Starting task 2.0
in stage 1046.1 (TID 112790, hidden-baidu-host.baidu.com, executor 26, partition 79, PROCESS_LOCAL,
6237 bytes)

[5] ShuffleMapStage 1046.1 still running, the attempted task killed by other trigger the Resubmitted
event

416646:17/09/11 13:47:01 [dispatcher-event-loop-26] WARN TaskSetManager: Lost task 66.0 in
stage 1046.0 (TID 110761, hidden-baidu-host.baidu.com, executor 15): ExecutorLostFailure (executor
15 exited caused by one of the running tasks) Reason: Container marked as failed: container_1502719603300_158941_01_104857616
on host: hidden-baidu-host.baidu.com. Exit status: -100. Diagnostics: Container released on
a *lost* node
416647:17/09/11 13:47:01 [dag-scheduler-event-loop] INFO DAGScheduler: Resubmitted ShuffleMapTask(1046,
66), so marking it as still running
416648:17/09/11 13:47:01 [dispatcher-event-loop-26] WARN TaskSetManager: Lost task 90.0 in
stage 1046.0 (TID 110788, hidden-baidu-host.baidu.com, executor 15): ExecutorLostFailure (executor
15 exited caused by one of the running tasks) Reason: Container marked as failed: container_1502719603300_158941_01_104857616
on host: hidden-baidu-host.baidu.com. Exit status: -100. Diagnostics: Container released on
a *lost* node
416649:17/09/11 13:47:01 [dag-scheduler-event-loop] INFO DAGScheduler: Resubmitted ShuffleMapTask(1046,
90), so marking it as still running
417197:17/09/11 13:47:02 [task-result-getter-0] INFO TaskSetManager: Finished task 0.0 in
stage 1046.1 (TID 112788) in 3027 ms on hidden-baidu-host.baidu.com (executor 37) (1/3)
417206:17/09/11 13:47:02 [task-result-getter-1] INFO TaskSetManager: Finished task 1.0 in
stage 1046.1 (TID 112789) in 3106 ms on yq01-inf-nmg01-spark03-20160817113538.yq01.baidu.com
(executor 69) (2/3)
417383:17/09/11 13:47:03 [task-result-getter-0] INFO TaskSetManager: Finished task 2.0 in
stage 1046.1 (TID 112790) in 3634 ms on hidden-baidu-host.baidu.com (executor 26) (3/3)
417384:17/09/11 13:47:03 [task-result-getter-0] INFO YarnClusterScheduler: Removed TaskSet
1046.1, whose tasks have all completed, from pool

[6] Task 1046.1 success, but 1046 stage forever in running queue

417817:17/09/11 13:47:06 [task-result-getter-2] INFO YarnClusterScheduler: Removed TaskSet
1046.0, whose tasks have all completed, from pool
417872:17/09/11 13:47:06 [dag-scheduler-event-loop] INFO DAGScheduler: running: Set(ShuffleMapStage
1090, ResultStage 1069, ShuffleMapStage 1070, ShuffleMapStage 1113, ShuffleMapStage 1092,
ShuffleMapStage 1063, ShuffleMapStage 1086, ShuffleMapStage 1065, ShuffleMapStage 1109, ShuffleMapStage
1088, ShuffleMapStage 1111, ShuffleMapStage 1082, ResultStage 1104, ShuffleMapStage 1105,
ShuffleMapStage 1084, ShuffleMapStage 1076, ShuffleMapStage 1107, ShuffleMapStage 1099, ShuffleMapStage
1078, ShuffleMapStage 1101, ShuffleMapStage 1080, ShuffleMapStage 1072, ShuffleMapStage 1094,
ResultStage 1103, ShuffleMapStage 1074, ResultStage 1096, ShuffleMapStage 1067, ShuffleMapStage
1046, ShuffleMapStage 1097)
......
598792:17/09/11 13:58:55 [dag-scheduler-event-loop] INFO DAGScheduler: running: Set(ShuffleMapStage
1577, ShuffleMapStage 1548, ShuffleMapStage 1571, ShuffleMapStage 1550, ShuffleMapStage 1543,
ShuffleMapStage 1573, ShuffleMapStage 1065, ShuffleMapStage 1565, ResultStage 1545, ShuffleMapStage
1567, ShuffleMapStage 1546, ShuffleMapStage 1539, ShuffleMapStage 1569, ShuffleMapStage 1561,
ShuffleMapStage 1541, ShuffleMapStage 1533, ShuffleMapStage 1563, ResultStage 1556, ShuffleMapStage
1535, ShuffleMapStage 1557, ShuffleMapStage 1579, ShuffleMapStage 1537, ShuffleMapStage 1559,
ShuffleMapStage 1581, ShuffleMapStage 1552, ShuffleMapStage 1531, ShuffleMapStage 1575, ShuffleMapStage
1554, ShuffleMapStage 1046)
{noformat}



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message