spark-issues mailing list archives

From "SuYan (JIRA)" <>
Subject [jira] [Commented] (SPARK-10796) The Stage taskSets may are all removed while stage still have pending partitions after having lost some executors
Date Mon, 28 Sep 2015 03:00:06 GMT


SuYan commented on SPARK-10796:

Running Stage 0 with TaskSet 0.0: Task 0.0 finished on ExecA, Task 1.0 running on ExecB,
Task 2.0 waiting.
---> Task 1.0 throws FetchFailedException, so TaskSet 0.0 becomes a zombie.
---> Stage 0 is resubmitted as TaskSet 0.1 (which re-runs Task 1 and Task 2); assume
Task 1.0 finishes on ExecA.
---> ExecA is lost, and it happens that no task throws a FetchFailedException.
---> TaskSet 0.1 resubmits Task 1, re-adds it to its pendingTasks, and waits for TaskSchedulerImpl.
     TaskSet 0.0 also resubmits Task 0 and re-adds it to its pendingTasks, but because TaskSet 0.0
is a zombie, TaskSchedulerImpl skips scheduling it.

So once TaskSet 0.0 and TaskSet 0.1 both satisfy (isZombie && runningTasks.isEmpty),
TaskSchedulerImpl removes those TaskSets.
DAGScheduler still has pendingPartitions because of the task lost from TaskSet 0.0, but the
stage's TaskSets have all been removed, so the job hangs....
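The hang in the trace above can be reproduced with a toy model of the scheduler state. This is a simplified sketch, not Spark's real classes: TaskSet, pending, and stage_hangs are stand-ins for TaskSetManager, pendingTasks, and the combined TaskSchedulerImpl/DAGScheduler bookkeeping.

```python
# Toy model of the zombie-TaskSet hang (simplified; not Spark's actual classes).

class TaskSet:
    def __init__(self, name, tasks, zombie=False):
        self.name = name
        self.zombie = zombie
        self.pending = set(tasks)   # tasks waiting to be offered resources
        self.running = set()

    def schedulable(self):
        # TaskSchedulerImpl never offers resources to a zombie TaskSet.
        return not self.zombie and bool(self.pending)

def stage_hangs(tasksets, stage_pending_partitions):
    # A TaskSet is removed once it is a zombie with no running tasks.
    live = [ts for ts in tasksets if not (ts.zombie and not ts.running)]
    # The stage hangs if partitions are still pending but no surviving
    # TaskSet can ever run them.
    return bool(stage_pending_partitions) and not any(ts.schedulable() for ts in live)

# Scenario from the trace above:
ts00 = TaskSet("0.0", tasks=[], zombie=True)   # zombie after the FetchFailedException
ts01 = TaskSet("0.1", tasks=[])                # active attempt; its tasks finished
# ExecA is lost: the MapStatus of task 0 (owned only by zombie 0.0) and of
# task 1 (owned by 0.1) both vanish, and each TaskSet re-adds its own task.
ts00.pending.add(0)   # re-added, but 0.0 is a zombie and is never scheduled
ts01.pending.add(1)   # 0.1 will re-run task 1
pending_partitions = {0, 1}

# Task 1 eventually re-finishes in 0.1; partition 0 never does.
ts01.pending.discard(1)
pending_partitions.discard(1)
print(stage_hangs([ts00, ts01], pending_partitions))  # -> True: partition 0 is orphaned
```

In this model the hang follows directly: the only owner of partition 0's re-added task is a zombie, so no resource offer will ever reach it, yet the stage's pending set is non-empty forever.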

> The Stage taskSets may are all removed while stage still have pending partitions after
having lost some executors
> -----------------------------------------------------------------------------------------------------------------
>                 Key: SPARK-10796
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.3.0
>            Reporter: SuYan
>            Priority: Minor
> We hit this problem in Spark 1.3.0, and I also checked the latest Spark code; I think
the problem still exists.
> 1. A running *ShuffleMapStage* can have multiple *TaskSet*s: one active TaskSet and
multiple zombie TaskSets.
> 2. A running *ShuffleMapStage* succeeds only once all of its partitions have been processed
successfully, i.e. each task's *MapStatus* has been added to *outputLocs*.
> 3. The *MapStatus*es of a running *ShuffleMapStage* may be produced by Zombie TaskSet 1 / Zombie TaskSet 2
/ .... / the active TaskSet N, so some *MapStatus*es may belong to only one TaskSet, and that
TaskSet may be a zombie.
> 4. If an executor is lost, it can happen that some of the lost executor's *MapStatus*es
were produced by a zombie TaskSet. In the current logic, a lost *MapStatus* is recovered by
having each *TaskSet* re-run the tasks that succeeded on the lost executor: those tasks are re-added
to the *TaskSet's pendingTasks*, and their partitions are re-added to the *Stage's pendingPartitions*.
But this is useless when a lost *MapStatus* belongs only to a *zombie TaskSet*: because it is a zombie,
its *pendingTasks* will never be scheduled.
> 5. A stage is resubmitted only when some task throws a *FetchFailedException*, but the
lost executor may not hold any *MapStatus* of the parent stage of any running stage (so no
*FetchFailedException* is ever thrown), while a running stage happens to lose a *MapStatus* that belongs only to a *zombie TaskSet*.
So once every zombie TaskSet has finished its runningTasks and the active TaskSet has processed
all of its pendingTasks, *TaskSchedulerImpl* removes them all, yet that running stage's *pending
partitions* set is still non-empty, and the job hangs......

This message was sent by Atlassian JIRA

