Date: Mon, 28 Sep 2015 03:00:06 +0000 (UTC)
From: "SuYan (JIRA)"
To: issues@spark.apache.org
Subject: [jira] [Commented] (SPARK-10796) A stage's TaskSets may all be removed while the stage still has pending partitions after losing some executors

    [ https://issues.apache.org/jira/browse/SPARK-10796?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14909990#comment-14909990 ]

SuYan commented on SPARK-10796:
-------------------------------

Stage 0 is running with TaskSet 0.0: task 0.0 finished on ExecA, task 1.0 is running on ExecB, task 2.0 is waiting.
---> Task 1 throws FetchFailedException.
---> Stage 0 is resubmitted as TaskSet 0.1 (which re-runs tasks 1 and 2); assume task 1.0 finishes on ExecA.
---> ExecA is lost, and it happens that no task throws FetchFailedException.
---> TaskSet 0.1 resubmits task 1: it re-adds it to its pendingTasks and waits for TaskSchedulerImpl to schedule it. TaskSet 0.0 also resubmits task 0 and re-adds it to its pendingTasks, but because TaskSet 0.0 is zombie, TaskSchedulerImpl skips scheduling it.
So once TaskSet 0.0 and TaskSet 0.1 both satisfy (isZombie && runningTasks.isEmpty), TaskSchedulerImpl removes both TaskSets. The DAGScheduler still has pendingPartitions because of the task lost from TaskSet 0.0, but all of the stage's TaskSets have been removed, so the stage hangs....

> A stage's TaskSets may all be removed while the stage still has pending partitions after losing some executors
> --------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-10796
>                 URL: https://issues.apache.org/jira/browse/SPARK-10796
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 1.3.0
>            Reporter: SuYan
>            Priority: Minor
>
> We hit this problem in Spark 1.3.0; I also checked the latest Spark code, and I believe the problem still exists.
> 1. A running *ShuffleMapStage* can have multiple *TaskSets*: one active TaskSet and multiple zombie TaskSets.
> 2. A running *ShuffleMapStage* succeeds only once all of its partitions have been processed successfully, i.e. each task's *MapStatus* has been added to *outputLocs*.
> 3. The *MapStatus* entries of a running *ShuffleMapStage* may be produced by zombie TaskSet 1 / zombie TaskSet 2 / ... / active TaskSet N, so some *MapStatus* entries may belong to only one TaskSet, and that TaskSet may be a zombie one.
> 4. If an executor is lost, it can happen that some of the lost executor's *MapStatus* entries were produced by a zombie TaskSet.
In the current logic, the way such a lost *MapStatus* is recovered is that each *TaskSet* re-runs the tasks that succeeded on the lost executor: it re-adds them to the *TaskSet's pendingTasks*, and their partitions are re-added to the *Stage's pendingPartitions*. But this is useless if the lost *MapStatus* belongs only to a *zombie TaskSet*: because it is zombie, its *pendingTasks* will never be scheduled.
> 5. The only condition for resubmitting a stage is that some task throws *FetchFailedException*, but the lost executor may not invalidate any *MapStatus* of the parent stage of any running stage (so no *FetchFailedException* is thrown), while one running stage happens to have lost a *MapStatus* belonging only to a *zombie TaskSet*. So once all zombie TaskSets have finished their runningTasks and the active TaskSet has finished its pendingTasks, they are all removed by *TaskSchedulerImpl*, yet that running stage's *pendingPartitions* set is still non-empty: it hangs......

--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
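The scenario above can be sketched as a toy simulation. This is a hypothetical, heavily simplified model (the names SimTaskSet, SimStage, reap, etc. are invented for illustration and loosely mirror TaskSetManager / TaskSchedulerImpl / DAGScheduler state; this is not Spark code): a zombie task set is never offered resources, so a task re-added to its pendingTasks after an executor loss can never run again, and once all task sets are reaped the stage's pending partitions can never drain.

```scala
import scala.collection.mutable

// Hypothetical simplified model of the scheduler state described in the report.
final case class SimTaskSet(name: String,
                            var isZombie: Boolean = false,
                            runningTasks: mutable.Set[Int] = mutable.Set(),
                            pendingTasks: mutable.Set[Int] = mutable.Set())

final class SimStage(numPartitions: Int) {
  val pendingPartitions: mutable.Set[Int] = mutable.Set(0 until numPartitions: _*)
  var taskSets: List[SimTaskSet] = Nil

  // TaskSchedulerImpl removes a task set once it is zombie with no running tasks.
  def reap(): Unit =
    taskSets = taskSets.filterNot(ts => ts.isZombie && ts.runningTasks.isEmpty)

  // The stage hangs if partitions are still pending but no task set remains.
  def isHung: Boolean = pendingPartitions.nonEmpty && taskSets.isEmpty
}

object ZombieHangDemo {
  def run(): SimStage = {
    val stage = new SimStage(3)
    val ts00  = SimTaskSet("TaskSet0.0")
    val ts01  = SimTaskSet("TaskSet0.1")
    stage.taskSets = List(ts00, ts01)

    // Task 0 succeeds on ExecA via TaskSet 0.0; task 1 fetch-fails,
    // so TaskSet 0.0 goes zombie and the stage is resubmitted as TaskSet 0.1.
    stage.pendingPartitions -= 0
    ts00.isZombie = true

    // In TaskSet 0.1, task 1 succeeds on ExecA; task 2 is still running elsewhere.
    stage.pendingPartitions -= 1
    ts01.runningTasks += 2

    // ExecA is lost: both task sets re-add their ExecA tasks to pendingTasks and
    // the partitions are re-added, but the zombie TaskSet 0.0 is never offered
    // resources, so its task 0 can never run again.
    stage.pendingPartitions ++= Set(0, 1)
    ts00.pendingTasks += 0 // stuck forever: zombie task sets are skipped
    ts01.pendingTasks += 1

    // TaskSet 0.1 runs tasks 1 and 2 to completion and itself goes zombie.
    stage.pendingPartitions --= Set(1, 2)
    ts01.pendingTasks.clear()
    ts01.runningTasks.clear()
    ts01.isZombie = true

    // Both task sets are zombie with no running tasks, so both are removed,
    // while partition 0 is still pending: the stage hangs.
    stage.reap()
    stage
  }

  def main(args: Array[String]): Unit = {
    val stage = run()
    println(s"hung=${stage.isHung}, pendingPartitions=${stage.pendingPartitions}")
  }
}
```

The sketch shows why re-adding a task to a zombie set's pendingTasks is not a real recovery path: nothing ever consumes those pending tasks, so the fix has to either un-zombie-proof the recovery (hand the lost partition to a schedulable task set) or trigger a stage resubmit without requiring a FetchFailedException.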