spark-issues mailing list archives

From "Thomas Graves (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-22148) TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
Date Thu, 14 Jun 2018 16:14:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-22148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16512691#comment-16512691 ]

Thomas Graves commented on SPARK-22148:
---------------------------------------

OK, just give an update here if you start working on it. Thanks.

> TaskSetManager.abortIfCompletelyBlacklisted should not abort when all current executors are blacklisted but dynamic allocation is enabled
> -----------------------------------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-22148
>                 URL: https://issues.apache.org/jira/browse/SPARK-22148
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, Spark Core
>    Affects Versions: 2.2.0
>            Reporter: Juan Rodríguez Hortalá
>            Priority: Major
>         Attachments: SPARK-22148_WIP.diff
>
>
> Currently TaskSetManager.abortIfCompletelyBlacklisted aborts the TaskSet and the whole Spark
> job with `task X (partition Y) cannot run anywhere due to node and executor blacklist.
> Blacklisting behavior can be configured via spark.blacklist.*.` when all the available
> executors are blacklisted for a pending task or TaskSet. This makes sense for static
> allocation, where the set of executors is fixed for the duration of the application, but it
> can lead to unnecessary job failures when dynamic allocation is enabled. For example, in a
> Spark application that runs a single job at a time, when a node fails at the end of a stage
> attempt, all other executors will complete their tasks, but the tasks that were running on
> the executors of the failed node remain pending. Spark keeps waiting for those tasks for
> 2 minutes by default (spark.network.timeout) until the heartbeat timeout is triggered, and
> then blacklists those executors for that stage. By that time the other executors will
> already have been released after being idle for 1 minute by default
> (spark.dynamicAllocation.executorIdleTimeout), because the next stage hasn't started yet and
> so there are no more tasks available (assuming the default spark.speculation = false). So
> the job fails because the only executors still available are blacklisted for that stage.
> An alternative is to request more executors from the cluster manager in this situation.
> The request could be retried a configurable number of times, with a configurable wait
> between attempts, so that if the cluster manager still fails to provide a suitable executor
> the job is aborted as in the previous case.
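
The scenario above comes down to two timeouts racing each other. Below is a minimal sketch,
in Scala against the public SparkConf API, of the settings involved, using the default values
quoted in the description; the application name is invented for illustration:

    import org.apache.spark.SparkConf

    // Settings that interact in the scenario described above. The values shown
    // are the defaults mentioned in the issue description.
    val conf = new SparkConf()
      .setAppName("blacklist-vs-dynamic-allocation")              // hypothetical name
      .set("spark.dynamicAllocation.enabled", "true")
      // Idle executors are released after 1 minute by default...
      .set("spark.dynamicAllocation.executorIdleTimeout", "60s")
      // ...while tasks on a lost node are only given up on after the 2 minute
      // network timeout, so the healthy-but-idle executors are already gone.
      .set("spark.network.timeout", "120s")
      // Blacklisting must be enabled for abortIfCompletelyBlacklisted to abort.
      .set("spark.blacklist.enabled", "true")
      // Default: no speculative copies exist to keep the idle executors busy.
      .set("spark.speculation", "false")

With these defaults the idle executors disappear a full minute before the hung tasks can be
rescheduled, which is the window in which the abort happens.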
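
The attached SPARK-22148_WIP.diff is the actual proposal; what follows is only a rough,
hypothetical sketch of the retry-before-abort idea from the last paragraph above. The helper
names (requestExtraExecutors, hasSchedulableExecutor) and the default retry count and wait
are invented for illustration and are not taken from the attachment:

    import scala.concurrent.duration._

    // Hypothetical sketch: before aborting a completely blacklisted TaskSet, ask the
    // cluster manager for a fresh executor and re-check a configurable number of
    // times. Returns true if a schedulable (non-blacklisted) executor showed up.
    def waitForUsableExecutor(
        requestExtraExecutors: Int => Boolean,  // e.g. delegate to the dynamic allocation client
        hasSchedulableExecutor: () => Boolean,  // true once a non-blacklisted executor is registered
        maxRetries: Int = 3,                    // invented default
        retryWait: FiniteDuration = 30.seconds  // invented default
    ): Boolean = {
      (1 to maxRetries).exists { _ =>
        requestExtraExecutors(1)                // ask the cluster manager for one more executor
        Thread.sleep(retryWait.toMillis)        // give it time to register
        hasSchedulableExecutor()
      }
    }

If this returns false, abortIfCompletelyBlacklisted would abort the TaskSet exactly as it does
today. In a real scheduler the wait would have to be asynchronous rather than a sleep on the
scheduling path; the sketch only shows the retry-then-give-up shape of the proposal.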



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

