spark-issues mailing list archives

From "Jose Soltren (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-16554) Spark should kill executors when they are blacklisted
Date Wed, 30 Nov 2016 16:49:59 GMT

    [ https://issues.apache.org/jira/browse/SPARK-16554?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15709080#comment-15709080 ]

Jose Soltren commented on SPARK-16554:
--------------------------------------

This is a nice idea but I'm not sure how feasible it is.

If we're going to kill an executor, it is because there has already been a task failure that
makes the executor suspect. There could be a network problem, or lost or corrupt local storage.
Even if we could recover the blocks hosted on the executor, I'm not certain how trustworthy
they would be.

There would need to be some mechanism in place to determine if recovered blocks are valid,
and perhaps a maximum number of tries or a timeout for this step as well. I'm not certain
how often this would provide a benefit as compared to simply recomputing using lineage.
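To make the validation idea concrete, here is a minimal sketch in Scala of what checksum-based
block validation with a bounded number of attempts could look like. The names here
(RecoveredBlock, recoverOrGiveUp, maxAttempts) are purely illustrative, not Spark internals;
this assumes a checksum was recorded when the block was originally written:

```scala
import java.util.zip.CRC32

// Hypothetical representation of a block recovered from a suspect executor,
// carrying the checksum recorded at write time.
case class RecoveredBlock(blockId: String, bytes: Array[Byte], expectedChecksum: Long)

def checksum(bytes: Array[Byte]): Long = {
  val crc = new CRC32()
  crc.update(bytes)
  crc.getValue
}

// Returns the block's bytes if some fetch attempt yields a matching checksum,
// giving up after maxAttempts so a flaky disk or network can't stall recovery.
def recoverOrGiveUp(fetch: () => RecoveredBlock, maxAttempts: Int = 3): Option[Array[Byte]] = {
  (1 to maxAttempts).iterator
    .map(_ => fetch())
    .collectFirst { case b if checksum(b.bytes) == b.expectedChecksum => b.bytes }
  // On None, the caller would fall back to recomputing the block from lineage.
}
```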

I'll continue to think through this as I look further at the code.

> Spark should kill executors when they are blacklisted
> -----------------------------------------------------
>
>                 Key: SPARK-16554
>                 URL: https://issues.apache.org/jira/browse/SPARK-16554
>             Project: Spark
>          Issue Type: New Feature
>          Components: Scheduler
>            Reporter: Imran Rashid
>
> SPARK-8425 will allow blacklisting faulty executors and nodes. However, these blacklisted
> executors will continue to run. This is bad for a few reasons:
> (1) Even if there is faulty hardware, if the cluster is under-utilized Spark may be able
> to request another executor on a different node.
> (2) If there is a faulty disk (the most common case of faulty hardware), the cluster
> manager may be able to allocate another executor on the same node, if it can exclude the
> bad disk. (YARN will do this with its disk-health checker.)
> With dynamic allocation, this may seem less critical, as a blacklisted executor will
> stop running new tasks and eventually get reclaimed. However, if there is cached data on
> those executors, they will not get killed until {{spark.dynamicAllocation.cachedExecutorIdleTimeout}}
> expires, which is (effectively) infinite by default.
> Users may not *always* want to kill bad executors, so this must be configurable to some
> extent. At a minimum, it should be possible to enable / disable it; perhaps the executor
> should be killed after it has been blacklisted a configurable {{N}} times.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

