spark-issues mailing list archives

From "Imran Rashid (JIRA)" <>
Subject [jira] [Commented] (SPARK-23485) Kubernetes should support node blacklist
Date Fri, 23 Feb 2018 21:59:01 GMT


Imran Rashid commented on SPARK-23485:

I think this is because the general expectation is that a failure on a given node will just
cause new executors to spin up on different nodes and eventually the application will succeed.

I think this is the part which may be particularly different in Spark.  Some types of failures
do not cause the executor to die -- it's just a task failure, and the executor itself is still
alive.  As long as Spark gets heartbeats from the executor, it figures it's still fine.  But
a bad disk can cause *tasks* to repeatedly fail.  The same could be true for other resources,
e.g. a bad GPU, where the GPU is only used by certain tasks.

When that happens, without Spark's internal blacklisting, an application will very quickly
hit many task failures.  The task fails, Spark notices that, tries to find a place to assign
the failed task, puts it back in the same place; repeat until Spark decides there are too many
failures and gives up.  It can easily cause your app to fail in ~1 second.  There is no communication
with the cluster manager through this process; it's all just between Spark's driver and
executors.  In one case, when this happened, YARN's own health checker discovered the problem
a few minutes after it occurred -- but the Spark app had already failed by that point.  From
one bad disk in a cluster with > 1000 disks.
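To make the failure mode above concrete, here is a minimal sketch (not actual Spark code; all names are made up) of how a scheduler that always re-offers the same free slot burns through its failure budget on one bad node:

```scala
// Hypothetical simulation: a task that always fails on a bad node is retried up
// to maxFailures times; without a blacklist the scheduler keeps handing it back
// to the same free slot, so the whole job aborts almost instantly.
object RetryLoopSketch {
  case class Node(name: String, healthy: Boolean)

  // On a node with a bad disk, the task fails every single attempt.
  def runTask(node: Node): Boolean = node.healthy

  // The scheduler picks the first free slot -- which stays the same bad node.
  def schedule(nodes: Seq[Node], maxFailures: Int): Either[String, String] = {
    var failures = 0
    while (failures < maxFailures) {
      val slot = nodes.head                       // same node offered each time
      if (runTask(slot)) return Right(s"succeeded on ${slot.name}")
      failures += 1
    }
    Left(s"aborted after $failures failures, all on the same node")
  }

  def main(args: Array[String]): Unit = {
    val cluster = Seq(Node("bad-disk-node", healthy = false),
                      Node("ok-node", healthy = true))
    println(schedule(cluster, maxFailures = 4))
    // -> Left(aborted after 4 failures, all on the same node)
  }
}
```

Note there is no cluster-manager round-trip anywhere in that loop, which is why an external health checker never gets a chance to intervene before the app dies.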

Spark's blacklisting is really meant to be complementary to the type of node health checks
you are talking about in Kubernetes.  The blacklisting in Spark intentionally does not try
to figure out the root cause of the problem, as we don't want to get into the game of enumerating
all of the possibilities.  It's a heuristic which makes it safe for Spark to keep going in
the face of these uncaught errors, but then retries the resources when it would be safe to do
so.  (Discussed in more detail in the design doc on SPARK-8425.)
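For reference, the heuristic is driven by a handful of settings that shipped with Spark 2.3; a sketch of a spark-defaults.conf enabling it (defaults shown here are my understanding, so double-check against the docs):

```
# Turn on Spark's internal blacklisting (off by default in 2.3).
spark.blacklist.enabled                          true
# A task may fail this many times on one executor before that executor is
# blacklisted for the task:
spark.blacklist.task.maxTaskAttemptsPerExecutor  1
# ...and this many times on one node before the node is blacklisted for it:
spark.blacklist.task.maxTaskAttemptsPerNode      2
# After this long, a blacklisted node/executor is retried, in case the
# problem was transient:
spark.blacklist.timeout                          1h
```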

anti-affinity in Kubernetes may be just the trick, though this part of the doc was a little concerning:

Note: Inter-pod affinity and anti-affinity require substantial amount of processing which
can slow down scheduling in large clusters significantly. We do not recommend using them in
clusters larger than several hundred nodes.

Blacklisting is *most* important in large clusters.  It seems like it's able to do something
much more complicated than a simple node blacklist, though -- maybe it would already be fast
enough with such a simple anti-affinity rule?
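To illustrate what such a simple rule could look like: if the Kubernetes backend consulted scheduler.nodeBlacklist(), it could stamp each executor pod with a node-affinity term excluding the blacklisted hostnames.  The field names below are the standard Pod API; the node names and image are made up:

```yaml
apiVersion: v1
kind: Pod
metadata:
  name: spark-executor-example
spec:
  affinity:
    nodeAffinity:
      requiredDuringSchedulingIgnoredDuringExecution:
        nodeSelectorTerms:
        - matchExpressions:
          - key: kubernetes.io/hostname
            operator: NotIn
            values:           # the driver's current blacklist (hypothetical)
            - bad-node-1
            - bad-node-2
  containers:
  - name: executor
    image: spark-executor:2.3.0
```

This avoids the expensive inter-pod affinity machinery the docs warn about, since it only matches node labels rather than other pods.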

> Kubernetes should support node blacklist
> ----------------------------------------
>                 Key: SPARK-23485
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Kubernetes, Scheduler
>    Affects Versions: 2.3.0
>            Reporter: Imran Rashid
>            Priority: Major
> Spark's BlacklistTracker maintains a list of "bad nodes" which it will not use for running
> tasks (e.g., because of bad hardware).  When running in yarn, this blacklist is used to avoid
> ever allocating resources on blacklisted nodes:
> I'm just beginning to poke around the kubernetes code, so apologies if this is incorrect
> -- but I didn't see any references to {{scheduler.nodeBlacklist()}} in {{KubernetesClusterSchedulerBackend}},
> so it seems this is missing.  Thought of this while looking at SPARK-19755, a similar issue
> on mesos.

This message was sent by Atlassian JIRA

