spark-issues mailing list archives

From "Reynold Xin (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-16230) Executors self-killing after being assigned tasks while still in init
Date Wed, 20 Jul 2016 02:32:21 GMT

     [ https://issues.apache.org/jira/browse/SPARK-16230?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Reynold Xin updated SPARK-16230:
--------------------------------
    Fix Version/s:     (was: 2.0.1)
                       (was: 2.1.0)
                   2.0.0

> Executors self-killing after being assigned tasks while still in init
> ---------------------------------------------------------------------
>
>                 Key: SPARK-16230
>                 URL: https://issues.apache.org/jira/browse/SPARK-16230
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>            Reporter: Tejas Patil
>            Assignee: Tejas Patil
>            Priority: Minor
>             Fix For: 2.0.0
>
>
> I see this happening frequently in our prod clusters:
> * EXECUTOR: [CoarseGrainedExecutorBackend|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L61] sends a request to register itself with the driver.
> * DRIVER: Registers the executor and [replies|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L179].
> * EXECUTOR: The ExecutorBackend receives the ACK and [starts creating an Executor|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L81].
> * DRIVER: Tries to launch a task since it knows there is a new executor, and sends a [LaunchTask|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/scheduler/cluster/CoarseGrainedSchedulerBackend.scala#L268] to this new executor.
> * EXECUTOR: The Executor is not yet init'ed (one reason I have seen is that it was still trying to register with the local external shuffle service). Meanwhile, it receives a `LaunchTask` and [kills itself|https://github.com/apache/spark/blob/master/core/src/main/scala/org/apache/spark/executor/CoarseGrainedExecutorBackend.scala#L90] because the Executor is not init'ed.
> The driver assumes that the Executor is ready to accept tasks as soon as it is registered, but that's not true.
> How this affects jobs / cluster:
> * We waste time and resources on these executors, but they don't do any meaningful computation.
> * The driver thinks that the executor has started running the task, but since the Executor has killed itself, it never tells the driver (BTW: this is another issue which I think could be fixed separately). The driver waits for 10 mins and then declares the executor dead. This adds to the latency of the job. Plus, the failure-attempt count for the tasks also gets bumped up even though the tasks were never started. For unlucky tasks, this might cause job failure.
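
For illustration, here is a minimal, self-contained Scala sketch of the race described in the quoted steps above. It is not actual Spark source; ExecutorInitRaceSketch, onRegisteredExecutor, and onLaunchTask are hypothetical names standing in for the CoarseGrainedExecutorBackend handlers linked above, and the timing value is purely illustrative.

object ExecutorInitRaceSketch {

  // Stand-in for the backend's `executor` field, which is set only once init completes.
  @volatile private var executor: AnyRef = null

  // Executor side: handle the driver's registration ACK. Init can block for a while,
  // e.g. while registering with the local external shuffle service.
  def onRegisteredExecutor(initMillis: Long): Unit = {
    Thread.sleep(initMillis)   // simulated slow init
    executor = new Object      // init finished; tasks can now be accepted
  }

  // Executor side: handle a LaunchTask sent by the driver.
  def onLaunchTask(taskId: Long): Unit = {
    if (executor == null) {
      // The self-kill path described above: the task arrived before init finished.
      println(s"LaunchTask($taskId) arrived before init finished -> executor self-kills")
    } else {
      println(s"LaunchTask($taskId) runs normally")
    }
  }

  def main(args: Array[String]): Unit = {
    // Driver side: as soon as the executor is registered, a task is scheduled for it,
    // even though the executor-side init may still be in flight.
    val init = new Thread(new Runnable {
      def run(): Unit = onRegisteredExecutor(initMillis = 200)
    })
    init.start()
    onLaunchTask(0L)   // arrives while init is still running, so it hits the self-kill path
    init.join()
  }
}

The point the sketch makes explicit is that registration and init completion are two separate events on the executor side, while the driver treats a registered executor as immediately ready to run tasks.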



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

