spark-issues mailing list archives

From "Wenchen Fan (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-21219) Task retry occurs on same executor due to race condition with blacklisting
Date Wed, 12 Jul 2017 06:51:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-21219?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Wenchen Fan updated SPARK-21219:
--------------------------------
    Fix Version/s: 2.2.1

> Task retry occurs on same executor due to race condition with blacklisting
> --------------------------------------------------------------------------
>
>                 Key: SPARK-21219
>                 URL: https://issues.apache.org/jira/browse/SPARK-21219
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler
>    Affects Versions: 2.1.1
>            Reporter: Eric Vandenberg
>            Assignee: Eric Vandenberg
>            Priority: Minor
>             Fix For: 2.2.1, 2.3.0
>
>         Attachments: spark_driver.log.anon, spark_executor.log.anon
>
>
> When a task fails, it is (1) added back to the pending task list and then (2) the
> corresponding blacklist policy is enforced (i.e., recording whether it can run on a
> particular node/executor/etc.). Unfortunately, with this ordering the retry can be
> assigned to the same executor, which, incidentally, may be shutting down and will
> immediately fail the retry. Instead, the order should be reversed: (1) update the
> blacklist state first, then (2) assign the task, ensuring the blacklist policy is
> properly enforced.
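> The ordering issue can be sketched with a toy scheduler model (all names here are
> illustrative, not Spark's actual TaskSetManager API):

```python
# Toy model of the race described above. ToyScheduler, record_failure and
# offer are hypothetical names for illustration only; they do not mirror
# Spark's real scheduler internals.

class ToyScheduler:
    def __init__(self):
        self.pending = []        # tasks waiting to be (re)assigned
        self.blacklist = {}      # task -> set of executors it must avoid

    def record_failure(self, task, executor):
        # Fixed ordering: enforce the blacklist state first...
        self.blacklist.setdefault(task, set()).add(executor)
        # ...then make the task schedulable again.
        self.pending.append(task)

    def offer(self, executor):
        # A resource offer: hand out the first pending task that this
        # executor is not blacklisted for.
        for task in self.pending:
            if executor not in self.blacklist.get(task, set()):
                self.pending.remove(task)
                return task
        return None

# Buggy ordering: the task becomes pending before the blacklist is
# updated, so an offer arriving in that window re-assigns the retry to
# the very executor that just failed it.
buggy = ToyScheduler()
buggy.pending.append("task-55")          # step (1) of the buggy order
retry_on = buggy.offer("executor-1")     # offer races in before step (2)
print(retry_on)                          # "task-55" lands on the bad executor

# Fixed ordering: the blacklist is enforced before the task is pending,
# so the same offer is refused.
fixed = ToyScheduler()
fixed.record_failure("task-55", "executor-1")
retry_blocked = fixed.offer("executor-1")
print(retry_blocked)                     # None: retry waits for another executor
```

> With the fixed ordering the retry simply waits until a non-blacklisted executor
> makes an offer, which is the behavior the issue asks for.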
> The attached logs demonstrate the race condition.
> See spark_executor.log.anon:
> 1. Task 55.2 fails on the executor
> 17/06/20 13:25:07 ERROR Executor: Exception in task 55.2 in stage 5.0 (TID 39575)
> java.lang.OutOfMemoryError: Java heap space
> 2. Immediately the same executor is assigned the retry task:
> 17/06/20 13:25:07 INFO CoarseGrainedExecutorBackend: Got assigned task 39651
> 17/06/20 13:25:07 INFO Executor: Running task 55.3 in stage 5.0 (TID 39651)
> 3. The retry task of course fails, since the executor is also shutting down due to the
> original task 55.2 OOM failure.
> See the spark_driver.log.anon:
> The driver processes the lost task 55.2:
> 17/06/20 13:25:07 WARN TaskSetManager: Lost task 55.2 in stage 5.0 (TID 39575, foobar####.masked-server.com,
> executor attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
> java.lang.OutOfMemoryError: Java heap space
> The driver then receives the ExecutorLostFailure for the retry task 55.3 (although the
> server info is obfuscated in these logs, it is the same executor):
> 17/06/20 13:25:10 WARN TaskSetManager: Lost task 55.3 in stage 5.0 (TID 39651, foobar####.masked-server.com,
> executor attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0):
> ExecutorLostFailure (executor attempt_foobar####.masked-server.com-####_####_####_####.masked-server.com-####_####_####_####_0
> exited caused by one of the running tasks) Reason: Remote RPC client disassociated. Likely
> due to containers exceeding thresholds, or network issues. Check driver logs for WARN messages.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

