spark-issues mailing list archives

From "Sean Owen (JIRA)" <>
Subject [jira] [Resolved] (SPARK-20564) a lot of executor failures when the executor number is more than 2000
Date Mon, 31 Jul 2017 09:13:00 GMT


Sean Owen resolved SPARK-20564.
    Resolution: Won't Fix

> a lot of executor failures when the executor number is more than 2000
> ---------------------------------------------------------------------
>                 Key: SPARK-20564
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Deploy
>    Affects Versions: 1.6.2, 2.1.0
>            Reporter: Hua Liu
>            Priority: Minor
> When we used more than 2000 executors in a Spark application, we noticed that a large
> number of executors could not connect to the driver and were therefore marked as failed.
> In some cases, the number of failed executors reached twice the requested executor count,
> so applications retried and could eventually fail.
> This happens because YarnAllocator requests all missing containers every
> spark.yarn.scheduler.heartbeat.interval-ms (default 3 seconds). For example, YarnAllocator
> can ask for and receive over 2000 containers in one request and then launch them almost
> simultaneously. These thousands of executors all try to retrieve Spark properties and
> register with the driver within seconds. However, the driver handles executor registration,
> stop, removal, and Spark property retrieval in a single thread, and it cannot handle such
> a large number of RPCs within a short period of time. As a result, some executors cannot
> retrieve Spark properties and/or register. These executors are then marked as failed,
> which triggers executor removal and further overloads the driver, leading to more
> executor failures.
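
The overload described above can be illustrated with a toy queueing model. This is only a sketch, not Spark code: it assumes all executors send one request at the same instant, that the driver serves requests one at a time in arrival order, and that an executor gives up once its wait exceeds a timeout. The function name and the per-request service time and timeout values are hypothetical and are not taken from the issue.

```python
def count_timed_out(num_executors: int, service_time_ms: int, timeout_ms: int) -> int:
    """Count requests that miss their deadline on a single-threaded server.

    Assumption (illustrative only): all requests arrive at t = 0 and are
    served one by one, so the i-th request (1-based) completes at
    i * service_time_ms. It is in time if that is <= timeout_ms.
    """
    if service_time_ms <= 0:
        raise ValueError("service_time_ms must be positive")
    # Largest i with i * service_time_ms <= timeout_ms.
    served_in_time = min(num_executors, timeout_ms // service_time_ms)
    return num_executors - served_in_time


if __name__ == "__main__":
    # Hypothetical numbers: 50 ms per request, 60 s timeout.
    for n in (1000, 2000, 4000):
        print(n, "executors ->", count_timed_out(n, 50, 60_000), "time out")
```

Under these assumed numbers the single thread can finish only a fixed number of requests before the timeout, so every executor beyond that count fails no matter how many are launched at once. Retries from failed executors, which the description mentions, would add further load that this sketch does not model.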

> This patch adds a configuration, spark.yarn.launchContainer.count.simultaneously, which
> caps the maximum number of containers that the driver can request in each
> spark.yarn.scheduler.heartbeat.interval-ms. As a result, the number of executors grows
> steadily, the number of executor failures is reduced, and applications can reach the
> desired number of executors faster.
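
The effect of such a cap on ramp-up can be sketched as follows. This is a simplified illustration, not the patch itself and not YarnAllocator's actual logic: it assumes every requested container is granted and running before the next heartbeat, and the function names are hypothetical. The cap parameter stands in for the proposed spark.yarn.launchContainer.count.simultaneously setting.

```python
from typing import List, Optional


def containers_to_request(target: int, running: int, cap: Optional[int] = None) -> int:
    """Number of containers to ask for in one heartbeat.

    Without a cap, all missing containers are requested at once;
    with a cap, at most `cap` containers are requested.
    """
    missing = max(0, target - running)
    return missing if cap is None else min(missing, cap)


def ramp_up(target: int, cap: Optional[int] = None) -> List[int]:
    """Request size per heartbeat until `target` executors are running.

    Assumption (illustrative only): every request is fully granted and
    the containers are running before the next heartbeat.
    """
    if cap is not None and cap <= 0:
        raise ValueError("cap must be positive")
    running = 0
    requests = []
    while running < target:
        n = containers_to_request(target, running, cap)
        requests.append(n)
        running += n
    return requests


if __name__ == "__main__":
    print(ramp_up(2000))           # uncapped: one burst
    print(ramp_up(2000, cap=600))  # capped: several smaller batches
```

With the cap, the same target is reached over several heartbeats instead of one burst, which is the steady growth the description refers to. How large the cap should be relative to what the driver can absorb per heartbeat is not stated in the issue.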
