spark-issues mailing list archives

From "Cheolsoo Park (JIRA)" <>
Subject [jira] [Updated] (SPARK-6954) Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative
Date Thu, 16 Apr 2015 02:33:58 GMT


Cheolsoo Park updated SPARK-6954:
    Affects Version/s:     (was: 1.3.0)

Hi [~sandyr], Thank you for the question.

I am actually running 1.3.1-RC3, and I just confirmed that SPARK-6325 is in the commit log
of my release branch.

I updated the affects version to 1.3.1 to avoid confusion.

> Dynamic allocation: numExecutorsPending in ExecutorAllocationManager should never become negative
> -------------------------------------------------------------------------------------------------
>                 Key: SPARK-6954
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: YARN
>    Affects Versions: 1.3.1
>            Reporter: Cheolsoo Park
>            Priority: Minor
>              Labels: yarn
> I have a simple test case for dynamic allocation on YARN that fails with the following stack trace:
> {code}
> 15/04/16 00:52:14 ERROR Utils: Uncaught exception in thread spark-dynamic-executor-allocation-0
> java.lang.IllegalArgumentException: Attempted to request a negative number of executor(s) -21 from the cluster manager. Please specify a positive number!
> 	at org.apache.spark.scheduler.cluster.CoarseGrainedSchedulerBackend.requestTotalExecutors(CoarseGrainedSchedulerBackend.scala:338)
> 	at org.apache.spark.SparkContext.requestTotalExecutors(SparkContext.scala:1137)
> 	at org.apache.spark.ExecutorAllocationManager.addExecutors(ExecutorAllocationManager.scala:294)
> 	at org.apache.spark.ExecutorAllocationManager.addOrCancelExecutorRequests(ExecutorAllocationManager.scala:263)
> 	at$apache$spark$ExecutorAllocationManager$$schedule(ExecutorAllocationManager.scala:230)
> 	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply$mcV$sp(ExecutorAllocationManager.scala:189)
> 	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
> 	at org.apache.spark.ExecutorAllocationManager$$anon$1$$anonfun$run$1.apply(ExecutorAllocationManager.scala:189)
> 	at org.apache.spark.util.Utils$.logUncaughtExceptions(Utils.scala:1618)
> 	at org.apache.spark.ExecutorAllocationManager$$anon$
> 	at java.util.concurrent.Executors$
> 	at java.util.concurrent.FutureTask.runAndReset(
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$301(
> 	at java.util.concurrent.ScheduledThreadPoolExecutor$
> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(
> 	at java.util.concurrent.ThreadPoolExecutor$
> 	at
> {code}
> My test is as follows:
> # Start spark-shell with a single executor.
> # Run a {{select count(\*)}} query. The number of executors rises because the input size is non-trivial.
> # After the job finishes, the number of executors falls as most of them become idle.
> # Rerun the same query, and the request to add executors fails with the above error. The job itself continues to run with whatever executors it already has, but it never gets more executors unless the shell is closed and restarted.
> This error only happens when I configure {{executorIdleTimeout}} to be very small. For example, I can reproduce it with the following configs:
> {code}
> spark.dynamicAllocation.executorIdleTimeout     5
> spark.dynamicAllocation.schedulerBacklogTimeout 5
> {code}
> Although I can simply increase {{executorIdleTimeout}} to something like 60 secs to avoid the error, I think this is still a bug that should be fixed.
> The root cause seems to be that {{numExecutorsPending}} accidentally becomes negative if executors are killed too aggressively (i.e. {{executorIdleTimeout}} is too small), because in that case the new target number of executors can be smaller than the current number of executors. When that happens, {{ExecutorAllocationManager}} ends up trying to add a negative number of executors, which throws an exception.
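
The arithmetic behind the suspected root cause can be illustrated with a minimal sketch. This is a hypothetical model, not the actual ExecutorAllocationManager code: the object name, the function names, and the counts (22 running executors, a target of 1) are assumptions chosen only so that the difference matches the -21 in the error message. It shows how an unclamped "target minus running" difference goes negative, and how clamping at zero would keep the request valid.

```scala
// Hypothetical model of the bookkeeping described in the report.
// None of these names come from Spark's source; they are illustrative only.
object PendingExecutorsSketch {

  // Unclamped difference: goes negative when the target is below the running count.
  def pendingUnclamped(targetTotal: Int, running: Int): Int =
    targetTotal - running

  // Clamped variant: never reports fewer than zero pending executors.
  def pendingClamped(targetTotal: Int, running: Int): Int =
    math.max(targetTotal - running, 0)

  // Stand-in for a cluster manager request that rejects negative counts.
  // `require` throws IllegalArgumentException when the condition is false.
  def requestExecutors(n: Int): Int = {
    require(n >= 0, s"Attempted to request a negative number of executor(s) $n")
    n
  }

  def main(args: Array[String]): Unit = {
    val running = 22 // assumed: executors still registered after the first job
    val target = 1   // assumed: new target computed after idle executors were killed

    println(pendingUnclamped(target, running)) // -21
    println(pendingClamped(target, running))   // 0

    // The clamped value is accepted; the unclamped one would be rejected.
    println(requestExecutors(pendingClamped(target, running)))
    println(scala.util.Try(requestExecutors(pendingUnclamped(target, running))).isFailure)
  }
}
```

Whether clamping at zero is the right fix for Spark itself is a separate question; the sketch only shows why the negative value arises and why the request is rejected.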
