spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Igor Berman (JIRA)" <>
Subject [jira] [Commented] (SPARK-23423) Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors
Date Thu, 15 Feb 2018 11:55:00 GMT


Igor Berman commented on SPARK-23423:

[~skonto], yes this is correct. Besides TASK_RUNNING reports(many) there are only 2 TASK_FAILURE
reports present(there are no FINISHED, LOST tasks)
We have cluster with approx 100 machines and for this specific framework many executors were
running (40 or so), many of them were killed during scale down(>> 2). I can see in
mesos ui that there are completed tasks with KILLED state for this application.

the question is where all other reports? Can you think of any reason they won't be reported
or driver won't register it?


> Application declines any offers when killed+active executors rich spark.dynamicAllocation.maxExecutors
> ------------------------------------------------------------------------------------------------------
>                 Key: SPARK-23423
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Mesos, Spark Core
>    Affects Versions: 2.2.1
>            Reporter: Igor Berman
>            Priority: Major
> Hi
> Mesos Version:1.1.0
> I've noticed rather strange behavior of MesosCoarseGrainedSchedulerBackend when running
on Mesos with dynamic allocation on and limiting number of max executors by spark.dynamicAllocation.maxExecutors.
> Suppose we have long running driver that has cyclic pattern of resource consumption(with
some idle times in between), due to dyn.allocation it receives offers and then releases them
after current chunk of work processed.
> Since at [] the
backend compares numExecutors < executorLimit and 
> numExecutors is defined as and slaves holds all slaves
ever "met", i.e. both active and killed (see comment [] 
> On the other hand, number of taskIds should be updated due to statusUpdate, but suppose
this update is lost(actually I don't see logs of 'is now TASK_KILLED') so this number of executors
might be wrong
> I've created test that "reproduces" this behavior, not sure how good it is:
> {code:java}
> //MesosCoarseGrainedSchedulerBackendSuite
> test("max executors registered stops to accept offers when dynamic allocation enabled")
>   setBackend(Map(
>     "spark.dynamicAllocation.maxExecutors" -> "1",
>     "spark.dynamicAllocation.enabled" -> "true",
>     "spark.dynamicAllocation.testing" -> "true"))
>   backend.doRequestTotalExecutors(1)
>   val (mem, cpu) = (backend.executorMemory(sc), 4)
>   val offer1 = createOffer("o1", "s1", mem, cpu)
>   backend.resourceOffers(driver, List(offer1).asJava)
>   verifyTaskLaunched(driver, "o1")
>   backend.doKillExecutors(List("0"))
>   verify(driver, times(1)).killTask(createTaskId("0"))
>   val offer2 = createOffer("o2", "s2", mem, cpu)
>   backend.resourceOffers(driver, List(offer2).asJava)
>   verify(driver, times(1)).declineOffer(offer2.getId)
> }{code}
> Workaround: Don't set maxExecutors with dynamicAllocation on
> Please advice
> Igor
> marking you friends since you were last to touch this piece of code and probably can
advice something([~vanzin], [~skonto], [~susanxhuynh])

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message