spark-issues mailing list archives

From "Ruslan Dautkhanov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-21460) Spark dynamic allocation breaks when ListenerBus event queue runs full
Date Thu, 20 Jul 2017 18:08:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-21460?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16095131#comment-16095131 ]

Ruslan Dautkhanov commented on SPARK-21460:
-------------------------------------------

[~Dhruve Ashar], I can email the logs to you, although they are not very revealing; basically, the
problem starts at
{noformat}
ERROR [2017-05-15 10:37:53,350] ({dag-scheduler-event-loop} Logging.scala[logError]:70) -
Dropping SparkListenerEvent because no remaining room in event queue. 
This likely means one of the SparkListeners is too slow and cannot keep up with the rate at
which tasks are being started by the scheduler. 
{noformat}
and after that nothing of interest is logged.
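
For context, dynamic allocation's bookkeeping is driven by SparkListener callbacks, so a dropped event
silently corrupts its job/task counts. Below is a minimal illustrative sketch (not the actual
ExecutorAllocationManager code, just the same idea expressed with the public SparkListener API) of why
a lost JobEnd event leaves the driver believing work is still running:
{noformat}
import java.util.concurrent.atomic.AtomicInteger
import org.apache.spark.scheduler.{SparkListener, SparkListenerJobEnd, SparkListenerJobStart}

// Illustrative only: count running jobs the way an allocation policy would.
// If the JobEnd event is dropped because the queue is full, the counter never
// returns to zero and the driver keeps "seeing" active work.
class RunningJobCounter extends SparkListener {
  val running = new AtomicInteger(0)
  override def onJobStart(jobStart: SparkListenerJobStart): Unit = { running.incrementAndGet() }
  override def onJobEnd(jobEnd: SparkListenerJobEnd): Unit = { running.decrementAndGet() }
}

// Registered on the driver, e.g.:
//   spark.sparkContext.addSparkListener(new RunningJobCounter)
{noformat}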

We were hitting it constantly until the following changes were made (a config sketch follows the list):
- disable concurrentSQL
- increase spark.scheduler.listenerbus.eventqueue.size to 55000
- set spark.dynamicAllocation.maxExecutors to 210 (it was previously unset / unlimited)
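
A minimal sketch of how the two Spark properties above might be applied (property names as used on
Spark 2.1; the concurrentSQL toggle is a client-side setting rather than a standard Spark property,
so it is not shown here):
{noformat}
import org.apache.spark.SparkConf
import org.apache.spark.sql.SparkSession

val conf = new SparkConf()
  // Enlarge the ListenerBus queue so bursts of task events are less likely to be dropped.
  .set("spark.scheduler.listenerbus.eventqueue.size", "55000")
  // Cap dynamic allocation so a busy job cannot request an unlimited number of executors.
  .set("spark.dynamicAllocation.maxExecutors", "210")

val spark = SparkSession.builder().config(conf).getOrCreate()
{noformat}
The same values can of course go into spark-defaults.conf or be passed with --conf on spark-submit.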

Since then we have seen it only rarely, a few times (the changes were made back in May). It was also
happening mostly with users who used concurrentSQL heavily (submitting multiple jobs before the
previous ones completed, as in the sketch below). concurrentSQL itself is not the problem, though;
it just makes the ListenerBus event queue fill up faster. Again, we have seen the same issue a few
times even after the workaround changes above were made, including disabling concurrentSQL.
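
For illustration only (hypothetical table names, reusing the spark session from the sketch above):
running queries concurrently means each active job pushes its own stream of task start/end events
onto the single shared ListenerBus queue, so N concurrent jobs fill it roughly N times faster.
{noformat}
import scala.concurrent.{Await, Future}
import scala.concurrent.ExecutionContext.Implicits.global
import scala.concurrent.duration.Duration

// Two queries submitted before the previous one completes, as concurrentSQL allows.
val q1 = Future { spark.sql("SELECT count(*) FROM table_a").collect() }
val q2 = Future { spark.sql("SELECT count(*) FROM table_b").collect() }
Await.result(Future.sequence(Seq(q1, q2)), Duration.Inf)
{noformat}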

> Spark dynamic allocation breaks when ListenerBus event queue runs full
> ----------------------------------------------------------------------
>
>                 Key: SPARK-21460
>                 URL: https://issues.apache.org/jira/browse/SPARK-21460
>             Project: Spark
>          Issue Type: Bug
>          Components: Scheduler, YARN
>    Affects Versions: 2.0.0, 2.0.2, 2.1.0, 2.1.1, 2.2.0
>         Environment: Spark 2.1 
> Hadoop 2.6
>            Reporter: Ruslan Dautkhanov
>            Priority: Critical
>              Labels: dynamic_allocation, performance, scheduler, yarn
>
> When the ListenerBus event queue runs full, Spark dynamic allocation stops working: Spark
> fails to shrink the number of executors when there are no active jobs (the Spark driver "thinks"
> there are still active jobs because it never captured when they finished).
> P.S. What is worse, it also makes Spark flood the YARN RM with reservation requests, so YARN
> preemption does not function properly either (we are on Spark 2.1 / Hadoop 2.6).



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

