spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ruslan Dautkhanov (JIRA)" <>
Subject [jira] [Commented] (SPARK-18838) High latency of event processing for large jobs
Date Thu, 18 May 2017 23:39:04 GMT


Ruslan Dautkhanov commented on SPARK-18838:

my 2 cents. Would be nice to explore idea of storing offsets of consumed messages in listeners
themselves, very much like Kafka consumers.
(based on my limited knowledge of spark event queue listeners, assuming each listeners don't
depend on each other and can read from the queue asynchronously) - so then if one of the "non-critical"
listeners can't keep up, messages will be lost just for that one listener, and it wouldn't
affect rest of listeners.
{quote}Alternatively, we could use two queues, one for internal listeners and another for
external ones{quote}Making a parallel with Kafka again, looks like we're talking here about
two "topics" 

> High latency of event processing for large jobs
> -----------------------------------------------
>                 Key: SPARK-18838
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>    Affects Versions: 2.0.0
>            Reporter: Sital Kedia
> Currently we are observing the issue of very high event processing delay in driver's
`ListenerBus` for large jobs with many tasks. Many critical component of the scheduler like
`ExecutorAllocationManager`, `HeartbeatReceiver` depend on the `ListenerBus` events and this
delay might hurt the job performance significantly or even fail the job.  For example, a significant
delay in receiving the `SparkListenerTaskStart` might cause `ExecutorAllocationManager` manager
to mistakenly remove an executor which is not idle.  
> The problem is that the event processor in `ListenerBus` is a single thread which loops
through all the Listeners for each event and processes each event synchronously
This single threaded processor often becomes the bottleneck for large jobs.  Also, if one
of the Listener is very slow, all the listeners will pay the price of delay incurred by the
slow listener. In addition to that a slow listener can cause events to be dropped from the
event queue which might be fatal to the job.
> To solve the above problems, we propose to get rid of the event queue and the single
threaded event processor. Instead each listener will have its own dedicate single threaded
executor service . When ever an event is posted, it will be submitted to executor service
of all the listeners. The Single threaded executor service will guarantee in order processing
of the events per listener.  The queue used for the executor service will be bounded to guarantee
we do not grow the memory indefinitely. The downside of this approach is separate event queue
per listener will increase the driver memory footprint. 

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message