spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <>
Subject [jira] [Updated] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)
Date Tue, 21 May 2019 04:01:15 GMT


Hyukjin Kwon updated SPARK-19108:
    Labels: bulk-closed  (was: )

> Broadcast all shared parts of tasks (to reduce task serialization time)
> -----------------------------------------------------------------------
>                 Key: SPARK-19108
>                 URL:
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Kay Ousterhout
>            Priority: Major
>              Labels: bulk-closed
> Expand the amount of information that's broadcasted for tasks, to avoid serializing data
per-task that should only be sent to each executor once for the entire stage.
> Conceptually, this means we'd have new classes  specially for sending the minimal necessary
data to the executor, like:
> {code}
> /**
>   * metadata about the taskset needed by the executor for all tasks in this taskset.
 Subset of the
>   * full data kept on the driver to make it faster to serialize and send to executors.
>   */
> class ExecutorTaskSetMeta(
>   val stageId: Int,
>   val stageAttemptId: Int,
>   val properties: Properties,
>   val addedFiles: Map[String, String],
>   val addedJars: Map[String, String]
>   // maybe task metrics here?
> )
> class ExecutorTaskData(
>   val partitionId: Int,
>   val attemptNumber: Int,
>   val taskId: Long,
>   val taskBinary: Broadcast[Array[Byte]],
>   val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
> )
> {code}
> Then all the info you'd need to send to the executors would be a serialized version of
ExecutorTaskData.  Furthermore, given the simplicity of that class, you could serialize manually,
and then for each task you could just modify the first two ints & one long directly in
the byte buffer.  (You could do the same trick for serialization even if ExecutorTaskSetMeta
was not a broadcast, but that will keep the msgs small as well.)
> There a bunch of details I'm skipping here: you'd also need to do some special handling
for the TaskMetrics; the way tasks get started in the executor would change; you'd also need
to refactor {{Task}} to let it get reconstructed from this information (or add more to ExecutorTaskSetMeta);
and probably other details I'm overlooking now.
> (this is copied from SPARK-18890 and [~imranr]'s comment there; cc [~shivaram])

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message