spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Hyukjin Kwon (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-19108) Broadcast all shared parts of tasks (to reduce task serialization time)
Date Tue, 21 May 2019 04:01:15 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19108?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Hyukjin Kwon updated SPARK-19108:
---------------------------------
    Labels: bulk-closed  (was: )

> Broadcast all shared parts of tasks (to reduce task serialization time)
> -----------------------------------------------------------------------
>
>                 Key: SPARK-19108
>                 URL: https://issues.apache.org/jira/browse/SPARK-19108
>             Project: Spark
>          Issue Type: Improvement
>          Components: Scheduler
>            Reporter: Kay Ousterhout
>            Priority: Major
>              Labels: bulk-closed
>
> Expand the amount of information that's broadcasted for tasks, to avoid serializing data
per-task that should only be sent to each executor once for the entire stage.
> Conceptually, this means we'd have new classes  specially for sending the minimal necessary
data to the executor, like:
> {code}
> /**
>   * metadata about the taskset needed by the executor for all tasks in this taskset.
 Subset of the
>   * full data kept on the driver to make it faster to serialize and send to executors.
>   */
> class ExecutorTaskSetMeta(
>   val stageId: Int,
>   val stageAttemptId: Int,
>   val properties: Properties,
>   val addedFiles: Map[String, String],
>   val addedJars: Map[String, String]
>   // maybe task metrics here?
> )
> class ExecutorTaskData(
>   val partitionId: Int,
>   val attemptNumber: Int,
>   val taskId: Long,
>   val taskBinary: Broadcast[Array[Byte]],
>   val taskSetMeta: Broadcast[ExecutorTaskSetMeta]
> )
> {code}
> Then all the info you'd need to send to the executors would be a serialized version of
ExecutorTaskData.  Furthermore, given the simplicity of that class, you could serialize manually,
and then for each task you could just modify the first two ints & one long directly in
the byte buffer.  (You could do the same trick for serialization even if ExecutorTaskSetMeta
was not a broadcast, but that will keep the msgs small as well.)
> There a bunch of details I'm skipping here: you'd also need to do some special handling
for the TaskMetrics; the way tasks get started in the executor would change; you'd also need
to refactor {{Task}} to let it get reconstructed from this information (or add more to ExecutorTaskSetMeta);
and probably other details I'm overlooking now.
> (this is copied from SPARK-18890 and [~imranr]'s comment there; cc [~shivaram])



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message