spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wenchen Fan (JIRA)" <>
Subject [jira] [Updated] (SPARK-15690) Fast single-node (single-process) in-memory shuffle
Date Tue, 11 Sep 2018 14:35:00 GMT


Wenchen Fan updated SPARK-15690:
    Target Version/s:   (was: 2.4.0)

> Fast single-node (single-process) in-memory shuffle
> ---------------------------------------------------
>                 Key: SPARK-15690
>                 URL:
>             Project: Spark
>          Issue Type: New Feature
>          Components: Shuffle, SQL
>            Reporter: Reynold Xin
>            Priority: Major
> Spark's current shuffle implementation sorts all intermediate data by their partition
id, and then write the data to disk. This is not a big bottleneck because the network throughput
on commodity clusters tend to be low. However, an increasing number of Spark users are using
the system to process data on a single-node. When in a single node operating against intermediate
data that fits in memory, the existing shuffle code path can become a big bottleneck.
> The goal of this ticket is to change Spark so it can use in-memory radix sort to do data
shuffling on a single node, and still gracefully fallback to disk if the data size does not
fit in memory. Given the number of partitions is usually small (say less than 256), it'd require
only a single pass do to the radix sort with pretty decent CPU efficiency.
> Note that there have been many in-memory shuffle attempts in the past. This ticket has
a smaller scope (single-process), and aims to actually productionize this code.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message