spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (JIRA)" <>
Subject [jira] [Assigned] (SPARK-23496) Locality of coalesced partitions can be severely skewed by the order of input partitions
Date Fri, 23 Feb 2018 14:47:03 GMT


Apache Spark reassigned SPARK-23496:

    Assignee: Apache Spark

> Locality of coalesced partitions can be severely skewed by the order of input partitions
> ----------------------------------------------------------------------------------------
>                 Key: SPARK-23496
>                 URL:
>             Project: Spark
>          Issue Type: Bug
>          Components: Spark Core
>    Affects Versions: 3.0.0
>            Reporter: Ala Luszczak
>            Assignee: Apache Spark
>            Priority: Major
> Example:
> Consider RDD "R" with 100 partitions, half of which have locality preference "hostA"
and half have "hostB".
>  * Assume odd-numbered input partitions of R prefer "hostA" and even-numbered prefer
"hostB". Then R.coalesce(50) will have 25 partitions with preference "hostA" and 25 with
"hostB" (even distribution).
>  * Assume partitions with index 0-49 of R prefer "hostA" and partitions with index 50-99
prefer "hostB". Then R.coalesce(50) will have 49 partitions with "hostA" and 1 with "hostB"
(extremely skewed distribution).
> The algorithm in {{DefaultPartitionCoalescer.setupGroups}} is responsible for picking
preferred locations for coalesced partitions. It analyzes the preferred locations of input
partitions. It starts by trying to create one partition for each unique location in the input.
However, if the the requested number of coalesced partitions is higher that the number of
unique locations, it has to pick duplicate locations.
> Currently, the duplicate locations are picked by iterating over the input partitions
in order, and copying their preferred locations to coalesced partitions. If the input partitions are
clustered by location, this can result in severe skew.
> Instead of iterating over the list of input partitions in order, we should pick them
at random.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail:
For additional commands, e-mail:

View raw message