spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Wenchen Fan (JIRA)" <j...@apache.org>
Subject [jira] [Reopened] (SPARK-19355) Use map output statistices to improve global limit's parallelism
Date Thu, 20 Sep 2018 12:21:00 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19355?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Wenchen Fan reopened SPARK-19355:
---------------------------------

> Use map output statistices to improve global limit's parallelism
> ----------------------------------------------------------------
>
>                 Key: SPARK-19355
>                 URL: https://issues.apache.org/jira/browse/SPARK-19355
>             Project: Spark
>          Issue Type: Improvement
>          Components: SQL
>            Reporter: Liang-Chi Hsieh
>            Assignee: Liang-Chi Hsieh
>            Priority: Major
>             Fix For: 2.4.0
>
>
> A logical Limit is performed actually by two physical operations LocalLimit and GlobalLimit.
> In most of time, before GlobalLimit, we will perform a shuffle exchange to shuffle data
to single partition. When the limit number is very big, we shuffle a lot of data to a single
partition and significantly reduce parallelism, except for the cost of shuffling.
> This change tries to perform GlobalLimit without shuffling data to single partition.
Instead, we perform the map stage of the shuffling and collect the statistics of the number
of rows in each partition. Shuffled data are actually all retrieved locally without from remote
executors.
> Once we get the number of output rows in each partition, we only take the required number
of rows from the locally shuffled data.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message