spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Apache Spark (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-2787) Make sort-based shuffle write files directly when there is no sorting / aggregation and # of partitions is small
Date Wed, 06 Aug 2014 03:32:12 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2787?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14087197#comment-14087197
] 

Apache Spark commented on SPARK-2787:
-------------------------------------

User 'mateiz' has created a pull request for this issue:
https://github.com/apache/spark/pull/1799

> Make sort-based shuffle write files directly when there is no sorting / aggregation and
# of partitions is small
> ----------------------------------------------------------------------------------------------------------------
>
>                 Key: SPARK-2787
>                 URL: https://issues.apache.org/jira/browse/SPARK-2787
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Matei Zaharia
>            Assignee: Matei Zaharia
>
> Right now sort-based shuffle is slower than hash-based for operations like groupByKey
where data is passed straight from the map task to the reduce task, because it keeps building
up a buffer in memory and spilling it to disk instead of directly opening more files (thus
having more GC pressure), and then it has to read back and merge all the spills (incurring
both serialization cost and GC pressure). When the number of partitions is small enough (say
less than 100 or 200), we should just open N files, write stuff to them as in hash-based shuffle,
and then concatenate them at the end to still end up with a single file (avoiding the many-files
problem of the hash-based implementation).
> It may also be possible to avoid the concatenation but that introduces complexity in
serving the files, so I'd try concatenating them for now. We can benchmark it to see the performance.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message