spark-issues mailing list archives

From "Li Yuanjian (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (SPARK-2926) Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
Date Tue, 14 Nov 2017 13:55:00 GMT

    [ https://issues.apache.org/jira/browse/SPARK-2926?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16251398#comment-16251398
] 

Li Yuanjian edited comment on SPARK-2926 at 11/14/17 1:54 PM:
--------------------------------------------------------------

While migrating some old Hadoop jobs to Spark, I noticed this JIRA and its code, which is
based on Spark 1.x.

I re-implemented the old PR on top of Spark 2.1 and the current master branch. After constructing
some scenarios and running benchmark tests, I found that this shuffle mode can bring a
{color:red}12x~30x improvement in task duration and reduce peak execution memory to 1/12 ~ 1/50{color}
compared with the current master version (see the detailed screenshots and test data in the
attached PDF). The memory reduction is especially notable: in this shuffle mode Spark can handle
more data with less memory. The detailed doc is attached to this JIRA as
"SortShuffleReader on Spark 2.x.pdf".

I know that the Dataset API has better optimization and performance, but the RDD API may still be
useful for flexible control and for old Spark/Hadoop jobs. Given the better performance in ordering
cases and the more cost-effective memory usage, this PR may still be worth merging into master.

I'll sort out the current code base and submit a PR soon. Any comments or trial runs would be
greatly appreciated.


was (Author: xuanyuan):
While migrating some old Hadoop jobs to Spark, I noticed this JIRA and its code, which is
based on Spark 1.x.

I re-implemented the old PR on top of Spark 2.1 and the current master branch. After constructing
some scenarios and running benchmark tests, I found that this shuffle mode can bring a
{color:red}12x~30x improvement in task duration and reduce peak execution memory to 1/12 ~ 1/50{color}
(see the detailed screenshots and test data in the attached PDF) compared with the current master
version. The memory reduction is especially notable: in this shuffle mode Spark can handle
more data with less memory. The detailed doc is attached to this JIRA as
"SortShuffleReader on Spark 2.x.pdf".

I know that the Dataset API has better optimization and performance, but the RDD API may still be
useful for flexible control and for old Spark/Hadoop jobs. Given the better performance in ordering
cases and the more cost-effective memory usage, this PR may still be worth merging into master.

I'll sort out the current code base and submit a PR soon. Any comments or trial runs would be
greatly appreciated.

> Add MR-style (merge-sort) SortShuffleReader for sort-based shuffle
> ------------------------------------------------------------------
>
>                 Key: SPARK-2926
>                 URL: https://issues.apache.org/jira/browse/SPARK-2926
>             Project: Spark
>          Issue Type: Improvement
>          Components: Shuffle
>    Affects Versions: 1.1.0
>            Reporter: Saisai Shao
>            Assignee: Saisai Shao
>         Attachments: SortBasedShuffleRead.pdf, Spark Shuffle Test Report(contd).pdf,
>                      Spark Shuffle Test Report.pdf
>
>
> Spark has already integrated sort-based shuffle write, which greatly improves IO
> performance and reduces memory consumption when the number of reducers is very large. But
> on the reducer side, it still uses the hash-based shuffle reader implementation, which
> ignores the ordering of map output data in some situations.
> Here we propose an MR-style, sort-merge-like shuffle reader for sort-based shuffle to
> further improve the performance of sort-based shuffle.
> Work-in-progress code and a performance test report will be posted later, once some unit
> test bugs are fixed.
> Any comments would be greatly appreciated. 
> Thanks a lot.
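
For readers unfamiliar with the idea in the description above, here is a minimal sketch of an
MR-style merge on the reduce side. It is plain Python, not the code from the PR or the attached
PDFs, and it does not use any Spark API. The name merge_sorted_runs and the sample data are
hypothetical. It assumes each map output ("run") arrives already sorted by key, so the reducer
can merge the runs lazily instead of collecting and re-sorting every record.

```python
import heapq
from typing import Iterable, Iterator, List, Tuple

Record = Tuple[str, int]  # (key, value); hypothetical record shape


def merge_sorted_runs(runs: List[Iterable[Record]]) -> Iterator[Record]:
    """Lazily k-way merge runs that are each already sorted by key."""
    iterators = [iter(run) for run in runs]
    heap = []
    # Seed the heap with the first record of every non-empty run.
    for idx, it in enumerate(iterators):
        first = next(it, None)
        if first is not None:
            # idx breaks ties between equal keys, so records are never compared.
            heap.append((first[0], idx, first))
    heapq.heapify(heap)
    while heap:
        _, idx, record = heapq.heappop(heap)
        yield record
        # Refill from the run that just produced the smallest record.
        following = next(iterators[idx], None)
        if following is not None:
            heapq.heappush(heap, (following[0], idx, following))


# Three hypothetical map outputs, each sorted by key.
runs = [
    [("a", 1), ("c", 3)],
    [("b", 2), ("c", 4)],
    [("a", 5)],
]
merged = list(merge_sorted_runs(runs))
```

Only one pending record per run sits in the heap at any time, which is the intuition behind the
lower peak memory of a merge-based reader. The figures reported in this thread come from the
attached benchmark documents, not from this sketch.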



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


