spark-issues mailing list archives

From "Josh Rosen (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (SPARK-3280) Made sort-based shuffle the default implementation
Date Thu, 28 Aug 2014 18:16:09 GMT

    [ https://issues.apache.org/jira/browse/SPARK-3280?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14114092#comment-14114092 ]

Josh Rosen commented on SPARK-3280:
-----------------------------------

Here are some numbers from August 10.  If I recall correctly, this ran on 8 m3.8xlarge nodes.
This test linearly scales several parameters (data set size, number of mappers and reducers,
etc.).  You can see that hash-based shuffle's performance degrades severely when there are
many mappers and reducers, while sort-based shuffle scales much more gracefully:

!http://i.imgur.com/rODzaG1.png!

!http://i.imgur.com/72kCkH5.png!

This was run with spark-perf; here's a sample config for one of the bars:

{code}
Java options: -Dspark.storage.memoryFraction=0.66 -Dspark.serializer=org.apache.spark.serializer.JavaSerializer
-Dspark.locality.wait=60000000 -Dspark.shuffle.manager=org.apache.spark.shuffle.hash.HashShuffleManager
Options: aggregate-by-key-naive --num-trials=10 --inter-trial-wait=3 --num-partitions=400
--reduce-tasks=400 --random-seed=5 --persistent-type=memory  --num-records=200000000 --unique-keys=20000
--key-length=10 --unique-values=1000000 --value-length=10  --storage-location=hdfs://:9000/spark-perf-kv-data
{code}
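
To make the comparison concrete, here is a minimal sketch of how the two option strings above could be generated for either shuffle manager at a given scale factor. The helper names (`java_options`, `test_options`) are hypothetical and not part of spark-perf. Treating the sample values as scale 1, and scaling only these three flags, are assumptions; the comment only says that data set size and the numbers of mappers and reducers scale linearly.

```python
# Hypothetical helper, not part of spark-perf: rebuilds the option strings
# from the sample config for a chosen shuffle manager and scale factor.

SHUFFLE_MANAGERS = {
    "hash": "org.apache.spark.shuffle.hash.HashShuffleManager",
    "sort": "org.apache.spark.shuffle.sort.SortShuffleManager",
}

# Values copied from the sample config; using them as the scale-1 baseline
# is an assumption, since the sample is just "one of the bars".
BASE = {"num-partitions": 400, "reduce-tasks": 400, "num-records": 200000000}


def java_options(manager):
    """Return the -D flags from the sample config for the chosen manager."""
    return " ".join([
        "-Dspark.storage.memoryFraction=0.66",
        "-Dspark.serializer=org.apache.spark.serializer.JavaSerializer",
        "-Dspark.locality.wait=60000000",
        "-Dspark.shuffle.manager=" + SHUFFLE_MANAGERS[manager],
    ])


def test_options(scale=1):
    """Scale mappers, reducers, and record count linearly by `scale`."""
    flags = ["--{}={}".format(name, value * scale) for name, value in BASE.items()]
    return " ".join(["aggregate-by-key-naive"] + flags)
```

For example, `java_options("sort")` yields the same JVM flags with the sort-based manager swapped in, and `test_options(2)` doubles the three scaled values. The remaining flags from the sample (`--num-trials`, `--unique-keys`, and so on) are omitted here for brevity and would be appended unchanged.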

I'll try to run a better set of tests today.  I plan to look at a few cases that these tests
didn't address, including the performance impact when running on spinning disks, as well as
jobs where we have a large dataset with few mappers and reducers (I think this is the case
that we'd expect to be most favorable to hash-based shuffle).

> Made sort-based shuffle the default implementation
> --------------------------------------------------
>
>                 Key: SPARK-3280
>                 URL: https://issues.apache.org/jira/browse/SPARK-3280
>             Project: Spark
>          Issue Type: Improvement
>            Reporter: Reynold Xin
>            Assignee: Reynold Xin
>
> sort-based shuffle has lower memory usage and seems to outperform hash-based in almost
> all of our testing.



--
This message was sent by Atlassian JIRA
(v6.2#6252)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org

