spark-user mailing list archives

From Yadid Ayzenberg <ya...@media.mit.edu>
Subject spark performance non-linear response
Date Wed, 07 Oct 2015 15:26:24 GMT
Hi All,

I'm using Spark 1.4.1 to analyze a largish data set (several gigabytes 
of data). The RDD is split into 2048 partitions, which are more or 
less equal in size and entirely cached in RAM.
I evaluated the performance on several cluster sizes, and I am seeing 
a non-linear (power-law) performance improvement as the cluster size 
increases (plot below). Each node has 4 cores and each worker is 
configured to use 10GB of RAM.

[Plot: Spark performance]

I would expect a more linear response given the number of partitions and 
the fact that all of the data is cached.
Can anyone suggest what I should tweak in order to improve the performance?
Or perhaps provide an explanation for the behavior I'm witnessing?
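
To make the linear expectation concrete, here is a minimal sketch of the arithmetic behind it. It uses only the figures from the setup above (2048 partitions, 4 cores per node). Everything else is an assumption made for illustration: one task per core at a time, equal task durations, no scheduling or shuffle overhead, and a hypothetical list of cluster sizes. The function names are made up and it does not reproduce any measured runtimes.

```python
import math

PARTITIONS = 2048      # from the setup described above
CORES_PER_NODE = 4     # from the setup described above


def task_waves(nodes, partitions=PARTITIONS, cores_per_node=CORES_PER_NODE):
    """Sequential rounds of tasks needed, assuming one task per core at a time."""
    slots = nodes * cores_per_node
    return math.ceil(partitions / slots)


def ideal_speedup(nodes, baseline_nodes=1):
    """Speedup over the baseline if runtime were proportional to task waves."""
    return task_waves(baseline_nodes) / task_waves(nodes)


if __name__ == "__main__":
    # Hypothetical cluster sizes, chosen only for illustration.
    for nodes in (1, 2, 4, 8, 16):
        print(nodes, task_waves(nodes), ideal_speedup(nodes))
```

Under these assumptions the ideal speedup tracks the node count as long as the partitions divide evenly across the available cores, which is why I expected something close to a straight line.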

Yadid
