spark-user mailing list archives

From Yadid Ayzenberg <ya...@media.mit.edu>
Subject Re: spark performance non-linear response
Date Thu, 08 Oct 2015 11:39:38 GMT
Sending the images as normal attachments.

I'm starting to measure after the RDD has been cached. I'm running 5
measurements and averaging them (for each of the cluster sizes),
so I don't think data locality has anything to do with it.
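For reference, the measurement methodology described above (warm cache, several timed runs, averaged) can be sketched generically. This is an illustrative helper, not code from the thread; `job` stands in for whichever Spark action is being timed:

```python
import time

def average_runtime(job, runs=5):
    """Run `job` several times and return the mean wall-clock seconds.

    Assumes the data is already cached before the first timed run,
    so no run pays a one-time loading cost.
    """
    timings = []
    for _ in range(runs):
        start = time.perf_counter()
        job()  # e.g. a Spark action such as rdd.count()
        timings.append(time.perf_counter() - start)
    return sum(timings) / len(timings)
```

Averaging over runs on an already-cached RDD rules out first-touch effects, which is why cache warm-up and load-time variance can be excluded as explanations here.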






On 10/8/15 4:52 AM, Sean Owen wrote:
> (The images don't show up on this mailing list. At least, they aren't 
> loading for me. But I can guess what it looks like.)
>
> Hm, my next guess is data locality. If your data is on a fixed set of 
> nodes, and you keep adding executors, 1 per node, then you're adding 
> more nodes, right? More and more work is not local to any copy of the
> data and has to be transferred. This could start to put a floor under 
> the time the computation can possibly take, as in the limit it is all 
> copied somewhere else before starting.
>
> On Wed, Oct 7, 2015 at 11:24 PM, Yadid Ayzenberg <yadid@media.mit.edu 
> <mailto:yadid@media.mit.edu>> wrote:
>
>     Here is the distribution of the partition sizes:
>
>
>     [image: partition size distribution]
>
>     and a distribution of the executor memory sizes:
>
>     [image: executor memory distribution]
>     It seems they are pretty well balanced in terms of sizes. I'm
>     running one executor per node (utilizing 4 cores).
>
>
>
>     On 10/7/15 11:45 AM, Sean Owen wrote:
>>     OK, next question then is: if this is wall-clock time for the
>>     whole process, then, I wonder if you are just measuring the time
>>     taken by the longest single task. I'd expect the time taken by
>>     the longest straggler task to follow a distribution like this.
>>     That is, how balanced are the partitions?
>>
>>     Are you running so many executors that nodes are bottlenecking on
>>     CPU, or swapping?
>>
>>
>>     On Wed, Oct 7, 2015 at 4:42 PM, Yadid Ayzenberg
>>     <yadid@media.mit.edu <mailto:yadid@media.mit.edu>> wrote:
>>
>>         Some additional relevant information I omitted:
>>
>>         I'm running a transformation; there are no shuffles
>>         occurring, and at the end I'm performing a lookup of 4
>>         partitions on the driver.
>>
>>
>>
>>
>>         On 10/7/15 11:26 AM, Yadid Ayzenberg wrote:
>>>         Hi All,
>>>
>>>         I'm using Spark 1.4.1 to analyze a largish data set
>>>         (several gigabytes of data). The RDD is partitioned into
>>>         2048 partitions, which are more or less equal and entirely
>>>         cached in RAM.
>>>         I evaluated the performance on several cluster sizes, and am
>>>         witnessing a non-linear (power-law) performance improvement
>>>         as the cluster size increases (plot below). Each node has 4
>>>         cores and each worker is configured to use 10GB of RAM.
>>>
>>>         [image: Spark performance plot]
>>>
>>>         I would expect a more linear response given the number of
>>>         partitions and the fact that all of the data is cached.
>>>         Can anyone suggest what I should tweak in order to improve
>>>         the performance?
>>>         Or perhaps provide an explanation as to the behavior I'm
>>>         witnessing?
>>>
>>>         Yadid
>>
>>
>
>


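The sub-linear scaling described in this thread is also consistent with a fixed serial component per job (for example, driver-side task scheduling and the final 4-partition lookup on the driver). A minimal Amdahl's-law-style sketch, not from the thread and using an assumed, purely illustrative serial fraction, shows the shape of such a curve:

```python
def expected_speedup(nodes, serial_fraction=0.05):
    """Amdahl's law: speedup over 1 node, capped by the serial fraction.

    serial_fraction = 0.05 is an assumed illustrative value; in this
    thread it would correspond to per-job work that does not shrink as
    nodes are added (scheduling, the driver-side lookup, etc.).
    """
    return 1.0 / (serial_fraction + (1.0 - serial_fraction) / nodes)

# Doubling the cluster yields less than double the speedup,
# and speedup can never exceed 1 / serial_fraction:
for n in (1, 2, 4, 8, 16):
    print(n, round(expected_speedup(n), 2))
```

If the measured curve flattens toward a ceiling as nodes are added, measuring the driver-side portion of each job separately would help confirm or rule out this explanation.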