spark-dev mailing list archives

From Matei Zaharia <matei.zaha...@gmail.com>
Subject Re: Spark shuffle consolidateFiles performance degradation numbers
Date Tue, 04 Nov 2014 02:28:09 GMT
In Spark 1.1, the sort-based shuffle (spark.shuffle.manager=sort) will have better performance
while creating fewer files. So I'd suggest trying that too.
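For reference, here is a minimal sketch (not from the original message) of how either setting could be applied via SparkConf; the app name and master URL are placeholders:

    import org.apache.spark.{SparkConf, SparkContext}

    val conf = new SparkConf()
      .setAppName("shuffle-config-example") // placeholder app name
      .setMaster("local[*]")                // placeholder master URL
      // Use the sort-based shuffle introduced in Spark 1.1.
      .set("spark.shuffle.manager", "sort")
      // Or, if staying on the hash-based shuffle, consolidate the
      // intermediate map output files to reduce the open-file count.
      .set("spark.shuffle.consolidateFiles", "true")

    val sc = new SparkContext(conf)

The same properties can also be passed on the command line via spark-submit --conf.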

Matei

> On Nov 3, 2014, at 6:12 PM, Andrew Or <andrew@databricks.com> wrote:
> 
> Hey Matt,
> 
> There's some prior work that compares consolidation performance on a
> medium-scale workload:
> http://www.cs.berkeley.edu/~kubitron/courses/cs262a-F13/projects/reports/project16_report.pdf
> 
> There we noticed about a 2x performance degradation in the reduce phase on
> ext3. I am not aware of any other concrete numbers. Maybe others have more
> experience to add.
> 
> -Andrew
> 
> 2014-11-03 17:26 GMT-08:00 Matt Cheah <mcheah@palantir.com>:
> 
>> Hi everyone,
>> 
>> I'm running into more and more cases where too many files are opened when
>> spark.shuffle.consolidateFiles is turned off.
>> 
>> I was wondering if this is a common scenario among the rest of the
>> community, and if so, whether it is worth turning the setting on by
>> default. From the documentation, it seems like performance could be hurt
>> on ext3 file systems. However, what concrete performance degradation is
>> typically seen? A 2x slowdown in the average job? 3x? Also, what causes
>> the performance degradation on ext3 file systems specifically?
>> 
>> Thanks,
>> 
>> -Matt Cheah

