spark-dev mailing list archives

From Michael Allman <mich...@videoamp.com>
Subject Re: Spark 2.0.0 performance; potential large Spark core regression
Date Fri, 08 Jul 2016 15:44:19 GMT
Hi Adam,

Do you have your spark confs and your spark-env.sh somewhere where we can see them? If not,
can you make them available?

Cheers,

Michael

> On Jul 8, 2016, at 3:17 AM, Adam Roberts <aroberts@uk.ibm.com> wrote:
> 
> Hi, we've been testing the performance of Spark 2.0 compared to previous releases. Unfortunately,
> there are no Spark 2.0-compatible versions of HiBench and SparkPerf apart from those I'm working
> on (see https://github.com/databricks/spark-perf/issues/108).
> 
> With the Spark 2.0 version of SparkPerf we've noticed a 30% geomean regression at a
> very small scale factor, so we've generated a couple of profiles comparing 1.5.2 vs 2.0.0
> on the same JDK version and the same platform. We will also gather a 1.6.2 comparison and
> increase the scale factor.
> 
> Has anybody noticed a similar problem? My changes for SparkPerf and Spark 2.0 are very
> limited and AFAIK don't interfere with Spark core functionality, so any feedback on the changes
> would be much appreciated and welcome; I'd much prefer it if my own changes turned out to be the problem.
> 
> A summary for your convenience follows (this matches what I've mentioned on the SparkPerf
> issue above).
> 
> 1. spark-perf/config/config.py: SCALE_FACTOR = 0.05
> Number of workers: 1
> Executors per worker: 1
> Executor memory: 18G
> Driver memory: 8G
> Serializer: kryo
> 
> 2. $SPARK_HOME/conf/spark-defaults.conf: executor Java options: -Xdisableexplicitgc -Xcompressedrefs
> (a SparkConf-style sketch of these settings follows below)
> 
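> For reference, a minimal sketch of how those settings map onto a SparkConf (the actual runs set
> them through spark-perf's config.py and spark-defaults.conf, so this is purely illustrative):
> 
>     import org.apache.spark.SparkConf
> 
>     val conf = new SparkConf()
>       .setAppName("spark-perf")
>       .set("spark.executor.memory", "18g")
>       .set("spark.driver.memory", "8g")
>       .set("spark.serializer", "org.apache.spark.serializer.KryoSerializer")
>       // executor JVM options taken from spark-defaults.conf
>       .set("spark.executor.extraJavaOptions", "-Xdisableexplicitgc -Xcompressedrefs")
> 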
> Main changes I made for the benchmark itself (sketched in code after this list):
> Use Scala 2.11.8 and Spark 2.0.0 RC2 on our local filesystem
> MLAlgorithmTests use Vectors.fromML
> For streaming tests in HdfsRecoveryTest we use wordStream.foreachRDD rather than wordStream.foreach
> KVDataTest uses awaitTerminationOrTimeout on a StreamingContext instead of awaitTermination
> Trivial: we use compact rather than compact.render for outputting JSON
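> 
> To illustrate, the edits are roughly of this shape (a condensed, self-contained sketch of the
> Spark 2.0 APIs involved, not the literal benchmark diffs; the socket source and JSON values are
> placeholders):
> 
>     import org.apache.spark.SparkConf
>     import org.apache.spark.ml.{linalg => ml}
>     import org.apache.spark.mllib.{linalg => mllib}
>     import org.apache.spark.streaming.{Seconds, StreamingContext}
>     import org.json4s.JsonDSL._
>     import org.json4s.jackson.JsonMethods.compact
> 
>     object Spark20ApiSketch {
>       def main(args: Array[String]): Unit = {
>         // MLAlgorithmTests: convert the new ml.linalg vectors back to the
>         // mllib.linalg vectors the existing test code expects
>         val newVec: ml.Vector = ml.Vectors.dense(1.0, 2.0, 3.0)
>         val oldVec: mllib.Vector = mllib.Vectors.fromML(newVec)
>         println(oldVec)
> 
>         // Streaming tests: DStream.foreach is gone in 2.0, so register output
>         // with foreachRDD, and wait with a bounded timeout instead of forever
>         val conf = new SparkConf().setMaster("local[2]").setAppName("sketch")
>         val ssc = new StreamingContext(conf, Seconds(1))
>         val wordStream = ssc.socketTextStream("localhost", 9999).flatMap(_.split(" "))
>         wordStream.foreachRDD(rdd => println(rdd.count()))  // was wordStream.foreach(...)
>         ssc.start()
>         ssc.awaitTerminationOrTimeout(10000)                // was ssc.awaitTermination()
>         ssc.stop()
> 
>         // JSON output: compact(...) renders a JValue directly
>         println(compact(("test" -> "agg-by-key") ~ ("seconds" -> 1.01)))
>       }
>     }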
> 
> In Spark 2.0 the top five methods where we spend our time are as follows; the percentage
> is how much of the overall processing time was spent in that particular method:
> 1. AppendOnlyMap.changeValue 44%
> 2. SortShuffleWriter.write 19%
> 3. SizeTracker.estimateSize 7.5%
> 4. SizeEstimator.estimate 5.36%
> 5. Range.foreach 3.6%
> 
> and in 1.5.2 the top five methods are: 
> 1. AppendOnlyMap.changeValue 38%
> 2. ExternalSorter.insertAll 33%
> 3. Range.foreach 4%
> 4. SizeEstimator.estimate 2%
> 5. SizeEstimator.visitSingleObject 2%
> 
> I see the following scores; each line gives the test name followed by the 1.5.2 time
> and then the 2.0.0 time:
> scheduling throughput: 5.2s vs 7.08s
> agg by key: 0.72s vs 1.01s
> agg by key int: 0.93s vs 1.19s
> agg by key naive: 1.88s vs 2.02s
> sort by key: 0.64s vs 0.8s
> sort by key int: 0.59s vs 0.64s
> scala count: 0.09s vs 0.08s
> scala count w fltr: 0.31s vs 0.47s
> 
> This is only running the Spark core tests (scheduling throughput through scala-count-w-fltr,
> including everything in between); a sketch of the geomean calculation over these numbers follows
> below.
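> 
> For reference, a geomean regression of this kind is just the geometric mean of the per-test
> slowdown ratios; over the core-test times above it could be computed with plain Scala along
> these lines (times copied from the list above):
> 
>     object GeomeanSketch {
>       def main(args: Array[String]): Unit = {
>         // (test name, 1.5.2 seconds, 2.0.0 seconds)
>         val times = Seq(
>           ("scheduling throughput", 5.20, 7.08),
>           ("agg by key",            0.72, 1.01),
>           ("agg by key int",        0.93, 1.19),
>           ("agg by key naive",      1.88, 2.02),
>           ("sort by key",           0.64, 0.80),
>           ("sort by key int",       0.59, 0.64),
>           ("scala count",           0.09, 0.08),
>           ("scala count w fltr",    0.31, 0.47))
>         val ratios  = times.map { case (_, t152, t200) => t200 / t152 }
>         val geomean = math.exp(ratios.map(math.log).sum / ratios.size)
>         println(f"geomean slowdown over these core tests: $geomean%.2fx")
>       }
>     }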
> 
> Cheers, 
> 
> 
> Unless stated otherwise above:
> IBM United Kingdom Limited - Registered in England and Wales with number 741598. 
> Registered office: PO Box 41, North Harbour, Portsmouth, Hampshire PO6 3AU

