spark-reviews mailing list archives

From a-roberts <...@git.apache.org>
Subject [GitHub] spark pull request #16196: [SPARK-18231] Optimise SizeEstimator implementati...
Date Wed, 07 Dec 2016 10:42:49 GMT
GitHub user a-roberts opened a pull request:

    https://github.com/apache/spark/pull/16196

    [SPARK-18231] Optimise SizeEstimator implementation

    ## What changes were proposed in this pull request?
    
    Several performance improvements to SizeEstimator. Most of the benefit comes from removing
contention when multiple threads estimate sizes at the same time. There can also be a small boost
in uncontended scenarios from the removal of the synchronisation code, but the cost of that
synchronisation when not truly contended is low. On the PageRank workload for HiBench we see
10-15% performance improvements (measured as average elapsed time) with both IBM's SDK for
Java and OpenJDK 8. I don't see any changes other than noise for the other workloads in this
benchmark.
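
To illustrate the contended versus uncontended distinction described above, here is a minimal
Java sketch. It is not the code from this patch: FieldCountCache, lockedLookup, concurrentLookup
and countFields are hypothetical names, and the per-class field count only stands in for whatever
per-class information an estimator might cache. It contrasts a cache guarded by a single monitor
with one backed by ConcurrentHashMap, where reads of already-cached entries take no shared lock.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

// Hypothetical illustration only; not taken from Spark or from this patch.
class FieldCountCache {
    // Variant 1: every lookup, hit or miss, serialises on the class monitor.
    private static final Map<Class<?>, Integer> lockedCache = new HashMap<>();

    static synchronized int lockedLookup(Class<?> cls) {
        return lockedCache.computeIfAbsent(cls, FieldCountCache::countFields);
    }

    // Variant 2: cache hits go through ConcurrentHashMap.get, which does not block.
    private static final ConcurrentHashMap<Class<?>, Integer> concurrentCache =
        new ConcurrentHashMap<>();

    static int concurrentLookup(Class<?> cls) {
        Integer cached = concurrentCache.get(cls);
        if (cached != null) {
            return cached;
        }
        return concurrentCache.computeIfAbsent(cls, FieldCountCache::countFields);
    }

    // Stand-in for an expensive per-class computation: count the declared
    // fields of the class and all of its superclasses.
    static int countFields(Class<?> cls) {
        int n = 0;
        for (Class<?> c = cls; c != null; c = c.getSuperclass()) {
            n += c.getDeclaredFields().length;
        }
        return n;
    }

    public static void main(String[] args) throws Exception {
        List<Class<?>> classes = Arrays.asList(String.class, Integer.class, Thread.class);
        ExecutorService pool = Executors.newFixedThreadPool(4);
        List<Future<Boolean>> results = new ArrayList<>();
        for (int t = 0; t < 4; t++) {
            results.add(pool.submit(() -> {
                boolean same = true;
                for (int i = 0; i < 10_000; i++) {
                    Class<?> cls = classes.get(i % classes.size());
                    same &= lockedLookup(cls) == concurrentLookup(cls);
                }
                return same;
            }));
        }
        // Both variants should agree on every lookup from every thread.
        for (Future<Boolean> f : results) {
            System.out.println(f.get());
        }
        pool.shutdown();
    }
}
```

Both variants return the same values; they differ only in how many threads can be inside a lookup
at once, which is where a difference would be expected to show up under contention.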
    
    ## How was this patch tested?
    
    Tested with the existing unit tests, but there are problems to resolve.
    
    I see SizeEstimatorSuite and SizeTrackerSuite now failing with at least IBM Java, because
smaller sizes are reported than the tests expect (let's see what happens with OpenJDK on
the community runs).
    
    In SizeTrackerSuite I think the failures are caused by using ThreadLocalRandom instead of
Random, because with Random these tests pass again. I'm not sure how robust SizeTrackerSuite
is, though.
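
One difference between the two classes that could matter here, sketched in Java: java.util.Random
can be constructed with a fixed seed and then produces a reproducible sequence, while
ThreadLocalRandom rejects setSeed with UnsupportedOperationException. Whether the failing tests
actually depend on a reproducible sequence is an assumption, not something confirmed above.
SeedDemo, sampleIndices and threadLocalRandomAcceptsSeed are hypothetical names used only for
this illustration.

```java
import java.util.Arrays;
import java.util.Random;
import java.util.concurrent.ThreadLocalRandom;

// Hypothetical illustration only; not taken from Spark or from this patch.
class SeedDemo {
    // Draw `count` pseudo-random indices in [0, bound) from the given generator.
    static int[] sampleIndices(Random rand, int count, int bound) {
        int[] out = new int[count];
        for (int i = 0; i < count; i++) {
            out[i] = rand.nextInt(bound);
        }
        return out;
    }

    // ThreadLocalRandom does not support seeding; setSeed throws.
    static boolean threadLocalRandomAcceptsSeed() {
        try {
            ThreadLocalRandom.current().setSeed(42L);
            return true;
        } catch (UnsupportedOperationException e) {
            return false;
        }
    }

    public static void main(String[] args) {
        // Two generators with the same seed yield the same sequence.
        int[] a = sampleIndices(new Random(42L), 5, 1000);
        int[] b = sampleIndices(new Random(42L), 5, 1000);
        System.out.println(Arrays.equals(a, b));            // prints true
        System.out.println(threadLocalRandomAcceptsSeed()); // prints false
    }
}
```

If a test's expected values were derived from a fixed-seed sequence, swapping in an unseedable
generator would change which elements get sampled from run to run.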
    
    For performance testing I've used HiBench, large profile, with one executor ranging from
10g to 25g, experimenting with fixed and dynamic heaps. The Spark code I've based my results
on is from December the 1st (master branch, so 2.1.0 snapshot).
    
    More details on the optimisations (this being phase one and JDK agnostic) at www.spark.tc/improvements-to-the-sizeestimator-class

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/a-roberts/spark patch-12

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/16196.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #16196
    
----
commit 50af8fc224cb5acb19a7b55d31ee92b44c96f26f
Author: Adam Roberts <aroberts@uk.ibm.com>
Date:   2016-12-07T10:32:37Z

    [SPARK-18231] Optimise SizeEstimator implementation
    
    Several performance improvements to SizeEstimator. Most of the benefit comes from removing
contention when multiple threads estimate sizes at the same time. There can also be a small boost
in uncontended scenarios from the removal of the synchronisation code, but the cost of that
synchronisation when not truly contended is low. On the PageRank workload for HiBench we see
~49 second durations reduced to ~41 second durations. I don't see any changes for other workloads.
Observed with both IBM's SDK for Java and OpenJDK.

----



