spark-reviews mailing list archives

From marmbrus <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-10735] [SQL] Generate aggregation w/o g...
Date Tue, 19 Jan 2016 19:07:19 GMT
Github user marmbrus commented on a diff in the pull request:

    https://github.com/apache/spark/pull/10786#discussion_r50157926
  
    --- Diff: sql/core/src/test/scala/org/apache/spark/sql/execution/BenchmarkWholeStageCodegen.scala ---
    @@ -46,15 +47,64 @@ class BenchmarkWholeStageCodegen extends SparkFunSuite {
     
         /*
           Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    -      Single Int Column Scan:      Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    -      -------------------------------------------------------------------------
    -      Without whole stage codegen       6725.52            31.18         1.00 X
    -      With whole stage codegen          2233.05            93.91         3.01 X
    +      Single Int Column Scan:            Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    +      -------------------------------------------------------------------------------
    +      Without whole stage codegen             6585.36            31.85         1.00 X
    +      With whole stage codegen                 343.80           609.99        19.15 X
    +    */
    +    benchmark.run()
    +  }
    +
    +  def testImperitaveAggregation(values: Int): Unit = {
    +
    +    val benchmark = new Benchmark("aggregation", values)
    +
    +    benchmark.addCase("ImpAgg w/o whole stage codegen") { iter =>
    +      sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
    +      sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
    +    }
    +
    +    benchmark.addCase("DeclAgg w/o whole stage codegen") { iter =>
    +      sqlContext.setConf("spark.sql.codegen.wholeStage", "false")
    +      sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
    +    }
    +
    +    benchmark.addCase("ImpAgg w whole stage codegen") { iter =>
    +      sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
    +      sqlContext.range(values).groupBy().agg("id" -> "stddev").collect()
    +    }
    +
    +    benchmark.addCase("DeclAgg w whole stage codegen") { iter =>
    +      sqlContext.setConf("spark.sql.codegen.wholeStage", "true")
    +      sqlContext.range(values).groupBy().agg("id" -> "stddev1").collect()
    +    }
    +
    +    /*
    +      Before optimizing CentralMomentAgg and generated mutable projection:
    +
    +      Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    +      aggregation:                       Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    +      -------------------------------------------------------------------------------
    +      ImpAgg w/o whole stage codegen          9047.35            11.59         1.00 X
    +      DeclAgg w/o whole stage codegen         6507.27            16.11         1.39 X
    +      ImpAgg w whole stage codegen            6947.30            15.09         1.30 X
    +      DeclAgg w whole stage codegen           1376.74            76.16         6.57 X
    +
    +      After optimization:
    +
    +      Intel(R) Core(TM) i7-4558U CPU @ 2.80GHz
    +      aggregation:                       Avg Time(ms)    Avg Rate(M/s)  Relative Rate
    +      -------------------------------------------------------------------------------
    +      ImpAgg w/o whole stage codegen          6159.03            17.03         1.00 X
    +      DeclAgg w/o whole stage codegen         5248.69            19.98         1.17 X
    +      ImpAgg w whole stage codegen            4202.30            24.95         1.47 X
    +      DeclAgg w whole stage codegen           1367.34            76.69         4.50 X
    --- End diff --
    
    I've always preferred declarative aggregates because they're much easier to optimize (very
cool, the kind of speedups you are getting!). As such, I'd support having all of our built-in
functions done this way.
    
    @mengxr argues that the declarative style is too confusing for users and that we should
also support the imperative one. How high is the cost of that for us?
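
    For readers less familiar with the distinction, here is a minimal toy sketch of why a
declarative aggregate tends to be easier to optimize. It is not Spark's actual
DeclarativeAggregate or ImperativeAggregate API: every name below (Expr, DeclAgg, ImpAgg,
AggSketch, gen) is hypothetical and chosen only for illustration. The idea is that a
declarative aggregate describes its buffer updates as an expression tree that a planner can
inspect and turn into source code, while an imperative aggregate hides the update inside an
opaque method.

```scala
// Toy model only: these types are hypothetical, not Spark classes.

// A tiny expression tree that an optimizer or code generator can inspect.
sealed trait Expr
case class Lit(v: Double) extends Expr
case object Input extends Expr            // the current input value
case class Buf(i: Int) extends Expr       // slot i of the aggregation buffer
case class Add(l: Expr, r: Expr) extends Expr
case class Mul(l: Expr, r: Expr) extends Expr

object Expr {
  // Interpret an expression against the current buffer and input value.
  def eval(e: Expr, buf: Array[Double], in: Double): Double = e match {
    case Lit(v)    => v
    case Input     => in
    case Buf(i)    => buf(i)
    case Add(l, r) => eval(l, buf, in) + eval(r, buf, in)
    case Mul(l, r) => eval(l, buf, in) * eval(r, buf, in)
  }

  // Because the update is data, it can also be rendered as source text,
  // which is the starting point for fusing it into generated code.
  def gen(e: Expr): String = e match {
    case Lit(v)    => v.toString
    case Input     => "in"
    case Buf(i)    => s"buf[$i]"
    case Add(l, r) => s"(${gen(l)} + ${gen(r)})"
    case Mul(l, r) => s"(${gen(l)} * ${gen(r)})"
  }
}

// Declarative style: one update expression per buffer slot.
case class DeclAgg(initial: Seq[Double], update: Seq[Expr])

// Imperative style: the update is an opaque method the planner cannot inspect.
trait ImpAgg {
  def initial: Array[Double]
  def update(buf: Array[Double], in: Double): Unit
}

object AggSketch {
  // Buffer layout in both versions: count, sum, sum of squares.
  val declMoments: DeclAgg = DeclAgg(
    initial = Seq(0.0, 0.0, 0.0),
    update = Seq(
      Add(Buf(0), Lit(1.0)),
      Add(Buf(1), Input),
      Add(Buf(2), Mul(Input, Input))))

  val impMoments: ImpAgg = new ImpAgg {
    def initial: Array[Double] = Array(0.0, 0.0, 0.0)
    def update(buf: Array[Double], in: Double): Unit = {
      buf(0) += 1.0
      buf(1) += in
      buf(2) += in * in
    }
  }

  def runDecl(agg: DeclAgg, xs: Seq[Double]): Array[Double] =
    xs.foldLeft(agg.initial.toArray) { (buf, x) =>
      // Every slot is computed from the previous buffer, then swapped in.
      agg.update.map(Expr.eval(_, buf, x)).toArray
    }

  def runImp(agg: ImpAgg, xs: Seq[Double]): Array[Double] = {
    val buf = agg.initial
    xs.foreach(agg.update(buf, _))
    buf
  }

  def main(args: Array[String]): Unit = {
    val xs = Seq(1.0, 2.0, 3.0)
    println(runDecl(declMoments, xs).mkString(", "))
    println(runImp(impMoments, xs).mkString(", "))
    // Only the declarative form can be printed as code.
    declMoments.update.map(Expr.gen).foreach(println)
  }
}
```

    Both versions compute the same buffer, but only the declarative one exposes its update
logic in a form that could be rewritten or compiled together with the surrounding operators.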

