spark-reviews mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From JoshRosen <...@git.apache.org>
Subject [GitHub] spark pull request: [SPARK-10988] [SQL] Reduce duplication in Aggr...
Date Wed, 07 Oct 2015 23:48:09 GMT
GitHub user JoshRosen opened a pull request:

    https://github.com/apache/spark/pull/9015

    [SPARK-10988] [SQL] Reduce duplication in Aggregate2's expression rewriting logic

    In `aggregate/utils.scala`, there is a substantial amount of duplication in the expression-rewriting
logic. As a prerequisite to supporting imperative aggregate functions in `TungstenAggregate`,
this patch refactors this file so that the same expression-rewriting logic is used for both
`SortAggregate` and `TungstenAggregate`.
    
    In order to allow both operators to use the same rewriting logic, `TungstenAggregationIterator.
generateResultProjection()` has been updated so that it first evaluates all declarative aggregate
functions' `evaluateExpression`s and writes the results into a temporary buffer, and then
uses this temporary buffer and the grouping expressions to evaluate the final resultExpressions.
This matches the logic in SortAggregateIterator, where this two-pass approach is necessary
in order to support imperative aggregates. If this change turns out to cause performance regressions,
then we can look into re-implementing the single-pass evaluation in a cleaner way as part
of a followup patch.
    
    Since the rewriting logic is now shared across both operators, this patch also extracts
that logic and places it in `SparkStrategies`. This makes the rewriting logic a bit easier
to follow, I think.

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/JoshRosen/spark SPARK-10988

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/spark/pull/9015.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #9015
    
----
commit ee84e946a82d6de4eb49d54fb3be534a18285dc7
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2015-10-07T21:28:26Z

    Unify result expression rewriting for both SortAggregate and TungstenAggregate.

commit af035d210b6596f814e9a2abbcf45c75d6403cea
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2015-10-07T21:39:53Z

    Simplify type of aggregateFunctionMap.

commit d5b7318760e5b340abb8e4cb748a7c11925c411b
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2015-10-07T21:41:09Z

    Rename aggreagteFunctionMap to aggregateFunctionToAttribute

commit a888351e21b15ee2439e11c9f654b7280532b7c6
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2015-10-07T21:45:04Z

    Expand the comment on only evaluating each agg. func. once.

commit 4bac641fae8600d30aed044752fc4b4fa81be696
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2015-10-07T21:46:23Z

    transform -> transformDown

commit 7cb82b5891ee042817b4bf6bbda22d5e3e9a4dd5
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2015-10-07T21:54:06Z

    Remove now-unused rewrittenAggregateFunctions.

commit fb0daa1b749620908edabc925ac493739b1dce64
Author: Josh Rosen <joshrosen@databricks.com>
Date:   2015-10-07T22:24:35Z

    Extract common rewriting logic.

----


---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org


Mime
View raw message