spark-reviews mailing list archives

From viirya <...@git.apache.org>
Subject [GitHub] spark issue #14452: [SPARK-16849][SQL] Improve subquery execution by dedupli...
Date Thu, 01 Sep 2016 02:39:29 GMT
Github user viirya commented on the issue:

    https://github.com/apache/spark/pull/14452
  
    @hvanhovell Let me try to explain this with an example.
    
        WITH cte AS (SELECT * FROM src) SELECT * FROM cte a JOIN cte b
    
    In the query above, the common subquery `cte` would otherwise be executed twice. We find such
common subqueries and wrap the executed plan of each one in a `CommonSubquery` node. Common subqueries
that produce the same results share the same executed plan and the same variable holding the
computed results.
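
    A minimal sketch of the deduplication idea. The comment does not show how the patch matches subqueries, so this assumes structural equality on toy plan nodes. `ToyScan`, `ToyProject`, `SharedWrapper` and `DedupSketch` are hypothetical names, not classes from Spark or from this patch.

```scala
import scala.collection.mutable

// Toy logical plan nodes (hypothetical). Case classes compare structurally,
// so two separately built plans for `SELECT * FROM src` are equal.
sealed trait ToyPlan
final case class ToyScan(table: String) extends ToyPlan
final case class ToyProject(columns: Seq[String], child: ToyPlan) extends ToyPlan

// Stand-in for the wrapper node: one instance per distinct subquery plan.
final class SharedWrapper(val plan: ToyPlan)

object DedupSketch {
  // Map every subquery occurrence to a wrapper, reusing the wrapper
  // whenever an equal plan has already been seen.
  def wrap(subqueries: Seq[ToyPlan]): Seq[SharedWrapper] = {
    val seen = mutable.Map.empty[ToyPlan, SharedWrapper]
    subqueries.map(p => seen.getOrElseUpdate(p, new SharedWrapper(p)))
  }
}
```

    For the query above, both references to `cte` would map to the same wrapper instance, while an unrelated subquery would get its own.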
    
    In planning, we create a `CommonSubqueryExec` for each `CommonSubquery`. When `CommonSubqueryExec.doExecute`
is called to materialize the results, we delegate to the executed plan wrapped in the `CommonSubquery`
and keep its results. Because all instances of a common subquery share the same executed plan and the
same variable of computed results, later calls to `CommonSubqueryExec.doExecute` can directly reuse
the computed results.
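
    The sharing described above can be sketched as follows. This is an illustration under stated assumptions, not the code in this patch: `ToyExec`, `CountingScan`, `SharedSubquery` and `SharedSubqueryExec` are hypothetical stand-ins, and results are plain `Seq[Int]` values rather than RDDs.

```scala
// Hypothetical stand-in for a physical plan node.
trait ToyExec {
  def execute(): Seq[Int]
}

// A scan that counts how many times it actually runs.
final class CountingScan extends ToyExec {
  var runs: Int = 0
  def execute(): Seq[Int] = {
    runs += 1
    Seq(1, 2, 3)
  }
}

// Plays the role of the wrapper around the executed plan: it owns the single
// variable holding the computed results. `lazy val` evaluates at most once.
final class SharedSubquery(child: ToyExec) {
  lazy val result: Seq[Int] = child.execute()
}

// Plays the role of the exec node: each reference to the subquery gets its own
// node, but all nodes delegate to the same SharedSubquery.
final class SharedSubqueryExec(shared: SharedSubquery) extends ToyExec {
  def execute(): Seq[Int] = shared.result
}

object SharedSubqueryDemo {
  // Returns both results plus the number of times the scan actually ran.
  def run(): (Seq[Int], Seq[Int], Int) = {
    val scan = new CountingScan
    val shared = new SharedSubquery(scan)
    val a = new SharedSubqueryExec(shared) // `cte a`
    val b = new SharedSubqueryExec(shared) // `cte b`
    (a.execute(), b.execute(), scan.runs)
  }
}
```

    Both exec nodes return the same rows, and the underlying scan runs only once because the second call reads the already computed value.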
    
    We benchmarked this patch on TPC-DS queries and saw significant improvement on many queries
that use CTE subqueries. We are working on some filter pushdown issues to improve it
further.
    
    Please let me know if this is clear.



---
If your project is set up for it, you can reply to this email and have your
reply appear on GitHub as well. If your project does not have this feature
enabled and wishes so, or if the feature is enabled but not working, please
contact infrastructure at infrastructure@apache.org or file a JIRA ticket
with INFRA.
---

---------------------------------------------------------------------
To unsubscribe, e-mail: reviews-unsubscribe@spark.apache.org
For additional commands, e-mail: reviews-help@spark.apache.org

