spark-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Shawn Lavelle (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (SPARK-19730) Predicate Subqueries do not push results of subqueries to data source
Date Fri, 24 Feb 2017 19:37:44 GMT

     [ https://issues.apache.org/jira/browse/SPARK-19730?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Shawn Lavelle updated SPARK-19730:
----------------------------------
    Description: 
When a SparkSQL query contains a subquery in the where clause, such as a predicate query using
the IN operator, the results of that subquery are not pushed down as a fileter to the DataSourceAPI
for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() and point in
(select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another for the outer
query. The subquery works correctly returning the points in the group, however, outer query
does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand points.  The data
table has several billion rows keyed by point and time.  Without the ability to push down
the filters for the columns of outer the query, the data source cannot properly conduct its
pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter with the results
of the subquery.

{panel:title=Physical Plan}
*Project [point#263, value#270]
+- SortMergeJoin [point#263], [col#284], LeftSemi
   :- *Sort [point#263 ASC NULLS FIRST], false, 0
   :  +- Exchange hashpartitioning(point#263, 20)
   :     +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 1487964696000))
   :        +- *Scan @4b455128 <db>.data[point#263,time#264Lvalue#270] *PushedFilters:
[GreaterThanOrEqual(time,1487964691000), LessThanOrEqual(time,1487964691000)*, ReadSchema:
struct<point:int,time:bigint,value:double>...
   +- *Sort [col#284 ASC NULLS FIRST], false, 0
      +- Exchange hashpartitioning(col#284, 20)
         +- Generate explode(points#273), false, false, [col#284]
            +- *Project [points#273]
               +- *Filter (group_id#272 = 1)
                  +- *Scan @12fb3c1a <db>.groups[points#273,group_id#272] PushedFilters:
[EqualTo(group_id,1)], ReadSchema: struct<points:array<int>>  |
{panel}

  was:
When a SparkSQL query contains a subquery in the where clause, such as a predicate query using
the IN operator, the results of that subquery are not pushed down as a fileter to the DataSourceAPI
for the outer query. 

Example: 
Select point, time, value from data where time between now()-86400 and now() and point in
(select point from groups where group_id=5);

Two queries will be sent to the data Source.  One for the subquery, and another for the outer
query. The subquery works correctly returning the points in the group, however, outer query
does not push a filter for point column.

Affect:
The "group" table has a few hundred rows to group a few hundred thousand points.  The data
table has several billion rows keyed by point and time.  Without the ability to push down
the filters for the columns of outer the query, the data source cannot properly conduct its
pruned scan.

The subquery results should be pushed down to the outer query as an IN Filter with the results
of the subquery.


> Predicate Subqueries do not push results of subqueries to data source
> ---------------------------------------------------------------------
>
>                 Key: SPARK-19730
>                 URL: https://issues.apache.org/jira/browse/SPARK-19730
>             Project: Spark
>          Issue Type: Bug
>          Components: Optimizer, SQL
>    Affects Versions: 2.1.0
>            Reporter: Shawn Lavelle
>
> When a SparkSQL query contains a subquery in the where clause, such as a predicate query
using the IN operator, the results of that subquery are not pushed down as a fileter to the
DataSourceAPI for the outer query. 
> Example: 
> Select point, time, value from data where time between now()-86400 and now() and point
in (select point from groups where group_id=5);
> Two queries will be sent to the data Source.  One for the subquery, and another for the
outer query. The subquery works correctly returning the points in the group, however, outer
query does not push a filter for point column.
> Affect:
> The "group" table has a few hundred rows to group a few hundred thousand points.  The
data table has several billion rows keyed by point and time.  Without the ability to push
down the filters for the columns of outer the query, the data source cannot properly conduct
its pruned scan.
> The subquery results should be pushed down to the outer query as an IN Filter with the
results of the subquery.
> {panel:title=Physical Plan}
> *Project [point#263, value#270]
> +- SortMergeJoin [point#263], [col#284], LeftSemi
>    :- *Sort [point#263 ASC NULLS FIRST], false, 0
>    :  +- Exchange hashpartitioning(point#263, 20)
>    :     +- *Filter ((time#264L >= 1487964691000) && (time#264L <= 1487964696000))
>    :        +- *Scan @4b455128 <db>.data[point#263,time#264Lvalue#270] *PushedFilters:
[GreaterThanOrEqual(time,1487964691000), LessThanOrEqual(time,1487964691000)*, ReadSchema:
struct<point:int,time:bigint,value:double>...
>    +- *Sort [col#284 ASC NULLS FIRST], false, 0
>       +- Exchange hashpartitioning(col#284, 20)
>          +- Generate explode(points#273), false, false, [col#284]
>             +- *Project [points#273]
>                +- *Filter (group_id#272 = 1)
>                   +- *Scan @12fb3c1a <db>.groups[points#273,group_id#272] PushedFilters:
[EqualTo(group_id,1)], ReadSchema: struct<points:array<int>>  |
> {panel}



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: issues-unsubscribe@spark.apache.org
For additional commands, e-mail: issues-help@spark.apache.org


Mime
View raw message