hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Ashutosh Chauhan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-14708) Optimizer: NOT IN query scans one input two times
Date Tue, 06 Sep 2016 22:22:21 GMT

    [ https://issues.apache.org/jira/browse/HIVE-14708?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15468804#comment-15468804
] 

Ashutosh Chauhan commented on HIVE-14708:
-----------------------------------------

I am not sure I follow. We need to know if there are any nulls in subqueries or not and we
need to know that before we start generating resultset for outer query. Thus we count null
first and then use that count to do a filter. I am not sure how is it possible to do this
in a single scan. For reference: this rewrite is described on page 10 on [design doc | https://issues.apache.org/jira/secure/attachment/12614003/SubQuerySpec.pdf]
on HIVE-784

> Optimizer: NOT IN query scans one input two times
> -------------------------------------------------
>
>                 Key: HIVE-14708
>                 URL: https://issues.apache.org/jira/browse/HIVE-14708
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 2.2.0
>            Reporter: Gopal V
>            Priority: Critical
>
> {code}
> hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from store_sales where
ss_sold_date_sk NOT in (select d_date_sk from date_dim);
>     Stage-1
>       Reducer 2 vectorized, llap
>       File Output Operator [FS_52]
>         Group By Operator [GBY_51] (rows=1 width=8)
>           Output:["_col0"],aggregations:["count(VALUE._col0)"]
>         <-Map 1 [SIMPLE_EDGE] vectorized, llap
>           SHUFFLE [RS_50]
>             Group By Operator [GBY_49] (rows=1 width=8)
>               Output:["_col0"],aggregations:["count(1)"]
>               Select Operator [SEL_48] (rows=1 width=4)
>                 Filter Operator [FIL_47] (rows=1 width=4)
>                   predicate:_col2 is null
>                   Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4)
>                     Conds:MAPJOIN_45._col0=RS_43._col0(Left Outer),Output:["_col2"]
>                   <-Map 5 [BROADCAST_EDGE] vectorized, llap
>                     BROADCAST [RS_43]
>                       PartitionCols:_col0
>                       Select Operator [SEL_42] (rows=73049 width=4)
>                         Output:["_col0"]
>                         TableScan [TS_11] (rows=73049 width=4)
>                           tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
>                   <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4)
>                       Conds:(Inner),Output:["_col0"]
>                     <-Reducer 4 [BROADCAST_EDGE] vectorized, llap
>                       BROADCAST [RS_41]
>                         Select Operator [SEL_40] (rows=1 width=8)
>                           Filter Operator [FIL_39] (rows=1 width=8)
>                             predicate:(_col0 = 0)
>                             Group By Operator [GBY_38] (rows=1 width=8)
>                               Output:["_col0"],aggregations:["count(VALUE._col0)"]
>                             <-Map 3 [SIMPLE_EDGE] vectorized, llap
>                               SHUFFLE [RS_37]
>                                 Group By Operator [GBY_36] (rows=1 width=8)
>                                   Output:["_col0"],aggregations:["count()"]
>                                   Select Operator [SEL_35] (rows=1 width=4)
>                                     Filter Operator [FIL_34] (rows=1 width=4)
>                                       predicate:d_date_sk is null
>                                       TableScan [TS_2] (rows=73049 width=4)
>                                         tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
>                     <-Select Operator [SEL_44] (rows=2879987999 width=4)
>                         Output:["_col0"]
>                         TableScan [TS_0] (rows=2879987999 width=92)
>                           tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE
> {code}
> The 2nd scan is merely to count the number of NULLs and has {{predicate:d_date_sk is
null}} in the operator.
> The NULL checks can be done inline with the NOT-NULL codepath instead of producing 2
independent full-scans of the date_dim table.
> This is not significant in a scenario like the above where the small table side is an
actual HDFS table, but entirely throttles performance when the small side is actually an expensive
aggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message