hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Gopal V (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-14708) Optimizer: NOT IN query scans one input two times
Date Tue, 06 Sep 2016 17:37:20 GMT
Gopal V created HIVE-14708:
------------------------------

             Summary: Optimizer: NOT IN query scans one input two times
                 Key: HIVE-14708
                 URL: https://issues.apache.org/jira/browse/HIVE-14708
             Project: Hive
          Issue Type: Bug
          Components: Logical Optimizer
    Affects Versions: 2.2.0
            Reporter: Gopal V
            Priority: Critical


{code}
hive (tpcds_bin_partitioned_orc_1000)> explain select count(1) from store_sales where ss_sold_date_sk
NOT in (select d_date_sk from date_dim);

    Stage-1
      Reducer 2 vectorized, llap
      File Output Operator [FS_52]
        Group By Operator [GBY_51] (rows=1 width=8)
          Output:["_col0"],aggregations:["count(VALUE._col0)"]
        <-Map 1 [SIMPLE_EDGE] vectorized, llap
          SHUFFLE [RS_50]
            Group By Operator [GBY_49] (rows=1 width=8)
              Output:["_col0"],aggregations:["count(1)"]
              Select Operator [SEL_48] (rows=1 width=4)
                Filter Operator [FIL_47] (rows=1 width=4)
                  predicate:_col2 is null
                  Map Join Operator [MAPJOIN_46] (rows=2879987999 width=4)
                    Conds:MAPJOIN_45._col0=RS_43._col0(Left Outer),Output:["_col2"]
                  <-Map 5 [BROADCAST_EDGE] vectorized, llap
                    BROADCAST [RS_43]
                      PartitionCols:_col0
                      Select Operator [SEL_42] (rows=73049 width=4)
                        Output:["_col0"]
                        TableScan [TS_11] (rows=73049 width=4)
                          tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
                  <-Map Join Operator [MAPJOIN_45] (rows=2879987999 width=4)
                      Conds:(Inner),Output:["_col0"]
                    <-Reducer 4 [BROADCAST_EDGE] vectorized, llap
                      BROADCAST [RS_41]
                        Select Operator [SEL_40] (rows=1 width=8)
                          Filter Operator [FIL_39] (rows=1 width=8)
                            predicate:(_col0 = 0)
                            Group By Operator [GBY_38] (rows=1 width=8)
                              Output:["_col0"],aggregations:["count(VALUE._col0)"]
                            <-Map 3 [SIMPLE_EDGE] vectorized, llap
                              SHUFFLE [RS_37]
                                Group By Operator [GBY_36] (rows=1 width=8)
                                  Output:["_col0"],aggregations:["count()"]
                                  Select Operator [SEL_35] (rows=1 width=4)
                                    Filter Operator [FIL_34] (rows=1 width=4)
                                      predicate:d_date_sk is null
                                      TableScan [TS_2] (rows=73049 width=4)
                                        tpcds_bin_partitioned_orc_1000@date_dim,date_dim,Tbl:COMPLETE,Col:COMPLETE,Output:["d_date_sk"]
                    <-Select Operator [SEL_44] (rows=2879987999 width=4)
                        Output:["_col0"]
                        TableScan [TS_0] (rows=2879987999 width=92)
                          tpcds_bin_partitioned_orc_1000@store_sales,store_sales,Tbl:COMPLETE,Col:COMPLETE

{code}

The 2nd scan is merely to count the number of NULLs and has {{predicate:d_date_sk is null}}
in the operator.

The NULL checks can be done inline with the NOT-NULL codepath instead of producing 2 independent
full-scans of the date_dim table.

This is not significant in a scenario like the above where the small table side is an actual
HDFS table, but entirely throttles performance when the small side is actually an expensive
aggregate.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message