hive-dev mailing list archives

From "Rajesh Balamohan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-17082) Dynamic semi join gets turned off at compile time
Date Thu, 13 Jul 2017 12:42:00 GMT
Rajesh Balamohan created HIVE-17082:
---------------------------------------

             Summary: Dynamic semi join gets turned off at compile time
                 Key: HIVE-17082
                 URL: https://issues.apache.org/jira/browse/HIVE-17082
             Project: Hive
          Issue Type: Bug
            Reporter: Rajesh Balamohan


With Hive-master:
=================

{noformat}


2017-07-13T08:35:55,042 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.DynamicPartitionPruningOptimization: Initiate semijoin reduction for sr_ticket_number ((sr_ticket_number is not null and (sr_ticket_number) IN (RS[6]))
2017-07-13T08:35:55,043 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.DynamicPartitionPruningOptimization: DynamicSemiJoinPushdown: Saving RS to TS mapping: RS[28]: TS[3]
2017-07-13T08:35:55,398 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] optimizer.ConvertJoinMapJoin: Found semijoin optimization from the big table side of a map join, which will cause a task cycle. Removing semijoin RS[28] - TS[3] (store_returns)
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: Computing key domain cardinality, keyDomainCardinality=95121413, semiJoinKeyIsPK=false, selColStat= colName: _col0 colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, selColSourceStat= colName: sr_ticket_number colType: bigint countDistincts: 8362530 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false, tsColStat= colName: ss_ticket_number colType: bigint countDistincts: 86758883 numNulls: 0 avgColLen: 8.0 numTrues: 0 numFalses: 0 Range: [ min: 1 max: 240000000 ] isPrimaryKey: false
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: SemiJoin key selectivity=0.08791427436007496, benefit=2.6267959439021907E9
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: BloomFilter benefit=2.6267959439021907E9, cost=2.87999764E8, tsDataSize=2879987999, netBenefit=2.3387961799021907E9
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: netBenefit=0.8120853908815856
2017-07-13T08:35:55,400 DEBUG [056200f2-a53f-4f38-a9e7-8bb411c73349 main] parse.TezCompiler: Semijoin optimization with parallel edge to map join. Removing semijoin RS[23] - TS[0] (store_sales)

> explain select count(1) from store_sales, store_returns where sr_ticket_number = ss_ticket_number;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20170713083602_0ed509c0-0311-480e-a01c-bafcb259a5fe:3
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: ss_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: ss_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 9560241388 Data size: 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
        Map 3
            Map Operator Tree:
                TableScan
                  alias: store_returns
                  filterExpr: sr_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: sr_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: sr_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: vectorized, llap
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink



{noformat}
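
The cost figures in the TezCompiler DEBUG lines above can be reproduced from the logged column statistics. The following is a minimal sketch in Python, reconstructed from the numbers in the log rather than taken from Hive's source: the function name `semijoin_estimate`, its parameter names, and the formulas themselves are assumptions that happen to match the logged values.

```python
import math

# Inputs copied from the DEBUG log above.
SEL_NDV = 8362530        # countDistincts of sr_ticket_number (selColSourceStat)
TS_NDV = 86758883        # countDistincts of ss_ticket_number (tsColStat)
TS_ROWS = 2879987999     # tsDataSize in the log; equals the store_sales row count
FILTER_ROWS = 287999764  # store_returns row count; equals the logged cost


def semijoin_estimate(sel_ndv, ts_ndv, ts_rows, filter_rows):
    """Hypothetical reconstruction of the logged benefit/cost arithmetic."""
    # Assumption: with semiJoinKeyIsPK=false the key domain is the sum of both NDVs.
    key_domain = sel_ndv + ts_ndv
    selectivity = sel_ndv / key_domain
    # Rows of the big table that the runtime filter is expected to drop.
    benefit = (1.0 - selectivity) * ts_rows
    # Assumption: cost is the number of rows fed into the bloom filter.
    cost = float(filter_rows)
    net_benefit = benefit - cost
    return {
        "key_domain": key_domain,
        "selectivity": selectivity,
        "benefit": benefit,
        "cost": cost,
        "net_benefit": net_benefit,
        "net_benefit_ratio": net_benefit / ts_rows,
    }


estimate = semijoin_estimate(SEL_NDV, TS_NDV, TS_ROWS, FILTER_ROWS)
for name, value in estimate.items():
    print(f"{name}={value}")

# Compare against the values printed in the log.
assert estimate["key_domain"] == 95121413
assert math.isclose(estimate["selectivity"], 0.08791427436007496, rel_tol=1e-6)
assert math.isclose(estimate["net_benefit_ratio"], 0.8120853908815856, rel_tol=1e-6)
```

Under these assumptions the normalized net benefit is about 0.81, that is, the estimate favours keeping the semijoin. The removal in the last log line comes from the separate "parallel edge to map join" check, not from the cost estimate.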

Without TezCompiler::removeSemijoinsParallelToMapJoin:
======================================================

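The semijoin reduction is retained: Map 1 gains a BROADCAST_EDGE from Reducer 4, and the store_sales scan is filtered with min/max and in_bloom_filter.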

{noformat}


> explain select count(1) from store_sales, store_returns where sr_ticket_number = ss_ticket_number;
OK
STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
      DagId: rbalamohan_20170713082329_4c868b9a-6113-4da8-8c9a-66d9018e45c0:6
      Edges:
        Map 1 <- Map 3 (BROADCAST_EDGE), Reducer 4 (BROADCAST_EDGE)
        Reducer 2 <- Map 1 (CUSTOM_SIMPLE_EDGE)
        Reducer 4 <- Map 3 (CUSTOM_SIMPLE_EDGE)
      DagName:
      Vertices:
        Map 1
            Map Operator Tree:
                TableScan
                  alias: store_sales
                  filterExpr: (ss_ticket_number is not null and (ss_ticket_number BETWEEN DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND DynamicValue(RS_7_store_returns_sr_ticket_number_max) and in_bloom_filter(ss_ticket_number, DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: boolean)
                  Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: (ss_ticket_number is not null and (ss_ticket_number BETWEEN DynamicValue(RS_7_store_returns_sr_ticket_number_min) AND DynamicValue(RS_7_store_returns_sr_ticket_number_max) and in_bloom_filter(ss_ticket_number, DynamicValue(RS_7_store_returns_sr_ticket_number_bloom_filter)))) (type: boolean)
                    Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: ss_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 2879987999 Data size: 23039903992 Basic stats: COMPLETE Column stats: COMPLETE
                      Map Join Operator
                        condition map:
                             Inner Join 0 to 1
                        keys:
                          0 _col0 (type: bigint)
                          1 _col0 (type: bigint)
                        input vertices:
                          1 Map 3
                        Statistics: Num rows: 9560241388 Data size: 76481931104 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: count()
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint)
            Execution mode: vectorized, llap
        Map 3
            Map Operator Tree:
                TableScan
                  alias: store_returns
                  filterExpr: sr_ticket_number is not null (type: boolean)
                  Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                  Filter Operator
                    predicate: sr_ticket_number is not null (type: boolean)
                    Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                    Select Operator
                      expressions: sr_ticket_number (type: bigint)
                      outputColumnNames: _col0
                      Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Reduce Output Operator
                        key expressions: _col0 (type: bigint)
                        sort order: +
                        Map-reduce partition columns: _col0 (type: bigint)
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                      Select Operator
                        expressions: _col0 (type: bigint)
                        outputColumnNames: _col0
                        Statistics: Num rows: 287999764 Data size: 2303998112 Basic stats: COMPLETE Column stats: COMPLETE
                        Group By Operator
                          aggregations: min(_col0), max(_col0), bloom_filter(_col0, expectedEntries=16725060)
                          mode: hash
                          outputColumnNames: _col0, _col1, _col2
                          Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                          Reduce Output Operator
                            sort order:
                            Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                            value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: binary)
            Execution mode: vectorized, llap
        Reducer 2
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: COMPLETE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
        Reducer 4
            Execution mode: vectorized, llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: min(VALUE._col0), max(VALUE._col1), bloom_filter(VALUE._col2, expectedEntries=16725060)
                mode: final
                outputColumnNames: _col0, _col1, _col2
                Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                Reduce Output Operator
                  sort order:
                  Statistics: Num rows: 1 Data size: 24 Basic stats: COMPLETE Column stats: COMPLETE
                  value expressions: _col0 (type: bigint), _col1 (type: bigint), _col2 (type: binary)

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{noformat}
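
For readers unfamiliar with the filterExpr in the plan above, the runtime filter combines a min/max range check with a bloom filter membership test, both built from the store_returns side and applied to the store_sales scan. A minimal sketch of that idea in Python follows. It is not Hive's implementation: `TinyBloomFilter`, `build_runtime_filter`, `passes_filter`, and the bit and hash counts are all hypothetical and chosen only for illustration.

```python
import hashlib


class TinyBloomFilter:
    """Toy bloom filter; sizes are illustrative, not Hive's defaults."""

    def __init__(self, num_bits=1024, num_hashes=3):
        self.num_bits = num_bits
        self.num_hashes = num_hashes
        self.bits = bytearray(num_bits // 8 + 1)

    def _positions(self, value):
        # Derive num_hashes bit positions from seeded SHA-256 digests.
        for seed in range(self.num_hashes):
            digest = hashlib.sha256(f"{seed}:{value}".encode()).digest()
            yield int.from_bytes(digest[:8], "big") % self.num_bits

    def add(self, value):
        for pos in self._positions(value):
            self.bits[pos // 8] |= 1 << (pos % 8)

    def might_contain(self, value):
        # False means definitely absent; True means possibly present.
        return all(self.bits[pos // 8] & (1 << (pos % 8))
                   for pos in self._positions(value))


def build_runtime_filter(small_side_keys):
    """Aggregate min, max and a bloom filter over the non-null keys.

    Assumes at least one non-null key is present.
    """
    keys = [k for k in small_side_keys if k is not None]
    bloom = TinyBloomFilter()
    for key in keys:
        bloom.add(key)
    return min(keys), max(keys), bloom


def passes_filter(key, lo, hi, bloom):
    """Mirror of: key is not null and key BETWEEN lo AND hi and in_bloom_filter(key)."""
    return key is not None and lo <= key <= hi and bloom.might_contain(key)


# Usage: keys from the small side (store_returns in the plan) filter the big side.
lo, hi, bloom = build_runtime_filter([10, 20, None, 30])
sales_keys = [5, 10, 20, 30, 40, None]
kept = [k for k in sales_keys if passes_filter(k, lo, hi, bloom)]
print(kept)
```

Keys outside the min/max range and null keys are always rejected, and keys present on the small side always pass. Other in-range keys may pass as bloom filter false positives, which is harmless because the join itself still discards them.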

Related ticket: HIVE-16260



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
