hive-issues mailing list archives

From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]
Date Tue, 09 May 2017 06:08:04 GMT

    [ https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16002113#comment-16002113 ]

liyunzhang_intel commented on HIVE-11297:
-----------------------------------------

The query for the "multiple columns single source" case in spark_dynamic_partition_pruning.q is:
{code}
-- multiple columns single source
EXPLAIN select count(*) from srcpart join srcpart_date_hour
on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

Its explain plan on Spark is:
{code}
STAGE DEPENDENCIES:
  Stage-2 is a root stage
  Stage-1 depends on stages: Stage-2
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: string)
                      sort order: ++
                      Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                      Statistics: Num rows: 2000 Data size: 21248 Basic stats: COMPLETE Column stats: NONE
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col2 (type: string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), _col2 (type: string)
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
        Reducer 2 
            Reduce Operator Tree:
              Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col2 (type: string)
                Statistics: Num rows: 2200 Data size: 23372 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col0 (type: bigint)
        Reducer 3 
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink
{code}

Comparing Map 5 and Map 6: they are identical except for the Spark Partition Pruning Sink Operator (one prunes on {{ds}}, the other on {{hr}}), yet each scans srcpart_date_hour separately.
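The combining idea can be illustrated with a toy tree-merge sketch (a hypothetical {{Op}} class and made-up node names, not Hive's actual operator classes): walk the two trees from their common table scan, reuse nodes while they match, and graft the diverging branch at the split point, so one TS/FIL/SEL prefix feeds both pruning sinks.
{code}
# Illustrative sketch only: merge two operator trees that share a common
# prefix (TS -> FIL -> SEL), so the shared scan is done once and both
# partition-pruning branches hang off the split point.

class Op:
    def __init__(self, name, children=None):
        self.name = name
        self.children = children or []

    def signature(self):
        # Treat ops with the same name as mergeable; real Hive would
        # compare the full operator descriptors, not just a name.
        return self.name

def merge(a, b):
    """Merge tree b into tree a along their common prefix; return a."""
    if a.signature() != b.signature():
        raise ValueError("roots differ; nothing to merge")
    if a.children and b.children and a.children[0].signature() == b.children[0].signature():
        merge(a.children[0], b.children[0])   # keep walking the shared chain
    else:
        a.children.extend(b.children)         # split point: graft b's branch
    return a

# Map 5: TS -> FIL -> SEL -> SEL[ds] -> GBY -> SINK[ds]
tree5 = Op("TS", [Op("FIL", [Op("SEL", [Op("SEL[ds]", [Op("GBY", [Op("SINK[ds]")])])])])])
# Map 6: identical prefix, then the hr branch
tree6 = Op("TS", [Op("FIL", [Op("SEL", [Op("SEL[hr]", [Op("GBY", [Op("SINK[hr]")])])])])])

combined = merge(tree5, tree6)
sel = combined.children[0].children[0]
print([c.name for c in sel.children])   # both branches now share one TS/FIL/SEL
{code}
After the merge the shared SEL has two children, which matches the shape Tez produces below: a single Map 4 whose common prefix fans out into the {{ds}} and {{hr}} branches.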

For the same query on Tez, there is only one map vertex (Map 4), which contains both {{Dynamic Partitioning Event Operator}}s:
{code}

STAGE DEPENDENCIES:
  Stage-1 is a root stage
  Stage-0 depends on stages: Stage-1

STAGE PLANS:
  Stage: Stage-1
    Tez
#### A masked pattern was here ####
      Edges:
        Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE)
        Reducer 3 <- Reducer 2 (SIMPLE_EDGE)
#### A masked pattern was here ####
      Vertices:
        Map 1 
            Map Operator Tree:
                TableScan
                  alias: srcpart
                  Statistics: Num rows: 2000 Data size: 757248 Basic stats: COMPLETE Column stats: COMPLETE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col1
                    Statistics: Num rows: 2000 Data size: 736000 Basic stats: COMPLETE Column stats: COMPLETE
                    Reduce Output Operator
                      key expressions: _col0 (type: string), _col1 (type: string)
                      sort order: ++
                      Map-reduce partition columns: _col0 (type: string), _col1 (type: string)
                      Statistics: Num rows: 2000 Data size: 736000 Basic stats: COMPLETE Column stats: COMPLETE
            Execution mode: llap
            LLAP IO: no inputs
        Map 4 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Reduce Output Operator
                        key expressions: _col0 (type: string), _col2 (type: string)
                        sort order: ++
                        Map-reduce partition columns: _col0 (type: string), _col2 (type: string)
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Dynamic Partitioning Event Operator
                            Target column: ds (string)
                            Target Input: srcpart
                            Partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            Target Vertex: Map 1
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Dynamic Partitioning Event Operator
                            Target column: hr (string)
                            Target Input: srcpart
                            Partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            Target Vertex: Map 1
                            Target Vertex: Map 1
            Execution mode: llap
            LLAP IO: no inputs
        Reducer 2 
            Execution mode: llap
            Reduce Operator Tree:
              Merge Join Operator
                condition map:
                     Inner Join 0 to 1
                keys:
                  0 _col0 (type: string), _col1 (type: string)
                  1 _col0 (type: string), _col2 (type: string)
                Statistics: Num rows: 2200 Data size: 809600 Basic stats: COMPLETE Column stats: NONE
                Group By Operator
                  aggregations: count()
                  mode: hash
                  outputColumnNames: _col0
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  Reduce Output Operator
                    sort order: 
                    Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                    value expressions: _col0 (type: bigint)
        Reducer 3 
            Execution mode: llap
            Reduce Operator Tree:
              Group By Operator
                aggregations: count(VALUE._col0)
                mode: mergepartial
                outputColumnNames: _col0
                Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                File Output Operator
                  compressed: false
                  Statistics: Num rows: 1 Data size: 8 Basic stats: COMPLETE Column stats: NONE
                  table:
                      input format: org.apache.hadoop.mapred.SequenceFileInputFormat
                      output format: org.apache.hadoop.hive.ql.io.HiveSequenceFileOutputFormat
                      serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe

  Stage: Stage-0
    Fetch Operator
      limit: -1
      Processor Tree:
        ListSink

{code}




> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-11297
>                 URL: https://issues.apache.org/jira/browse/HIVE-11297
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: spark-branch
>            Reporter: Chao Sun
>            Assignee: liyunzhang_intel
>
> Currently, for dynamic partition pruning in Spark, if a small table generates partition info for more than one partition column, multiple operator trees are created, all starting from the same table scan op but ending in different spark partition pruning sinks.
> As an optimization, we can combine these op trees so that the table scan does not have to be done multiple times.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
