hive-issues mailing list archives

From "liyunzhang_intel (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-11297) Combine op trees for partition info generating tasks [Spark branch]
Date Tue, 13 Jun 2017 07:01:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-11297?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

liyunzhang_intel updated HIVE-11297:
------------------------------------
    Attachment: HIVE-11297.3.patch

[~csun]: I updated SplitOpTreeForDPP to split the trees the way you mentioned last time, because the explain plan for the following query changes after this JIRA:
{code}
set hive.execution.engine=spark; 
set hive.auto.convert.join.noconditionaltask.size=20; 
set hive.spark.dynamic.partition.pruning=true;
select count(*) from srcpart join srcpart_date_hour on (srcpart.ds = srcpart_date_hour.ds and srcpart.hr = srcpart_date_hour.hr)
where srcpart_date_hour.`date` = '2008-04-08' and srcpart_date_hour.hour = 11;
{code}

Before:
{code}
STAGE PLANS:
  Stage: Stage-2
    Spark
#### A masked pattern was here ####
      Vertices:
        Map 5 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col0 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: ds
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: ds
                            target work: Map 1
        Map 6 
            Map Operator Tree:
                TableScan
                  alias: srcpart_date_hour
                  filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                  Filter Operator
                    predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: ds (type: string), hr (type: string)
                      outputColumnNames: _col0, _col2
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Select Operator
                        expressions: _col2 (type: string)
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Group By Operator
                          keys: _col0 (type: string)
                          mode: hash
                          outputColumnNames: _col0
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          Spark Partition Pruning Sink Operator
                            partition key expr: hr
                            Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                            target column name: hr
                            target work: Map 1

  Stage: Stage-1
    Spark
      Edges:
        Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
        Reducer 3 <- Reducer 2 (GROUP, 1)
{code}

Now (with HIVE-11297.3.patch):
{code}
Stage: Stage-2
  Spark
#### A masked pattern was here ####
    Vertices:
      Map 5 
          Map Operator Tree:
              TableScan
                alias: srcpart_date_hour
                filterExpr: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                Statistics: Num rows: 4 Data size: 108 Basic stats: COMPLETE Column stats: NONE
                Filter Operator
                  predicate: ((date = '2008-04-08') and (UDFToDouble(hour) = 11.0) and ds is not null and hr is not null) (type: boolean)
                  Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                  Select Operator
                    expressions: ds (type: string), hr (type: string)
                    outputColumnNames: _col0, _col2
                    Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                    Select Operator
                      expressions: _col0 (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        keys: _col0 (type: string)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Spark Partition Pruning Sink Operator
                          partition key expr: ds
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          target column name: ds
                          target work: Map 1
                    Select Operator
                      expressions: _col2 (type: string)
                      outputColumnNames: _col0
                      Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                      Group By Operator
                        keys: _col0 (type: string)
                        mode: hash
                        outputColumnNames: _col0
                        Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                        Spark Partition Pruning Sink Operator
                          partition key expr: hr
                          Statistics: Num rows: 1 Data size: 27 Basic stats: COMPLETE Column stats: NONE
                          target column name: hr
                          target work: Map 1

Stage: Stage-1
  Spark
    Edges:
      Reducer 2 <- Map 1 (PARTITION-LEVEL SORT, 2), Map 4 (PARTITION-LEVEL SORT, 2)
      Reducer 3 <- Reducer 2 (GROUP, 1)
{code}
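
In the new plan, a single TableScan on srcpart_date_hour feeds two Select/Group By/pruning-sink branches (one targeting ds, one targeting hr), so the duplicate Map 6 disappears. As a rough illustration of the splitting idea, here is a minimal, self-contained Java sketch; the Op type and method names are hypothetical stand-ins, not Hive's actual SplitOpTreeForDPP API:
{code}
import java.util.ArrayList;
import java.util.List;

// Hypothetical stand-in for an operator-tree node; not Hive's Operator class.
final class Op {
    final String name;
    final List<Op> children = new ArrayList<>();
    Op(String name) { this.name = name; }
    boolean isPruningSink() { return name.startsWith("SparkPartitionPruningSink"); }
}

public class SplitOpTreeSketch {

    // True if any node in the subtree rooted at op is a pruning sink.
    static boolean hasPruningSink(Op op) {
        if (op.isPruningSink()) return true;
        for (Op child : op.children) {
            if (hasPruningSink(child)) return true;
        }
        return false;
    }

    // Move ALL pruning-sink branches below the last common operator into ONE
    // clone of the shared prefix, instead of one cloned tree per sink. The
    // clone becomes its own map work (a single Map 5 in the plan above); the
    // main-query branch stays put, and the small table is scanned only once.
    static Op splitCombined(Op lastCommonOp) {
        Op clone = new Op(lastCommonOp.name);    // shared prefix, cloned once
        List<Op> remaining = new ArrayList<>();
        for (Op child : lastCommonOp.children) {
            if (hasPruningSink(child)) {
                clone.children.add(child);       // ds branch and hr branch land together
            } else {
                remaining.add(child);            // join/main-query branch is untouched
            }
        }
        lastCommonOp.children.clear();
        lastCommonOp.children.addAll(remaining);
        return clone;
    }
}
{code}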

But when I use the following command to generate a new spark_dynamic_partition_pruning.q.out:
{code}
mvn clean test -Dtest=TestSparkCliDriver -Dtest.output.overwrite=true -Dqfile=spark_dynamic_partition_pruning.q
{code}

I found that it not only changed the explain plan above but also changed other parts. The changes look like the following.
1. Comments like "SORT_QUERY_RESULTS" are deleted. How can I keep the original format?
{code}
-PREHOOK: query: -- SORT_QUERY_RESULTS
-
-select distinct ds from srcpart
+PREHOOK: query: select distinct ds from srcpart
+POSTHOOK: query: EXPLAIN select count(*) from srcpart join srcpart_date on (srcpart.ds = srcpart_date.ds) join srcpart_hour on (srcpart.hr = srcpart_hour.hr) 
{code}

2. Some changes are not caused by HIVE-11297.3.patch, e.g. a Filter Operator is added to the explain plan:
{code}
POSTHOOK: type: QUERY
 STAGE DEPENDENCIES:
   Stage-2 is a root stage
@@ -3168,16 +3141,19 @@ STAGE PLANS:
                 mode: mergepartial
                 outputColumnNames: _col0
                 Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
-                Group By Operator
-                  keys: _col0 (type: string)
-                  mode: hash
-                  outputColumnNames: _col0
-                  Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
-                  Reduce Output Operator
-                    key expressions: _col0 (type: string)
-                    sort order: +
-                    Map-reduce partition columns: _col0 (type: string)
+                Filter Operator
+                  predicate: _col0 is not null (type: boolean)
+                  Statistics: Num rows: 1 Data size: 184 Basic stats: COMPLETE Column stats: NONE
+                  Group By Operator
+                    keys: _col0 (type: string)
+                    mode: hash
+                    outputColumnNames: _col0
                     Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
+                    Reduce Output Operator
+                      key expressions: _col0 (type: string)
+                      sort order: +
+                      Map-reduce partition columns: _col0 (type: string)
+                      Statistics: Num rows: 2 Data size: 368 Basic stats: COMPLETE Column stats: NONE
{code}

How should we handle the changes in spark_dynamic_partition_pruning.q.out?
1. Manually apply only the changes caused by HIVE-11297.3.patch.
2. Use "-Dtest.output.overwrite=true" to generate a whole new *.q.out.
Which do you prefer?
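
For option 1, one possible workflow (a sketch; it assumes a git checkout and the usual ql/src/test/results/clientpositive/spark location of the golden file) is to regenerate the single file and then keep only the hunks this patch explains:
{code}
# regenerate only this q.out (note: no space after -Dqfile=)
mvn clean test -Dtest=TestSparkCliDriver -Dtest.output.overwrite=true -Dqfile=spark_dynamic_partition_pruning.q
# review which hunks are attributable to HIVE-11297.3.patch before committing
git diff ql/src/test/results/clientpositive/spark/spark_dynamic_partition_pruning.q.out
{code}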


> Combine op trees for partition info generating tasks [Spark branch]
> -------------------------------------------------------------------
>
>                 Key: HIVE-11297
>                 URL: https://issues.apache.org/jira/browse/HIVE-11297
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: spark-branch
>            Reporter: Chao Sun
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-11297.1.patch, HIVE-11297.2.patch, HIVE-11297.3.patch
>
>
> Currently, for dynamic partition pruning in Spark, if a small table generates partition info for more than one partition column, multiple operator trees are created, which all start from the same table scan op but have different spark partition pruning sinks.
> As an optimization, we can combine these op trees so we don't have to do the table scan multiple times.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
