hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jesus Camacho Rodriguez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-13693) Multi-insert query drops Filter before file output when there is a.val <> b.val
Date Thu, 05 May 2016 17:43:12 GMT

    [ https://issues.apache.org/jira/browse/HIVE-13693?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15272699#comment-15272699
] 

Jesus Camacho Rodriguez commented on HIVE-13693:
------------------------------------------------

Problem is in PPD and it only happens for multi-insert.

HIVE-9695 introduced partial removal of predicates in Filter operators. It was a fairly simple
patch: given a Filter operator, a new Filter will replace the original one with the parts
of the condition that cannot be pushed down, while the other parts of the condition will be
pushed.
The patch relied on createFilter for the logic of Filter creation, that would replace/remove
the original Filter immediately after the predicate was pushed.

The problem comes with operators with multiple outputs i.e. multi-insert, for which PPD does
not pushed predicates through. In particular, when we are executing PPD and we detect an operator
with multiple outputs, we just jump out to its input and thus we do not continue pushing the
predicates. However, due to the logic introduced in HIVE-9695, the original Filter already
got removed...

Attached a fix that recreates the Filter operator(s) if necessary on top of the multi-output
operator.

> Multi-insert query drops Filter before file output when there is a.val <> b.val
> -------------------------------------------------------------------------------
>
>                 Key: HIVE-13693
>                 URL: https://issues.apache.org/jira/browse/HIVE-13693
>             Project: Hive
>          Issue Type: Bug
>          Components: Logical Optimizer
>    Affects Versions: 1.3.0, 2.0.0, 2.1.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>
> To reproduce:
> {noformat}
> CREATE TABLE T_A ( id STRING, val STRING ); 
> CREATE TABLE T_B ( id STRING, val STRING ); 
> CREATE TABLE join_result_1 ( ida STRING, vala STRING, idb STRING, valb STRING ); 
> CREATE TABLE join_result_3 ( ida STRING, vala STRING, idb STRING, valb STRING ); 
> INSERT INTO TABLE T_A 
> VALUES ('Id_1', 'val_101'), ('Id_2', 'val_102'), ('Id_3', 'val_103'); 
> INSERT INTO TABLE T_B 
> VALUES ('Id_1', 'val_103'), ('Id_2', 'val_104'); 
> explain
> FROM T_A a LEFT JOIN T_B b ON a.id = b.id
> INSERT OVERWRITE TABLE join_result_1
> SELECT a.*, b.*
> WHERE b.id = 'Id_1' AND b.val = 'val_103'
> INSERT OVERWRITE TABLE join_result_3
> SELECT a.*, b.*
> WHERE b.val = 'val_104' AND b.id = 'Id_2' AND a.val <> b.val;
> {noformat}
> The (wrong) plan is the following:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-2 is a root stage
>   Stage-3 depends on stages: Stage-2
>   Stage-0 depends on stages: Stage-3
>   Stage-4 depends on stages: Stage-0
>   Stage-1 depends on stages: Stage-3
>   Stage-5 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-2
>     Tez
>       DagId: haha_20160504140944_174465c9-5d1a-42f9-9665-fae02eeb2767:2
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 3 (SIMPLE_EDGE)
>       DagName: 
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: a
>                   Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column
stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: string)
>                     sort order: +
>                     Map-reduce partition columns: id (type: string)
>                     Statistics: Num rows: 3 Data size: 36 Basic stats: COMPLETE Column
stats: NONE
>                     value expressions: val (type: string)
>         Map 3 
>             Map Operator Tree:
>                 TableScan
>                   alias: b
>                   Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column
stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: string)
>                     sort order: +
>                     Map-reduce partition columns: id (type: string)
>                     Statistics: Num rows: 2 Data size: 24 Basic stats: COMPLETE Column
stats: NONE
>                     value expressions: val (type: string)
>         Reducer 2 
>             Reduce Operator Tree:
>               Merge Join Operator
>                 condition map:
>                      Left Outer Join0 to 1
>                 keys:
>                   0 id (type: string)
>                   1 id (type: string)
>                 outputColumnNames: _col0, _col1, _col6
>                 Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE Column stats:
NONE
>                 Select Operator
>                   expressions: _col0 (type: string), _col1 (type: string), 'Id_1' (type:
string), 'val_103' (type: string)
>                   outputColumnNames: _col0, _col1, _col2, _col3
>                   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE Column
stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE Column
stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                         name: bugtest2.join_result_1
>                 Filter Operator
>                   predicate: (_col1 <> _col6) (type: boolean)
>                   Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE Column
stats: NONE
>                   Select Operator
>                     expressions: _col0 (type: string), _col1 (type: string), 'Id_2' (type:
string), 'val_104' (type: string)
>                     outputColumnNames: _col0, _col1, _col2, _col3
>                     Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE Column
stats: NONE
>                     File Output Operator
>                       compressed: false
>                       Statistics: Num rows: 3 Data size: 39 Basic stats: COMPLETE Column
stats: NONE
>                       table:
>                           input format: org.apache.hadoop.mapred.TextInputFormat
>                           output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                           serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>                           name: bugtest2.join_result_3
>   Stage: Stage-3
>     Dependency Collection
>   Stage: Stage-0
>     Move Operator
>       tables:
>           replace: true
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: bugtest2.join_result_1
>   Stage: Stage-4
>     Stats-Aggr Operator
>   Stage: Stage-1
>     Move Operator
>       tables:
>           replace: true
>           table:
>               input format: org.apache.hadoop.mapred.TextInputFormat
>               output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>               serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>               name: bugtest2.join_result_3
>   Stage: Stage-5
>     Stats-Aggr Operator
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message