hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jesus Camacho Rodriguez (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-12465) Hive might produce wrong results when (outer) joins are merged
Date Tue, 24 Nov 2015 15:36:11 GMT

    [ https://issues.apache.org/jira/browse/HIVE-12465?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15024688#comment-15024688
] 

Jesus Camacho Rodriguez commented on HIVE-12465:
------------------------------------------------

[~ashutoshc], could you take a look? I need to regenerate the mergejoin.q test file for Tez;
the rest of the test fails are not related.

SemanticAnalyzer itself has some simple logic to push predicates down. This patch checks whether
the join below is an outer join, in which case the predicate cannot be pushed, as it would
lead to incorrect results. I have added tests to mergejoin.q that reproduce the issue.

> Hive might produce wrong results when (outer) joins are merged
> --------------------------------------------------------------
>
>                 Key: HIVE-12465
>                 URL: https://issues.apache.org/jira/browse/HIVE-12465
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.3.0, 2.0.0
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Blocker
>         Attachments: HIVE-12465.01.patch, HIVE-12465.patch
>
>
> Consider the following query:
> {noformat}
> select * from
>   (select * from tab where tab.key = 0)a
> full outer join
>   (select * from tab_part where tab_part.key = 98)b
> join
>   tab_part c
> on a.key = b.key and b.key = c.key;
> {noformat}
> Hive should execute the full outer join operation (without ON clause) and then the join
operation (ON a.key = b.key and b.key = c.key). Instead, it merges both joins, generating
the following plan:
> {noformat}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 depends on stages: Stage-1
> STAGE PLANS:
>   Stage: Stage-1
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: tab
>             filterExpr: (key = 0) (type: boolean)
>             Statistics: Num rows: 242 Data size: 22748 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: (key = 0) (type: boolean)
>               Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column
stats: NONE
>               Select Operator
>                 expressions: 0 (type: int), value (type: string), ds (type: string)
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column
stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col0 (type: int)
>                   Statistics: Num rows: 121 Data size: 11374 Basic stats: COMPLETE Column
stats: NONE
>                   value expressions: _col1 (type: string), _col2 (type: string)
>           TableScan
>             alias: tab_part
>             filterExpr: (key = 98) (type: boolean)
>             Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: (key = 98) (type: boolean)
>               Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column
stats: NONE
>               Select Operator
>                 expressions: 98 (type: int), value (type: string), ds (type: string)
>                 outputColumnNames: _col0, _col1, _col2
>                 Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column
stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col0 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col0 (type: int)
>                   Statistics: Num rows: 250 Data size: 23500 Basic stats: COMPLETE Column
stats: NONE
>                   value expressions: _col1 (type: string), _col2 (type: string)
>           TableScan
>             alias: c
>             Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column stats:
NONE
>             Reduce Output Operator
>               key expressions: key (type: int)
>               sort order: +
>               Map-reduce partition columns: key (type: int)
>               Statistics: Num rows: 500 Data size: 47000 Basic stats: COMPLETE Column
stats: NONE
>               value expressions: value (type: string), ds (type: string)
>       Reduce Operator Tree:
>         Join Operator
>           condition map:
>                Outer Join 0 to 1
>                Inner Join 1 to 2
>           keys:
>             0 _col0 (type: int)
>             1 _col0 (type: int)
>             2 key (type: int)
>           outputColumnNames: _col0, _col1, _col2, _col3, _col4, _col5, _col6, _col7,
_col8
>           Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column stats:
NONE
>           File Output Operator
>             compressed: false
>             Statistics: Num rows: 1100 Data size: 103400 Basic stats: COMPLETE Column
stats: NONE
>             table:
>                 input format: org.apache.hadoop.mapred.TextInputFormat
>                 output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                 serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> {noformat}
> That plan is equivalent to the following query, which is different than the original
one:
> {noformat}
> select * from
>   (select * from tab where tab.key = 0)a
> full outer join
>   (select * from tab_part where tab_part.key = 98)b
> on a.key = b.key
> join
>   tab_part c
> on b.key = c.key;
> {noformat}
> It seems to be a problem in the recognition of join operations that can be merged into
a single multijoin operator.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message