hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Costas Piliotis (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-13623) Hive on tez produce wrong results when withClause and (outer) joins
Date Wed, 04 Jan 2017 19:03:58 GMT

    [ https://issues.apache.org/jira/browse/HIVE-13623?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15799003#comment-15799003
] 

Costas Piliotis edited comment on HIVE-13623 at 1/4/17 7:03 PM:
----------------------------------------------------------------

Hi, I've just seen this issue too on tez 0.7.0.  

Query works fine on mr engine and fine in presto but tez is producing incorrect results. 
 I'm using a CTE

{code:sql}
select  tableaCTE.intCol1, tableaCTE.strCol2
from (
  select distinct intCol1, strCol2
  from tablea ) tableaCTE
    left outer join tableb on (
             tableaCTE.strCol2=tableb.strCol2 AND
            tableaCTE.intCol1=tableb.intCol1
           )
where tableb.intCol1 IS NULL
{code}

I've tried rewriting the query to use a WITH clause for the CTE and it made no difference.

*however* if I save my data to a temp table first, the query does return correct results,
so I'd be inclined to think that this is a runtime interpretation of the CTE being handled
incorrectly.

When I set the execution engine to mapreduce, it works fine and returns expected results,
and this same query returns expected results in presto as well.    It appears to be a tez
related issue.

I know this is marked as trivial, but I'll argue that if hiveQL is returning incorrect results
on this pattern it shouldn't be interpreted as trivial.

Below is my actual explain plan for my real query:
{code}
Explain
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (CUSTOM_SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 3
         File Output Operator [FS_28]
            compressed:true
            Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE Column
stats: NONE
            table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}
            Select Operator [OP_27]
            |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4"]
            |  Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE Column
stats: NONE
            |<-Reducer 2 [SIMPLE_EDGE]
               Reduce Output Operator [RS_26]
                  key expressions:_col0 (type: int), _col1 (type: string)
                  sort order:++
                  Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE
Column stats: NONE
                  value expressions:_col2 (type: int), _col3 (type: string), _col4 (type:
boolean)
                  Select Operator [OP_25]
                     outputColumnNames:["_col0","_col1","_col2","_col3","_col4"]
                     Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE
Column stats: NONE
                     Filter Operator [FIL_24]
                        predicate:_col4 is null (type: boolean)
                        Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE
Column stats: NONE
                        Map Join Operator [MAPJOIN_23]
                        |  condition map:[{"":"Left Outer Join0 to 1"}]
                        |  keys:{"Reducer 2":"_col0 (type: int), _col1 (type: string)","Map
4":"dim_language_id (type: int), localized_level_name (type: string)"}
                        |  outputColumnNames:["_col0","_col1","_col4","_col5","_col6"]
                        |  Statistics:Num rows: 10064852 Data size: 1046744745 Basic stats:
COMPLETE Column stats: NONE
                        |<-Map 4 [CUSTOM_SIMPLE_EDGE]
                        |  Reduce Output Operator [RS_20]
                        |     key expressions:dim_language_id (type: int), localized_level_name
(type: string)
                        |     Map-reduce partition columns:dim_language_id (type: int), localized_level_name
(type: string)
                        |     sort order:++
                        |     Statistics:Num rows: 7741 Data size: 1068258 Basic stats: COMPLETE
Column stats: NONE
                        |     value expressions:dim_level_nl_id (type: int)
                        |     TableScan [TS_6]
                        |        alias:da
                        |        Statistics:Num rows: 7741 Data size: 1068258 Basic stats:
COMPLETE Column stats: NONE
                        |<-Select Operator [OP_22]
                              outputColumnNames:["_col0","_col1"]
                              Statistics:Num rows: 9149866 Data size: 951586112 Basic stats:
COMPLETE Column stats: NONE
                              Group By Operator [OP_21]
                              |  keys:KEY._col0 (type: int), KEY._col1 (type: string), KEY._col2
(type: int)
                              |  outputColumnNames:["_col0","_col1","_col2"]
                              |  Statistics:Num rows: 9149866 Data size: 951586112 Basic stats:
COMPLETE Column stats: NONE
                              |<-Map 1 [SIMPLE_EDGE]
                                 Reduce Output Operator [RS_3]
                                    key expressions:_col0 (type: int), _col1 (type: string),
_col2 (type: int)
                                    Map-reduce partition columns:_col0 (type: int), _col1
(type: string), _col2 (type: int)
                                    sort order:+++
                                    Statistics:Num rows: 18299732 Data size: 1903172224 Basic
stats: COMPLETE Column stats: NONE
                                    Group By Operator [GBY_2]
                                       keys:dim_language_id (type: int), level (type: string),
1 (type: int)
                                       outputColumnNames:["_col0","_col1","_col2"]
                                       Statistics:Num rows: 18299732 Data size: 1903172224
Basic stats: COMPLETE Column stats: NONE
                                       Select Operator [SEL_1]
                                          outputColumnNames:["dim_language_id","level"]
                                          Statistics:Num rows: 18299732 Data size: 1903172224
Basic stats: COMPLETE Column stats: NONE
                                          TableScan [TS_0]
                                             alias:trapteam_fact_event_timeline
                                             Statistics:Num rows: 18299732 Data size: 1903172224
Basic stats: COMPLETE Column stats: NONE
{code}

I suspect this issue is related to HIVE-14027 as well


was (Author: cpiliotis):
Hi, I've just seen this issue too on tez 0.7.0.  

Query works fine on mr engine and fine in presto but tez is producing incorrect results. 
 I'm using a CTE

{code:sql}
select  tableaCTE.intCol1, tableaCTE.strCol2
from (
  select distinct intCol1, strCol2
  from tablea ) tableaCTE
    left outer join tableb on (
             tableaCTE.strCol2=tableb.strCol2 AND
            tableaCTE.intCol1=tableb.intCol1
           )
where tableb.intCol1 IS NULL
{code}

I've tried rewriting the query to use a WITH clause for the CTE and it made no difference.

*however* if I save my data to a temp table first, the query does return correct results,
so I'd be inclined to think that this is a runtime interpretation of the CTE being handled
incorrectly.

When I set the execution engine to mapreduce, it works fine and returns expected results,
and this same query returns expected results in presto as well.    It appears to be a tez
related issue.

I know this is marked as trivial, but I'll argue that if hiveQL is returning incorrect results
on this pattern it shouldn't be interpreted as trivial.

Below is my actual explain plan for my real query:
{code}
Explain
Plan not optimized by CBO.

Vertex dependency in root stage
Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (CUSTOM_SIMPLE_EDGE)
Reducer 3 <- Reducer 2 (SIMPLE_EDGE)

Stage-0
   Fetch Operator
      limit:-1
      Stage-1
         Reducer 3
         File Output Operator [FS_28]
            compressed:true
            Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE Column
stats: NONE
            table:{"serde:":"org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe","input format:":"org.apache.hadoop.mapred.TextInputFormat","output
format:":"org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat"}
            Select Operator [OP_27]
            |  outputColumnNames:["_col0","_col1","_col2","_col3","_col4"]
            |  Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE Column
stats: NONE
            |<-Reducer 2 [SIMPLE_EDGE]
               Reduce Output Operator [RS_26]
                  key expressions:_col0 (type: int), _col1 (type: string)
                  sort order:++
                  Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE
Column stats: NONE
                  value expressions:_col2 (type: int), _col3 (type: string), _col4 (type:
boolean)
                  Select Operator [OP_25]
                     outputColumnNames:["_col0","_col1","_col2","_col3","_col4"]
                     Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE
Column stats: NONE
                     Filter Operator [FIL_24]
                        predicate:_col4 is null (type: boolean)
                        Statistics:Num rows: 5032426 Data size: 523372372 Basic stats: COMPLETE
Column stats: NONE
                        Map Join Operator [MAPJOIN_23]
                        |  condition map:[{"":"Left Outer Join0 to 1"}]
                        |  keys:{"Reducer 2":"_col0 (type: int), _col1 (type: string)","Map
4":"dim_language_id (type: int), localized_level_name (type: string)"}
                        |  outputColumnNames:["_col0","_col1","_col4","_col5","_col6"]
                        |  Statistics:Num rows: 10064852 Data size: 1046744745 Basic stats:
COMPLETE Column stats: NONE
                        |<-Map 4 [CUSTOM_SIMPLE_EDGE]
                        |  Reduce Output Operator [RS_20]
                        |     key expressions:dim_language_id (type: int), localized_level_name
(type: string)
                        |     Map-reduce partition columns:dim_language_id (type: int), localized_level_name
(type: string)
                        |     sort order:++
                        |     Statistics:Num rows: 7741 Data size: 1068258 Basic stats: COMPLETE
Column stats: NONE
                        |     value expressions:dim_level_nl_id (type: int)
                        |     TableScan [TS_6]
                        |        alias:da
                        |        Statistics:Num rows: 7741 Data size: 1068258 Basic stats:
COMPLETE Column stats: NONE
                        |<-Select Operator [OP_22]
                              outputColumnNames:["_col0","_col1"]
                              Statistics:Num rows: 9149866 Data size: 951586112 Basic stats:
COMPLETE Column stats: NONE
                              Group By Operator [OP_21]
                              |  keys:KEY._col0 (type: int), KEY._col1 (type: string), KEY._col2
(type: int)
                              |  outputColumnNames:["_col0","_col1","_col2"]
                              |  Statistics:Num rows: 9149866 Data size: 951586112 Basic stats:
COMPLETE Column stats: NONE
                              |<-Map 1 [SIMPLE_EDGE]
                                 Reduce Output Operator [RS_3]
                                    key expressions:_col0 (type: int), _col1 (type: string),
_col2 (type: int)
                                    Map-reduce partition columns:_col0 (type: int), _col1
(type: string), _col2 (type: int)
                                    sort order:+++
                                    Statistics:Num rows: 18299732 Data size: 1903172224 Basic
stats: COMPLETE Column stats: NONE
                                    Group By Operator [GBY_2]
                                       keys:dim_language_id (type: int), level (type: string),
1 (type: int)
                                       outputColumnNames:["_col0","_col1","_col2"]
                                       Statistics:Num rows: 18299732 Data size: 1903172224
Basic stats: COMPLETE Column stats: NONE
                                       Select Operator [SEL_1]
                                          outputColumnNames:["dim_language_id","level"]
                                          Statistics:Num rows: 18299732 Data size: 1903172224
Basic stats: COMPLETE Column stats: NONE
                                          TableScan [TS_0]
                                             alias:trapteam_fact_event_timeline
                                             Statistics:Num rows: 18299732 Data size: 1903172224
Basic stats: COMPLETE Column stats: NONE
{code}

> Hive on tez produce wrong results when withClause and (outer) joins
> -------------------------------------------------------------------
>
>                 Key: HIVE-13623
>                 URL: https://issues.apache.org/jira/browse/HIVE-13623
>             Project: Hive
>          Issue Type: Bug
>    Affects Versions: 1.1.0, 2.0.0
>            Reporter: JinsuKim
>            Assignee: Sivashankar
>            Priority: Trivial
>
> Hive on mr produce correct results when (outer) joins and withClause. But tez produce
wrong results 
> {code:sql|title=Case1}
> with a as (
> select 1 as c1
> union all
> select 2 as c1
> union all
> select 3 as c1
> ),
> b as (
> select 1 as c1
> )
> select * 
>   from ( select a.c1 as ac1, b.c1 as bc1 
>          from a left outer join b 
>          on a.c1 = b.c1 
>        ) c;
>   where c.bc1 is null
> {code}
> {code:title=case1 result}
> mr :
> +--------+--------+--+
> | c.ac1  | c.bc1  |
> +--------+--------+--+
> | 2      | NULL   |
> | 3      | NULL   |
> +--------+--------+--+
> tez :
> +--------+--------+--+
> | c.ac1  | c.bc1  |
> +--------+--------+--+
> +--------+--------+--+
> {code}
> Case2 as similar to Case1 produces same results.
> {code:sql|title=Case2}
> with a as (
> select * from j1
> ),
> b as (
> select * from j2
> )
> select * 
>   from ( select a.c1 as ac1, b.c1 as bc1 
>          from a left outer join b 
>          on a.c1 = b.c1 
>        ) c
>   where c.bc1 is null;
>   
> drop table j1;
> create table j1 as select c1 from 
> ( select 1 as c1 from default.dual
> union all
> select 2 as c1 from default.dual
> union all
> select 3 as c1 from default.dual
> ) t1;
> drop table j2;
> create table j2 as
> select 1 as c1 from default.dual;
> {code}
> {code:title=case2 result}
> mr :
> +--------+--------+--+
> | c.ac1  | c.bc1  |
> +--------+--------+--+
> | 2      | NULL   |
> | 3      | NULL   |
> +--------+--------+--+
> tez :
> +--------+--------+--+
> | c.ac1  | c.bc1  |
> +--------+--------+--+
> | 2      | NULL   |
> | 3      | NULL   |
> +--------+--------+--+
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message