hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Vlad Gudikov (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (HIVE-17148) Incorrect result for Hive join query with COALESCE in WHERE condition
Date Fri, 28 Jul 2017 13:08:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17148?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16104919#comment-16104919
] 

Vlad Gudikov edited comment on HIVE-17148 at 7/28/17 1:07 PM:
--------------------------------------------------------------

ROOT-CAUSE:
The problem was with the predicates that were created according to HiveJoinAddNotNullRule.
This rule is creating predicates from fields that take part in join filter, no matter if this
fields are used as parameters of functions or not.

SOLUTION:
Create predicate based on functions that take part in filters as well as fields. The point
is to check if left part and right part of the filter is not null, not just fields that are
part of the join filter. I.e we have to tables test1(a1 int, a2 int) and test2(b1). When we
execute following query *select * from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);*
we get to predicates for filter operator:
b1 is not null --- right part 
a1 is not null and a2 is not null -- left part

Applying predicate for left part of join will result in data loss as we exclude rows with
null fields. COALESCE is a good example for this case as the main purpose of COALESCE function
is to get not null values from tables. To fix the data loss we need to check that coalesce
won't bring us null values as we can't join nulls. My fix will check that left part and right
part will look like:

b1 is not null -- right part (still checking fields on null condition)
COALESCE(a1,a2) is not null (checking that whole function won't bring us null values)

In next patch I'm going to change related failed tests with the fixed stage plans.



was (Author: allgoodok):
ROOT-CAUSE:
The problem was with the predicates that were created according to HiveJoinAddNotNullRule.
This rule is creating predicates from fields that take part in join filter, no matter if this
fields are used as parameters of functions or not.

SOLUTION:
Create predicate based on functions that take part in filters as well as fields. The point
is to check if left part and right part of the filter is not null, not just fields that are
part of the join filter. I.e we have to tables test1(a1 int, a2 int) and test2(b1). When we
execute following query strong text*select * from ct1 c1 inner join ct2 c2 on (COALESCE(a1,b1)=a2);*strong
text* we get to predicates for filter operator:
b1 is not null --- right part 
a1 is not null and a2 is not null -- left part

Applying predicate for left part of join will result in data loss as we exclude rows with
null fields. COALESCE is a good example for this case as the main purpose of COALESCE function
is to get not null values from tables. To fix the data loss we need to check that coalesce
won't bring us null values as we can't join nulls. My fix will check that left part and right
part will look like:

b1 is not null -- right part (still checking fields on null condition)
COALESCE(a1,a2) is not null (checking that whole function won't bring us null values)

In next patch I'm going to change related failed tests with the fixed stage plans.


> Incorrect result for Hive join query with COALESCE in WHERE condition
> ---------------------------------------------------------------------
>
>                 Key: HIVE-17148
>                 URL: https://issues.apache.org/jira/browse/HIVE-17148
>             Project: Hive
>          Issue Type: Bug
>          Components: CBO
>    Affects Versions: 2.1.1
>            Reporter: Vlad Gudikov
>            Assignee: Vlad Gudikov
>         Attachments: HIVE-17148.patch
>
>
> The issue exists in Hive-2.1. In Hive-1.2 the query works fine with cbo enabled:
> STEPS TO REPRODUCE:
> {code}
> Step 1: Create a table ct1
> create table ct1 (a1 string,b1 string);
> Step 2: Create a table ct2
> create table ct2 (a2 string);
> Step 3 : Insert following data into table ct1
> insert into table ct1 (a1) values ('1');
> Step 4 : Insert following data into table ct2
> insert into table ct2 (a2) values ('1');
> Step 5 : Execute the following query 
> select * from ct1 c1, ct2 c2 where COALESCE(a1,b1)=a2;
> {code}
> ACTUAL RESULT:
> {code}
> The query returns nothing;
> {code}
> EXPECTED RESULT:
> {code}
> 1       NULL    1
> {code}
> The issue seems to be because of the incorrect query plan. In the plan we can see:
> predicate:(a1 is not null and b1 is not null)
> which does not look correct. As a result, it is filtering out all the rows is any column
mentioned in the COALESCE has null value.
> Please find the query plan below:
> {code}
> Plan optimized by CBO.
> Vertex dependency in root stage
> Map 1 <- Map 2 (BROADCAST_EDGE)
> Stage-0
>   Fetch Operator
>     limit:-1
>     Stage-1
>       Map 1
>       File Output Operator [FS_10]
>         Map Join Operator [MAPJOIN_15] (rows=1 width=4)
>           Conds:SEL_2.COALESCE(_col0,_col1)=RS_7._col0(Inner),HybridGraceHashJoin:true,Output:["_col0","_col1","_col2"]
>         <-Map 2 [BROADCAST_EDGE]
>           BROADCAST [RS_7]
>             PartitionCols:_col0
>             Select Operator [SEL_5] (rows=1 width=1)
>               Output:["_col0"]
>               Filter Operator [FIL_14] (rows=1 width=1)
>                 predicate:a2 is not null
>                 TableScan [TS_3] (rows=1 width=1)
>                   default@ct2,c2,Tbl:COMPLETE,Col:NONE,Output:["a2"]
>         <-Select Operator [SEL_2] (rows=1 width=4)
>             Output:["_col0","_col1"]
>             Filter Operator [FIL_13] (rows=1 width=4)
>               predicate:(a1 is not null and b1 is not null)
>               TableScan [TS_0] (rows=1 width=4)
>                 default@ct1,c1,Tbl:COMPLETE,Col:NONE,Output:["a1","b1"]
> {code}
> This happens only if join is inner type, otherwise HiveJoinAddNotRule which creates this
problem is skipped.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)

Mime
View raw message