hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Alexander Pivovarov (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-10841) [WHERE col is not null] does not work sometimes for queries with many JOIN statements
Date Tue, 09 Jun 2015 05:31:01 GMT

    [ https://issues.apache.org/jira/browse/HIVE-10841?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14578345#comment-14578345
] 

Alexander Pivovarov commented on HIVE-10841:
--------------------------------------------

[~jpullokkaran], thank you for HIVE-10841.1.patch.
It changes 2 files:
- SemanticAnalyzer.java
- OpProcFactory.java

I tried the fix in SemanticAnalyzer.java only. It solve the issue with my test query.
So, looks like the fix in SemanticAnalyzer.java is enough to solve the issue.

Why do we need fixes in OpProcFactory.java? Should we open separate Jira for them?

2. Looks like we need to rerun bunch of Cli, Tez and Spark tests...

> [WHERE col is not null] does not work sometimes for queries with many JOIN statements
> -------------------------------------------------------------------------------------
>
>                 Key: HIVE-10841
>                 URL: https://issues.apache.org/jira/browse/HIVE-10841
>             Project: Hive
>          Issue Type: Bug
>          Components: Query Planning, Query Processor
>    Affects Versions: 0.13.0, 0.14.0, 0.13.1, 1.2.0, 1.3.0
>            Reporter: Alexander Pivovarov
>            Assignee: Alexander Pivovarov
>         Attachments: HIVE-10841.1.patch, HIVE-10841.patch
>
>
> The result from the following SELECT query is 3 rows but it should be 1 row.
> I checked it in MySQL - it returned 1 row.
> To reproduce the issue in Hive
> 1. prepare tables
> {code}
> drop table if exists L;
> drop table if exists LA;
> drop table if exists FR;
> drop table if exists A;
> drop table if exists PI;
> drop table if exists acct;
> create table L as select 4436 id;
> create table LA as select 4436 loan_id, 4748 aid, 4415 pi_id;
> create table FR as select 4436 loan_id;
> create table A as select 4748 id;
> create table PI as select 4415 id;
> create table acct as select 4748 aid, 10 acc_n, 122 brn;
> insert into table acct values(4748, null, null);
> insert into table acct values(4748, null, null);
> {code}
> 2. run SELECT query
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid
> WHERE
>   L.id = 4436
>   and acct.brn is not null;
> {code}
> the result is 3 rows
> {code}
> 10	122
> NULL	NULL
> NULL	NULL
> {code}
> but it should be 1 row
> {code}
> 10	122
> {code}
> 2.1 "explain select ..." output for hive-1.3.0 MR
> {code}
> STAGE DEPENDENCIES:
>   Stage-12 is a root stage
>   Stage-9 depends on stages: Stage-12
>   Stage-0 depends on stages: Stage-9
> STAGE PLANS:
>   Stage: Stage-12
>     Map Reduce Local Work
>       Alias -> Map Local Tables:
>         a 
>           Fetch Operator
>             limit: -1
>         acct 
>           Fetch Operator
>             limit: -1
>         fr 
>           Fetch Operator
>             limit: -1
>         l 
>           Fetch Operator
>             limit: -1
>         pi 
>           Fetch Operator
>             limit: -1
>       Alias -> Map Local Operator Tree:
>         a 
>           TableScan
>             alias: a
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         acct 
>           TableScan
>             alias: acct
>             Statistics: Num rows: 3 Data size: 31 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: aid is not null (type: boolean)
>               Statistics: Num rows: 2 Data size: 20 Basic stats: COMPLETE Column stats:
NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col5 (type: int)
>                   1 id (type: int)
>                   2 aid (type: int)
>         fr 
>           TableScan
>             alias: fr
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: (loan_id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         l 
>           TableScan
>             alias: l
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: (id = 4436) (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>               HashTable Sink Operator
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>         pi 
>           TableScan
>             alias: pi
>             Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: id is not null (type: boolean)
>               Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>               HashTable Sink Operator
>                 keys:
>                   0 _col6 (type: int)
>                   1 id (type: int)
>   Stage: Stage-9
>     Map Reduce
>       Map Operator Tree:
>           TableScan
>             alias: la
>             Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats:
NONE
>             Filter Operator
>               predicate: (((loan_id is not null and aid is not null) and pi_id is not
null) and (loan_id = 4436)) (type: boolean)
>               Statistics: Num rows: 1 Data size: 14 Basic stats: COMPLETE Column stats:
NONE
>               Map Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 0 to 2
>                 keys:
>                   0 4436 (type: int)
>                   1 4436 (type: int)
>                   2 4436 (type: int)
>                 outputColumnNames: _col5, _col6
>                 Statistics: Num rows: 2 Data size: 8 Basic stats: COMPLETE Column stats:
NONE
>                 Map Join Operator
>                   condition map:
>                        Inner Join 0 to 1
>                        Inner Join 1 to 2
>                   keys:
>                     0 _col5 (type: int)
>                     1 id (type: int)
>                     2 aid (type: int)
>                   outputColumnNames: _col6, _col19, _col20
>                   Statistics: Num rows: 4 Data size: 17 Basic stats: COMPLETE Column
stats: NONE
>                   Map Join Operator
>                     condition map:
>                          Inner Join 0 to 1
>                     keys:
>                       0 _col6 (type: int)
>                       1 id (type: int)
>                     outputColumnNames: _col19, _col20
>                     Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column
stats: NONE
>                     Select Operator
>                       expressions: _col19 (type: int), _col20 (type: int)
>                       outputColumnNames: _col0, _col1
>                       Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column
stats: NONE
>                       File Output Operator
>                         compressed: false
>                         Statistics: Num rows: 4 Data size: 18 Basic stats: COMPLETE Column
stats: NONE
>                         table:
>                             input format: org.apache.hadoop.mapred.TextInputFormat
>                             output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                             serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>       Local Work:
>         Map Reduce Local Work
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
>       Processor Tree:
>         ListSink
> Time taken: 0.57 seconds, Fetched: 142 row(s)
> {code}
> 2.2. "explain select..." output for hive-0.13.1 Tez
> {code}
> STAGE DEPENDENCIES:
>   Stage-1 is a root stage
>   Stage-0 is a root stage
> STAGE PLANS:
>   Stage: Stage-1
>     Tez
>       Edges:
>         Reducer 2 <- Map 1 (SIMPLE_EDGE), Map 4 (SIMPLE_EDGE), Reducer 6 (SIMPLE_EDGE)
>         Reducer 3 <- Reducer 2 (SIMPLE_EDGE), Map 9 (SIMPLE_EDGE)
>         Reducer 6 <- Map 5 (SIMPLE_EDGE), Map 7 (SIMPLE_EDGE), Map 8 (SIMPLE_EDGE)
>       DagName: lcapp_20150528111717_06c57a5b-8dc6-4ce9-bce7-b9e0a7818fe4:1
>       Vertices:
>         Map 1 
>             Map Operator Tree:
>                 TableScan
>                   alias: acct
>                   Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column stats:
NONE
>                   Reduce Output Operator
>                     key expressions: aid (type: int)
>                     sort order: +
>                     Map-reduce partition columns: aid (type: int)
>                     Statistics: Num rows: 1 Data size: 4 Basic stats: COMPLETE Column
stats: NONE
>                     value expressions: acc_n (type: int), brn (type: int)
>         Map 4 
>             Map Operator Tree:
>                 TableScan
>                   alias: a
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column
stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: int)
>                     sort order: +
>                     Map-reduce partition columns: id (type: int)
>                     Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column
stats: NONE
>         Map 5 
>             Map Operator Tree:
>                 TableScan
>                   alias: la
>                   Statistics: Num rows: 28 Data size: 347 Basic stats: COMPLETE Column
stats: NONE
>                   Filter Operator
>                     predicate: (loan_id = 4436) (type: boolean)
>                     Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column
stats: NONE
>                     Reduce Output Operator
>                       key expressions: loan_id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: loan_id (type: int)
>                       Statistics: Num rows: 14 Data size: 173 Basic stats: COMPLETE Column
stats: NONE
>                       value expressions: aid (type: int), pi_id (type: int)
>         Map 7 
>             Map Operator Tree:
>                 TableScan
>                   alias: fr
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column
stats: NONE
>                   Filter Operator
>                     predicate: (loan_id = 4436) (type: boolean)
>                     Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column
stats: NONE
>                     Reduce Output Operator
>                       key expressions: loan_id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: loan_id (type: int)
>                       Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column
stats: NONE
>         Map 8 
>             Map Operator Tree:
>                 TableScan
>                   alias: l
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column
stats: NONE
>                   Filter Operator
>                     predicate: (id = 4436) (type: boolean)
>                     Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column
stats: NONE
>                     Reduce Output Operator
>                       key expressions: id (type: int)
>                       sort order: +
>                       Map-reduce partition columns: id (type: int)
>                       Statistics: Num rows: 23 Data size: 93 Basic stats: COMPLETE Column
stats: NONE
>         Map 9 
>             Map Operator Tree:
>                 TableScan
>                   alias: pi
>                   Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column
stats: NONE
>                   Reduce Output Operator
>                     key expressions: id (type: int)
>                     sort order: +
>                     Map-reduce partition columns: id (type: int)
>                     Statistics: Num rows: 46 Data size: 187 Basic stats: COMPLETE Column
stats: NONE
>         Reducer 2 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 1 to 2
>                 condition expressions:
>                   0 {VALUE._col2}
>                   1 
>                   2 {VALUE._col1} {VALUE._col2}
>                 outputColumnNames: _col2, _col15, _col16
>                 Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column
stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col2 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col2 (type: int)
>                   Statistics: Num rows: 110 Data size: 448 Basic stats: COMPLETE Column
stats: NONE
>                   value expressions: _col15 (type: int), _col16 (type: int)
>         Reducer 3 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                 condition expressions:
>                   0 {VALUE._col1} {VALUE._col2}
>                   1 
>                 outputColumnNames: _col1, _col2
>                 Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column
stats: NONE
>                 Select Operator
>                   expressions: _col1 (type: int), _col2 (type: int)
>                   outputColumnNames: _col0, _col1
>                   Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column
stats: NONE
>                   File Output Operator
>                     compressed: false
>                     Statistics: Num rows: 121 Data size: 492 Basic stats: COMPLETE Column
stats: NONE
>                     table:
>                         input format: org.apache.hadoop.mapred.TextInputFormat
>                         output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
>                         serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
>         Reducer 6 
>             Reduce Operator Tree:
>               Join Operator
>                 condition map:
>                      Inner Join 0 to 1
>                      Inner Join 0 to 2
>                 condition expressions:
>                   0 
>                   1 {VALUE._col1} {VALUE._col2}
>                   2 
>                 outputColumnNames: _col4, _col5
>                 Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column
stats: NONE
>                 Reduce Output Operator
>                   key expressions: _col4 (type: int)
>                   sort order: +
>                   Map-reduce partition columns: _col4 (type: int)
>                   Statistics: Num rows: 50 Data size: 204 Basic stats: COMPLETE Column
stats: NONE
>                   value expressions: _col5 (type: int)
>   Stage: Stage-0
>     Fetch Operator
>       limit: -1
> Time taken: 1.377 seconds, Fetched: 146 row(s)
> {code}
> 3. The workaround is to put "acct.brn is not null" to join condition
> {code}
> select
>   acct.ACC_N,
>   acct.brn
> FROM L
> JOIN LA ON L.id = LA.loan_id
> JOIN FR ON L.id = FR.loan_id
> JOIN A ON LA.aid = A.id
> JOIN PI ON PI.id = LA.pi_id
> JOIN acct ON A.id = acct.aid and acct.brn is not null
> WHERE
>   L.id = 4436;
> OK
> 10	122
> Time taken: 23.479 seconds, Fetched: 1 row(s)
> {code}
> I tried it on hive-1.3.0 (MR) and hive-0.13.1 (MR and Tez) - all combinations have the
issue



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message