hive-dev mailing list archives

From "Chao (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-9112) Query may generate different results depending on the number of reducers
Date Tue, 06 Jan 2015 20:14:35 GMT

    [ https://issues.apache.org/jira/browse/HIVE-9112?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14266681#comment-14266681 ]

Chao commented on HIVE-9112:
----------------------------

Hi [~tedxu], it looks like this is related to Constant Propagation. Here is the (partial) plan with this
optimization enabled:

{noformat}
  ...
 78   Stage: Stage-3
 79     Map Reduce
 80       Map Operator Tree:
 81           TableScan
 82             Reduce Output Operator
 83               key expressions: _col1 (type: int), 1 (type: int)
 84               sort order: ++
 85               Map-reduce partition columns: _col1 (type: int)
 86               Statistics: Num rows: 27 Data size: 3298 Basic stats: COMPLETE Column stats: NONE
 87               value expressions: _col0 (type: int), _col3 (type: int)
 88           TableScan
 89             alias: lineitem
 90             Statistics: Num rows: 100 Data size: 11999 Basic stats: COMPLETE Column stats: NONE
 91             Filter Operator
 92               predicate: ((((l_shipmode = 'AIR') and l_orderkey is not null) and l_linenumber is not null) and (l_linenumber = 1)) (type: boolean)
 93               Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
 94               Select Operator
 95                 expressions: l_orderkey (type: int), 1 (type: int)
 96                 outputColumnNames: _col0, _col1
 97                 Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
 98                 Group By Operator
 99                   keys: _col0 (type: int), _col1 (type: int)
100                   mode: hash
101                   outputColumnNames: _col0, _col1
102                   Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
103                   Reduce Output Operator
104                     key expressions: _col0 (type: int), _col1 (type: int)
105                     sort order: ++
106                     Map-reduce partition columns: _col0 (type: int), _col1 (type: int)
107                     Statistics: Num rows: 6 Data size: 719 Basic stats: COMPLETE Column stats: NONE
108       Reduce Operator Tree:
109         Join Operator
110           condition map:
111                Left Semi Join 0 to 1
112           keys:
113             0 _col1 (type: int), _col4 (type: int)
114             1 _col0 (type: int), _col1 (type: int)
115           outputColumnNames: _col0, _col3
116           Statistics: Num rows: 29 Data size: 3627 Basic stats: COMPLETE Column stats: NONE
117           Select Operator
118             expressions: _col0 (type: int), _col3 (type: int)
119             outputColumnNames: _col0, _col1
120             Statistics: Num rows: 29 Data size: 3627 Basic stats: COMPLETE Column stats: NONE
121             File Output Operator
122               compressed: false
123               Statistics: Num rows: 29 Data size: 3627 Basic stats: COMPLETE Column stats: NONE
124               table:
125                   input format: org.apache.hadoop.mapred.TextInputFormat
126                   output format: org.apache.hadoop.hive.ql.io.HiveIgnoreKeyTextOutputFormat
127                   serde: org.apache.hadoop.hive.serde2.lazy.LazySimpleSerDe
...
{noformat}

And here is the diff for this part (the left side is the plan without the optimization):

{noformat}
83c83
<               key expressions: _col1 (type: int), _col4 (type: int)
---
>               key expressions: _col1 (type: int), 1 (type: int)
85c85
<               Map-reduce partition columns: _col1 (type: int), _col4 (type: int)
---
>               Map-reduce partition columns: _col1 (type: int)
95c95
<                 expressions: l_orderkey (type: int), l_linenumber (type: int)
---
>                 expressions: l_orderkey (type: int), 1 (type: int)
{noformat}

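Notice that on line 85, the MR partition column {{_col4}} has been optimized away, so this side
of the join is now partitioned on {{_col1}} only, while the other side (line 106) is still
partitioned on both {{_col0}} and {{_col1}}. This is inconsistent: with more than one reducer,
rows that should meet in the join can be hashed to different reducers, which produces wrong
results.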
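
To make the effect concrete, here is a rough Python sketch (not Hive code) of a semi join whose two inputs are routed to reducers by hash. The 31-based combining hash, the function names, and the sample keys are all made up for illustration; Hive's actual hash function may differ. The point is only that partitioning one side on one column and the other side on two columns can separate matching rows once there is more than one reducer.

```python
def combine_hash(values):
    # Illustrative combining hash (assumption, not claimed to be Hive's exact function).
    h = 0
    for v in values:
        h = 31 * h + v
    return h


def reducer_for(partition_values, num_reducers):
    # Pick the reducer that receives a row with these partition column values.
    return combine_hash(partition_values) % num_reducers


def semi_join(left_keys, right_keys, num_reducers, left_partition_cols):
    # Each key is a hypothetical (orderkey, linenumber) pair.
    # The right side is always partitioned on the full key, as on line 106 of the plan.
    right_by_reducer = {}
    for key in right_keys:
        r = reducer_for(key, num_reducers)
        right_by_reducer.setdefault(r, set()).add(key)

    # The left side is partitioned only on the columns in left_partition_cols.
    # A left row can only match right rows that landed on the same reducer.
    out = []
    for key in left_keys:
        part = tuple(key[i] for i in left_partition_cols)
        r = reducer_for(part, num_reducers)
        if key in right_by_reducer.get(r, set()):
            out.append(key)
    return out


# Hypothetical sample data: the second column is the constant 1 on both sides.
left = [(10, 1), (11, 1), (12, 1)]
right = [(10, 1), (12, 1)]

consistent_3 = semi_join(left, right, 3, (0, 1))  # both sides use both columns
broken_1 = semi_join(left, right, 1, (0,))        # constant dropped, one reducer
broken_3 = semi_join(left, right, 3, (0,))        # constant dropped, three reducers

print(consistent_3)
print(broken_1)
print(broken_3)
```

In this sketch the mismatch is invisible with a single reducer, because every row lands on reducer 0, but with three reducers the rows for keys 10 and 12 are sent to different reducers on each side and the join returns nothing. That is the same pattern as the missing rows in the test output once {{mapred.reduce.tasks=3}} is set.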

I saw that [~navis] has a [comment|https://issues.apache.org/jira/browse/HIVE-7232?focusedCommentId=14032106&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14032106]
about a similar issue; maybe it's related?

I'm not an expert in Constant Propagation, so I was wondering whether you could take a look at
this issue. Thanks.

> Query may generate different results depending on the number of reducers
> ------------------------------------------------------------------------
>
>                 Key: HIVE-9112
>                 URL: https://issues.apache.org/jira/browse/HIVE-9112
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Chao
>            Assignee: Chao
>
> Some queries may generate different results depending on the number of reducers, for example, tests like ppd_multi_insert.q, join_nullsafe.q, subquery_in.q, etc.
> Take subquery_in.q as an example: if we add
> {noformat}
> set mapred.reduce.tasks=3;
> {noformat}
> to this test file, the result will be different (and wrong):
> {noformat}
> @@ -903,5 +903,3 @@ where li.l_linenumber = 1 and
>  POSTHOOK: type: QUERY
>  POSTHOOK: Input: default@lineitem
>  #### A masked pattern was here ####
> -108570 8571
> -4297   1798
> {noformat}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
