hive-dev mailing list archives

From Yin Huai <huaiyin....@gmail.com>
Subject Re: DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases
Date Mon, 26 Aug 2013 11:40:07 GMT
I forgot to add this in my last reply: to generate correct results, you can
set hive.optimize.reducededuplication to false, which turns off
ReduceSinkDeDuplication.

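To illustrate why the partitioning columns matter, here is a rough Python
sketch (not Hive code). It assumes 4 reducers and uses a Java-style
31*h + hashCode combination as a stand-in for how the reduce-side key gets
hashed; the helper names and sample rows are made up for illustration. The
per-group sum stands in for whatever reducer.py computes, which the thread
does not show.

```python
from collections import defaultdict

NUM_REDUCERS = 4  # assumed reducer count, for illustration only


def string_hash(s):
    """Java-style String.hashCode(), kept to 32 bits."""
    h = 0
    for ch in s:
        h = (31 * h + ord(ch)) & 0xFFFFFFFF
    return h


def partition(key_columns, num_reducers=NUM_REDUCERS):
    """Pick a reducer from the partitioning columns (stand-in hash)."""
    h = 0
    for col in key_columns:
        h = (31 * h + string_hash(str(col))) & 0xFFFFFFFF
    return (h & 0x7FFFFFFF) % num_reducers


def run_reducers(rows, partition_cols):
    """Route (grp, val2) rows to reducers, then emit one (grp, sum) row
    per group seen by each reducer, as a custom reducer script might."""
    buckets = defaultdict(list)
    for row in rows:
        buckets[partition(partition_cols(row))].append(row)
    output = []
    for reducer_id in sorted(buckets):
        sums = defaultdict(int)
        for grp, val2 in buckets[reducer_id]:
            sums[grp] += val2
        output.extend(sorted(sums.items()))
    return output


# Hypothetical sample data: two groups, eight distinct val2 values each.
rows = [("a", v) for v in range(8)] + [("b", v) for v in range(8)]

# Expected behavior: DISTRIBUTE BY grp, so every row of a grp shares a reducer.
correct = run_reducers(rows, lambda r: (r[0],))

# Suspected buggy behavior: partitioned on the GROUP BY keys (grp, val2).
wrong = run_reducers(rows, lambda r: (r[0], r[1]))
```

With these sample rows, partitioning on grp alone yields one output row per
grp, while partitioning on (grp, val2) spreads each grp over several
reducers, so the same grp shows up more than once in the output.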

On Sun, Aug 25, 2013 at 9:35 PM, Yin Huai <huaiyin.thu@gmail.com> wrote:

> Created a jira https://issues.apache.org/jira/browse/HIVE-5149
>
>
> On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai <huaiyin.thu@gmail.com> wrote:
>
>> Seems ReduceSinkDeDuplication picked the wrong partitioning columns.
>>
>>
>> On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <skp@rocketfuel.com> wrote:
>>
>>> I think the problem lies within the GROUP BY operation. For this
>>> optimization to work, the GROUP BY's partitioning should be on column
>>> 1 only.
>>>
>>> That won't affect the correctness of the GROUP BY. It can make the
>>> GROUP BY slower, but in this case it will speed up the overall query.
>>>
>>>
>>> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <
>>> mchettiar@rocketfuelinc.com> wrote:
>>>
>>>> I have attached the Hive 0.10 and 0.11 query plans for the sample query
>>>> below, for illustration.
>>>>
>>>>
>>>> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <
>>>> mchettiar@rocketfuelinc.com> wrote:
>>>>
>>>>> Hi,
>>>>>
>>>>> We are using DISTRIBUTE BY with custom reducer scripts in our query
>>>>> workload.
>>>>>
>>>>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT
>>>>> BY and custom reducer scripts produce incorrect results. In particular,
>>>>> rows with the same value in the DISTRIBUTE BY column end up in multiple
>>>>> reducers and thus produce multiple rows in the final result, when we
>>>>> expect only one.
>>>>>
>>>>> I investigated a little bit and discovered the following behavior for
>>>>> Hive 0.11:
>>>>>
>>>>> - Hive 0.11 produces a different plan for the queries that return
>>>>> incorrect results. The extra stage for the DISTRIBUTE BY + Transform is
>>>>> missing, and the Transform operator for the custom reducer script is
>>>>> pushed into the reduce operator tree containing the GROUP BY itself.
>>>>>
>>>>> - However, *if the SORT BY in the query has a DESC order in it*, the
>>>>> right plan is produced, and the results look correct too.
>>>>>
>>>>> Hive 0.10 produces the expected plan, with the right results, in all cases.
>>>>>
>>>>>
>>>>> To illustrate, here is a simplified repro setup:
>>>>>
>>>>> Table:
>>>>>
>>>>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3
>>>>> STRING, val4 INT) ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES
>>>>> TERMINATED BY '\n' STORED AS TEXTFILE;
>>>>>
>>>>> Query:
>>>>>
>>>>> ADD FILE reducer.py;
>>>>>
>>>>> FROM (
>>>>>   SELECT grp, val2
>>>>>   FROM test_cluster
>>>>>   GROUP BY grp, val2
>>>>>   DISTRIBUTE BY grp
>>>>>   SORT BY grp, val2  -- add DESC here to get correct results
>>>>> ) a
>>>>>
>>>>> REDUCE a.*
>>>>> USING 'reducer.py'
>>>>> AS grp, reducedValue;
>>>>>
>>>>>
>>>>> If I understand correctly, this is a bug. Is this a known issue? Any
>>>>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
>>>>> results while we investigate this.
>>>>>
>>>>> I have the repro sample, with test data and scripts, if anybody is
>>>>> interested.
>>>>>
>>>>>
>>>>>
>>>>> Thanks,
>>>>> pala
>>>>>
>>>>
>>>>
>>>
>>
>
