hive-user mailing list archives

From Pala M Muthaia <mchett...@rocketfuelinc.com>
Subject Re: DISTRIBUTE BY works incorrectly in Hive 0.11 in some cases
Date Mon, 26 Aug 2013 19:02:54 GMT
Thanks for following up, Yin.

We realized later that this was due to the reduce deduplication optimization,
and found that turning off the flag avoids the issue.
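
For reference, a minimal sketch of that workaround in a Hive session. It
assumes the property name given in the quoted reply below
(hive.optimize.reducededuplication) and reuses the repro query and
reducer.py from the original report; running EXPLAIN afterwards is only a
suggested way to check whether the separate DISTRIBUTE BY + Transform stage
is back in the plan.

```sql
-- Turn off ReduceSinkDeDuplication for this session only.
SET hive.optimize.reducededuplication=false;

ADD FILE reducer.py;

-- Inspect the plan for the repro query before running it.
EXPLAIN
FROM (
  SELECT grp, val2
  FROM test_cluster
  GROUP BY grp, val2
  DISTRIBUTE BY grp
  SORT BY grp, val2
) a
REDUCE a.*
USING 'reducer.py'
AS grp, reducedValue;
```

The setting applies only to the current session, so other queries keep the
optimization unless it is also changed in the site configuration.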

-pala


On Mon, Aug 26, 2013 at 4:40 AM, Yin Huai <huaiyin.thu@gmail.com> wrote:

> I forgot to add this in my last reply: to generate correct results, you can
> set hive.optimize.reducededuplication to false to turn off
> ReduceSinkDeDuplication.
>
>
> On Sun, Aug 25, 2013 at 9:35 PM, Yin Huai <huaiyin.thu@gmail.com> wrote:
>
> > Created a JIRA: https://issues.apache.org/jira/browse/HIVE-5149
> >
> >
> > On Sun, Aug 25, 2013 at 9:11 PM, Yin Huai <huaiyin.thu@gmail.com> wrote:
> >
> >> Seems ReduceSinkDeDuplication picked the wrong partitioning columns.
> >>
> >>
> >> On Fri, Aug 23, 2013 at 9:15 PM, Shahansad KP <skp@rocketfuel.com> wrote:
> >>
> >>> I think the problem lies within the GROUP BY operation. For this
> >>> optimization to work, the GROUP BY's partitioning should be on column
> >>> 1 only.
> >>>
> >>> That won't affect the correctness of the GROUP BY. It can make the
> >>> GROUP BY slower, but in this case it will speed up the overall query.
> >>>
> >>>
> >>> On Fri, Aug 23, 2013 at 5:55 PM, Pala M Muthaia <
> >>> mchettiar@rocketfuelinc.com> wrote:
> >>>
> >>>> I have attached the Hive 0.10 and 0.11 query plans for the sample
> >>>> query below, for illustration.
> >>>>
> >>>>
> >>>> On Fri, Aug 23, 2013 at 5:35 PM, Pala M Muthaia <
> >>>> mchettiar@rocketfuelinc.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>>
> >>>>> We are using DISTRIBUTE BY with custom reducer scripts in our query
> >>>>> workload.
> >>>>>
> >>>>> After upgrading to Hive 0.11, queries with GROUP BY/DISTRIBUTE BY/SORT
> >>>>> BY and custom reducer scripts produce incorrect results. In
> >>>>> particular, rows with the same value in the DISTRIBUTE BY column end
> >>>>> up in multiple reducers and thus produce multiple rows in the final
> >>>>> result, when we expect only one.
> >>>>>
> >>>>> I investigated a little bit and discovered the following behavior
> >>>>> for Hive 0.11:
> >>>>>
> >>>>> - Hive 0.11 produces a different plan for these queries with
> >>>>> incorrect results. The extra stage for the DISTRIBUTE BY + Transform
> >>>>> is missing and the Transform operator for the custom reducer script
> >>>>> is pushed into the reduce operator tree containing GROUP BY itself.
> >>>>>
> >>>>> - However, *if the SORT BY in the query has a DESC order in it*,
> >>>>> the right plan is produced, and the results look correct too.
> >>>>>
> >>>>> Hive 0.10 produces the expected plan with right results in all cases.
> >>>>>
> >>>>>
> >>>>> To illustrate, here is a simplified repro setup:
> >>>>>
> >>>>> Table:
> >>>>>
> >>>>> CREATE TABLE test_cluster (grp STRING, val1 STRING, val2 INT, val3 STRING, val4 INT)
> >>>>> ROW FORMAT DELIMITED FIELDS TERMINATED BY ',' LINES TERMINATED BY '\n'
> >>>>> STORED AS TEXTFILE;
> >>>>>
> >>>>> Query:
> >>>>>
> >>>>> ADD FILE reducer.py;
> >>>>>
> >>>>> FROM (
> >>>>>   SELECT grp, val2
> >>>>>   FROM test_cluster
> >>>>>   GROUP BY grp, val2
> >>>>>   DISTRIBUTE BY grp
> >>>>>   SORT BY grp, val2  -- add DESC here to get correct results
> >>>>> ) a
> >>>>>
> >>>>> REDUCE a.*
> >>>>> USING 'reducer.py'
> >>>>> AS grp, reducedValue
> >>>>>
> >>>>>
> >>>>> If I understand correctly, this is a bug. Is this a known issue? Any
> >>>>> other insights? We have reverted to Hive 0.10 to avoid the incorrect
> >>>>> results while we investigate this.
> >>>>>
> >>>>> I have the repro sample, with test data and scripts, if anybody is
> >>>>> interested.
> >>>>>
> >>>>>
> >>>>>
> >>>>> Thanks,
> >>>>> pala
> >>>>>
> >>>>
> >>>>
> >>>
> >>
> >
>
