Mailing-List: contact issues-help@hive.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@hive.apache.org
Date: Fri, 10 Feb 2017 22:00:43 +0000 (UTC)
From: "Wei Zheng (JIRA)" <jira@apache.org>
To: issues@hive.apache.org
Message-ID: <JIRA.13041917.1486693257000.52073.1486764043422@Atlassian.JIRA>
In-Reply-To: <JIRA.13041917.1486693257000@Atlassian.JIRA>
References: <JIRA.13041917.1486693257000@Atlassian.JIRA> <JIRA.13041917.1486693257221@jira-lw-us.apache.org>
Subject: [jira] [Commented] (HIVE-15872) The PERCENTILE UDAF does not work
 with empty set
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: quoted-printable
archived-at: Fri, 10 Feb 2017 22:00:48 -0000


    [ https://issues.apache.org/jira/browse/HIVE-15872?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15861898#comment-15861898 ]

Wei Zheng commented on HIVE-15872:
----------------------------------

[~debugger87] Thanks for the patch. The fix looks good. Can you add a unit test for the failing case?

> The PERCENTILE UDAF does not work with empty set
> ------------------------------------------------
>
>                 Key: HIVE-15872
>                 URL: https://issues.apache.org/jira/browse/HIVE-15872
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>            Reporter: Chaozhong Yang
>            Assignee: Chaozhong Yang
>             Fix For: 2.1.2
>
>         Attachments: HIVE-15872.patch
>
>
> 1. Original SQL:
> select
>     percentile_approx(
>         column0,
>         array(0.50, 0.70, 0.90, 0.95, 0.99)
>     )
> from
>     my_table
> where
>     date = '20170207'
>     and column1 = 'value1'
>     and column2 = 'value2'
>     and column3 = 'value3'
>     and column4 = 'value4'
>     and column5 = 'value5'
> 2. Exception StackTrace:
> Error: java.lang.RuntimeException: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}}
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:256)
>     at org.apache.hadoop.mapred.ReduceTask.runOldReducer(ReduceTask.java:453)
>     at org.apache.hadoop.mapred.ReduceTask.run(ReduceTask.java:401)
>     at org.apache.hadoop.mapred.YarnChild$2.run(YarnChild.java:163)
>     at java.security.AccessController.doPrivileged(Native Method)
>     at javax.security.auth.Subject.doAs(Subject.java:422)
>     at org.apache.hadoop.security.UserGroupInformation.doAs(UserGroupInformation.java:1671)
>     at org.apache.hadoop.mapred.YarnChild.main(YarnChild.java:158)
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: Hive Runtime Error while processing row (tag=0) {"key":{},"value":{"_col0":[0.0,10000.0]}}
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:244)
>     ... 7 more
> Caused by: org.apache.hadoop.hive.ql.metadata.HiveException: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
>     at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:766)
>     at org.apache.hadoop.hive.ql.exec.mr.ExecReducer.reduce(ExecReducer.java:235)
>     ... 7 more
> Caused by: java.lang.IndexOutOfBoundsException: Index: 2, Size: 2
>     at java.util.ArrayList.rangeCheck(ArrayList.java:653)
>     at java.util.ArrayList.get(ArrayList.java:429)
>     at org.apache.hadoop.hive.ql.udf.generic.NumericHistogram.merge(NumericHistogram.java:134)
>     at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFPercentileApprox$GenericUDAFPercentileApproxEvaluator.merge(GenericUDAFPercentileApprox.java:318)
>     at org.apache.hadoop.hive.ql.udf.generic.GenericUDAFEvaluator.aggregate(GenericUDAFEvaluator.java:188)
>     at org.apache.hadoop.hive.ql.exec.GroupByOperator.updateAggregations(GroupByOperator.java:612)
>     at org.apache.hadoop.hive.ql.exec.GroupByOperator.processAggr(GroupByOperator.java:851)
>     at org.apache.hadoop.hive.ql.exec.GroupByOperator.processKey(GroupByOperator.java:695)
>     at org.apache.hadoop.hive.ql.exec.GroupByOperator.process(GroupByOperator.java:761)
>     ... 8 more
> 3. review data:
> select
>     column0
> from
>     my_table
> where
>     date = '20170207'
>     and column1 = 'value1'
>     and column2 = 'value2'
>     and column3 = 'value3'
>     and column4 = 'value4'
>     and column5 = 'value5'
> After running this SQL, we found that the result is NULL.
> 4. What is the meaning of [0.0, 10000.0] in the stack trace?
> In GenericUDAFPercentileApproxEvaluator, the `merge` method processes an ArrayList named partialHistogram. Normally, the basic structure of partialHistogram is [npercentiles, percentile0, percentile1..., nbins, bin0.x, bin0.y, bin1.x, bin1.y,...]. However, if we process NULL (empty set) column values, partialHistogram contains only [npercentiles(0), nbins(10000)]. That is why the stack trace shows a strange row: {"key":{},"value":{"_col0":[0.0,10000.0]}}
> Before we call histogram#merge (the on-line histogram algorithm from the paper: http://www.jmlr.org/papers/volume11/ben-haim10a/ben-haim10a.pdf ), the elements that store percentiles should be removed from partialHistogram, as in `partialHistogram.subList(0, nquantiles+1).clear();`. In the case of an empty set, GenericUDAFPercentileApproxEvaluator does not remove them. Consequently, NumericHistogram merges a list that contains only 2 elements ([0, 10000.0]) and throws IndexOutOfBoundsException.
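
The failure described in the issue can be reproduced outside Hive with a small model of the list layout. The sketch below is only an illustration under stated assumptions: the class PartialHistogramSketch and its methods are hypothetical stand-ins, not Hive's NumericHistogram or GenericUDAFPercentileApproxEvaluator, and the "always strip the header" variant is one plausible remedy rather than the content of the attached patch. It assumes the header is currently removed only when the percentile count is greater than 0, which matches the behaviour reported for the empty set.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Simplified model of the layout described in the issue; not Hive's actual classes.
class PartialHistogramSketch {

    // Stand-in for the histogram merge step. Expects [nbins, x0, y0, x1, y1, ...]
    // and returns the (x, y) pairs it could read.
    static List<double[]> readBins(List<Double> serialized) {
        List<double[]> bins = new ArrayList<>();
        for (int i = 1; i < serialized.size(); i += 2) {
            // get(i + 1) throws IndexOutOfBoundsException when a pair is incomplete
            bins.add(new double[] {serialized.get(i), serialized.get(i + 1)});
        }
        return bins;
    }

    // Assumed current behaviour: the percentile header is removed only when nquantiles > 0.
    static List<double[]> mergeStrippingOnlyNonEmpty(List<Double> partial) {
        List<Double> copy = new ArrayList<>(partial);
        int nquantiles = copy.get(0).intValue();
        if (nquantiles > 0) {
            copy.subList(0, nquantiles + 1).clear();
        }
        return readBins(copy);
    }

    // Assumed remedy: always remove the header, even when it only holds the count 0.
    static List<double[]> mergeAlwaysStripping(List<Double> partial) {
        List<Double> copy = new ArrayList<>(partial);
        int nquantiles = copy.get(0).intValue();
        copy.subList(0, nquantiles + 1).clear();
        return readBins(copy);
    }

    public static void main(String[] args) {
        // The row from the stack trace: [npercentiles(0), nbins(10000)]
        List<Double> emptySet = Arrays.asList(0.0, 10000.0);
        System.out.println("always stripping: " + mergeAlwaysStripping(emptySet).size() + " bins");
        try {
            mergeStrippingOnlyNonEmpty(emptySet);
        } catch (IndexOutOfBoundsException e) {
            System.out.println("conditional stripping failed: " + e.getMessage());
        }
    }
}
```

With the header left in place, the value 10000.0 is read as bin0.x and the lookup for bin0.y runs past the end of the list, which is consistent with the "Index: 2, Size: 2" in the trace. A unit test for the failing case could assert the same two outcomes for the input [0.0, 10000.0].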


--
This message was sent by Atlassian JIRA
(v6.3.15#6346)