hive-issues mailing list archives

From "Chao Sun (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17010) Fix the overflow problem of Long type in SetSparkReducerParallelism
Date Thu, 06 Jul 2017 03:19:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17010?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16075852#comment-16075852
] 

Chao Sun commented on HIVE-17010:
---------------------------------

Ah, I see. Sometimes the stats estimation can generate negative values, in which case Hive
will use {{Long.MAX_VALUE}} for both the # of rows and the data size. One case I observed
previously:
{code}
not ((P1 or P2) or P3)
{code}
When no column stats are available, Hive simply divides the # of input rows by 2 for each
predicate evaluation. Suppose the total number of input rows is 10; then {{P1}}, {{P2}} and {{P3}}
will each yield 5. The {{or}} operator adds the values from both sides, so the expression {{((P1
or P2) or P3)}} generates 30 rows. The {{not}} operator, on the other hand, subtracts
the value of its associated expression from the total input rows. Therefore in the end you
get {{10 - 30 = -20}}.

For the solution you proposed, I'm inclined to use {{StatsUtils.safeAdd}}, but either way
should be fine.
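
Just to make the idea concrete, here is a minimal, hypothetical sketch of a clamping add (for illustration only; not claimed to be the actual {{StatsUtils.safeAdd}} implementation). It shows how an accumulated estimate can saturate at {{Long.MAX_VALUE}} instead of wrapping into a negative value:

{code}
public final class SafeMathSketch {

  // Clamping add for non-negative counters: instead of wrapping around on
  // overflow, saturate at Long.MAX_VALUE. Assumes a >= 0 and b >= 0.
  public static long safeAdd(long a, long b) {
    if (a > Long.MAX_VALUE - b) {
      return Long.MAX_VALUE;
    }
    return a + b;
  }

  public static void main(String[] args) {
    long numberOfBytes = Long.MAX_VALUE;  // accumulated size so far
    long sibling = 1022672L;              // next sibling's data size
    System.out.println(numberOfBytes + sibling);          // wraps: -9223372036853753137
    System.out.println(safeAdd(numberOfBytes, sibling));  // clamped: 9223372036854775807
  }
}
{code}

With that behavior, the accumulated estimate can never go negative, so the parallelism computation keeps working with a (saturated) upper bound.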

> Fix the overflow problem of Long type in SetSparkReducerParallelism
> -------------------------------------------------------------------
>
>                 Key: HIVE-17010
>                 URL: https://issues.apache.org/jira/browse/HIVE-17010
>             Project: Hive
>          Issue Type: Bug
>            Reporter: liyunzhang_intel
>            Assignee: liyunzhang_intel
>         Attachments: HIVE-17010.1.patch
>
>
> We use [numberOfBytes|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L129]
> to collect the numberOfBytes of the siblings of a specified RS. We use the Long type, and it overflows
> when the data is too big. Once this happens, the parallelism is decided by [sparkMemoryAndCores.getSecond()|https://github.com/apache/hive/blob/master/ql/src/java/org/apache/hadoop/hive/ql/optimizer/spark/SetSparkReducerParallelism.java#L184].
> If spark.dynamic.allocation.enabled is true, sparkMemoryAndCores.getSecond() is a dynamic value
> decided by the Spark runtime. For example, its value may be 5 or 15 randomly, and there is a
> possibility that the value may be 1. The main problem here is the overflow when adding Long values.
> You can reproduce the overflow problem with the following code:
> {code}
> import java.math.BigInteger;
>
> public class OverflowExample {
>     public static void main(String[] args) {
>         long a1 = 9223372036854775807L;   // Long.MAX_VALUE
>         long a2 = 1022672L;
>         long res = a1 + a2;
>         System.out.println(res);          // -9223372036853753137 (wrapped around)
>         BigInteger b1 = BigInteger.valueOf(a1);
>         BigInteger b2 = BigInteger.valueOf(a2);
>         BigInteger bigRes = b1.add(b2);
>         System.out.println(bigRes);       // 9223372036855798479 (no overflow)
>     }
> }
> {code}
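
As a side note (not part of the original report), the same overflow can also be detected without BigInteger by using {{Math.addExact}}, which throws {{ArithmeticException}} on overflow; a small sketch:

{code}
import java.math.BigInteger;

public class CheckedAddExample {
    public static void main(String[] args) {
        long a1 = 9223372036854775807L;  // Long.MAX_VALUE
        long a2 = 1022672L;
        try {
            System.out.println(Math.addExact(a1, a2));
        } catch (ArithmeticException e) {
            // Overflow detected: fall back to BigInteger (or clamp to Long.MAX_VALUE).
            System.out.println(BigInteger.valueOf(a1).add(BigInteger.valueOf(a2)));  // 9223372036855798479
        }
    }
}
{code}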



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
