hive-dev mailing list archives

From Remus Rusanu <rem...@microsoft.com>
Subject RE: A question about the derivation of intermediate sum field for decimal average aggregates
Date Fri, 14 Feb 2014 22:09:02 GMT
Hi Xuefu,

I do not have any particular use case in mind. I noticed the problem when I implemented
the vectorized AVG for decimal, which must match your implementation (since we vectorized
only the map-side operator, it had better produce the output expected by the reduce side...).
I thought that since we alter the precision/scale for the result, we may as well alter it
for the intermediate sum field. But if this complicates the use of object inspectors and introduces
maintenance risks, then it is probably not worth it.
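
To illustrate why the map-side output has to match what the reduce side expects, here is a
minimal Python sketch of a two-phase AVG. It is not Hive code: map_side_partial,
reduce_side_final, and the result_scale parameter are hypothetical names, and the half-up
rounding is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def map_side_partial(values):
    """Emit the (count, sum) partial for one map-side batch (hypothetical helper)."""
    total = Decimal(0)
    for v in values:
        total += v  # addition keeps the largest scale seen so far
    return len(values), total


def reduce_side_final(partials, result_scale):
    """Merge partials and divide once, rounding to the result scale."""
    count = sum(c for c, _ in partials)
    total = sum((s for _, s in partials), Decimal(0))
    if count == 0:
        return None  # AVG over no rows has no value
    quantum = Decimal(1).scaleb(-result_scale)  # e.g. 0.000001 for scale 6
    return (total / count).quantize(quantum, rounding=ROUND_HALF_UP)


partials = [
    map_side_partial([Decimal("1.10"), Decimal("2.20")]),
    map_side_partial([Decimal("3.30")]),
]
average = reduce_side_final(partials, result_scale=6)
```

Whatever type the map side uses for the sum, the reduce side has to read the partial with the
same type, which is why both implementations need to agree on it.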

Thanks,
~Remus

-----Original Message-----
From: Xuefu Zhang [mailto:xzhang@cloudera.com] 
Sent: Friday, February 14, 2014 7:18 PM
To: dev@hive.apache.org
Cc: xuefu@apache.org; Eric Hanson (BIG DATA)
Subject: Re: A question about the derivation of intermediate sum field for decimal average aggregates

Remus,

Thanks for looking into this. You're right that the sum() result doesn't increase the scale, but
have you seen the sum UDF return the wrong scale?

As to the implementation of the avg UDF, the object inspector for the sum field is initialized with
a scale + 4, which might not be necessary, but is perhaps harmless. The same object inspector
is also used for the average result, which gives the correct type. I guess it's possible to separate
this into two object inspectors, one for the sum field and one for the avg result, but the difference
might be subtle and questionable. This is because the data may not comply with the metadata
specified for Hive tables. Thus, I'm not sure if truncating data before it's summed is the
right behavior.
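
The concern about data that does not comply with the declared metadata can be sketched as
follows. This is only an illustration in Python, not Hive's behavior: enforce_scale and
compare_sums are hypothetical helpers, the sample values are made up, and half-up rounding is
assumed for the scale enforcement.

```python
from decimal import Decimal, ROUND_HALF_UP


def enforce_scale(value, scale):
    """Round a value to the declared scale (half-up rounding is assumed)."""
    return value.quantize(Decimal(1).scaleb(-scale), rounding=ROUND_HALF_UP)


def compare_sums(values, declared_scale):
    """Return (sum of raw values, sum of values forced to the declared scale)."""
    raw_sum = sum(values, Decimal(0))
    enforced_sum = sum(
        (enforce_scale(v, declared_scale) for v in values), Decimal(0)
    )
    return raw_sum, enforced_sum


# Hypothetical data: the column is declared with scale 2,
# but the stored values carry 3 fractional digits.
values = [Decimal("1.005")] * 3
raw_sum, enforced_sum = compare_sums(values, declared_scale=2)
```

Here the two sums differ (3.015 versus 3.03), so the choice of scale for the sum field can
change the final average when the data carries more fractional digits than the table declares.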

Do you have a use case that suggests one is better than the other?

--Xuefu


On Fri, Feb 14, 2014 at 3:55 AM, Remus Rusanu <remusr@microsoft.com> wrote:

> Hi,
>
> With HIVE-5872 the intermediate sum field for decimal aggregates was
> changed to increase scale by 4. I understand the reasoning for having
> accurate precision/scale for the aggregate output. However, for the
> intermediate sum field of AVG, I believe we should increase precision
> without increasing scale. The sum can grow large, but cannot gain
> digits in the fractional part, so we should increase the precision of
> the sum, but not the scale. When the sum is divided by the count to get
> the average on the reduce side, we should indeed project the value with
> a higher scale.
>
> Opinions?
>
> Thanks,
> ~Remus
>
>
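
The argument in the original message (a sum gains integer digits but no fractional digits,
while the division is what needs extra scale) can be checked with a small Python sketch. It
uses Python's decimal module rather than Hive's decimal type, precision_and_scale is a
hypothetical helper, and the half-up rounding is an assumption.

```python
from decimal import Decimal, ROUND_HALF_UP


def precision_and_scale(value):
    """Return (total digits, digits after the decimal point) of a Decimal."""
    sign, digits, exponent = value.as_tuple()
    scale = max(0, -exponent)
    return max(len(digits), scale), scale


# Three values that each fit decimal(4,2).
values = [Decimal("99.99"), Decimal("99.99"), Decimal("0.01")]
total = sum(values, Decimal("0.00"))

# The integer part grew by one digit; the fractional part did not.
sum_precision, sum_scale = precision_and_scale(total)

# Only the division produces extra fractional digits,
# so only the result is given scale + 4 here.
average = (total / len(values)).quantize(
    Decimal(1).scaleb(-(sum_scale + 4)), rounding=ROUND_HALF_UP
)
```

In this run the sum is 199.99, which needs precision 5 and still has scale 2, while the average
is rounded to 6 fractional digits.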
