hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Xuefu Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-5878) Hive standard avg UDAF returns double as the return type for some exact input types
Date Fri, 06 Dec 2013 18:27:35 GMT

    [ https://issues.apache.org/jira/browse/HIVE-5878?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13841516#comment-13841516
] 

Xuefu Zhang commented on HIVE-5878:
-----------------------------------

[~ehans] Thank you for your concern. However, I respectfully disagree that the behavior WAS
and IS reasonable for several reasons. First, AVG was probably introduced before decimal,
so there was no better choice than double. Hive has the concept of exact types (int, long,
decimal, etc.) vs approximate types (double, float, etc), and Arithmetic operations (plus,
divide, etc) on exact types generates exact type for accuracy. If average is defined mathematically
as sum/count, then sum(int)/count should result an exact type. Otherwise, avg() and sum()/count
give different result. Another inconsistency exists when avg(decimal) results a decimal. All
those cause inconsistency in Hive's mathematical concept and function behavior, and can create
confusions among users as well.

I understand vectorized current implementation chooses double for sum and uses sum/count to
get another double for average. While this extends the scope of the changes, to me, however,
vectorization is just implementation, which should not dictate high-level concept and consistency.

> Hive standard avg UDAF returns double as the return type for some exact input types
> -----------------------------------------------------------------------------------
>
>                 Key: HIVE-5878
>                 URL: https://issues.apache.org/jira/browse/HIVE-5878
>             Project: Hive
>          Issue Type: Bug
>          Components: Types, UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5878.1.patch, HIVE-5878.patch
>
>
> For standard, no-partial avg result, hive currently returns double as the result type.
> {code}
> hive> desc test;
> OK
> d                   	int                 	None                
> Time taken: 0.051 seconds, Fetched: 1 row(s)
> hive> explain select avg(`d`) from test;  
> ...
>       Reduce Operator Tree:
>         Group By Operator
>           aggregations:
>                 expr: avg(VALUE._col0)
>           bucketGroup: false
>           mode: mergepartial
>           outputColumnNames: _col0
>           Select Operator
>             expressions:
>                   expr: _col0
>                   type: double
> {code}
> However, exact types including integers and decimal should yield exact type. Here is
what MySQL does:
> {code}
> mysql> desc test;
> +-------+--------------+------+-----+---------+-------+
> | Field | Type         | Null | Key | Default | Extra |
> +-------+--------------+------+-----+---------+-------+
> | i     | int(11)      | YES  |     | NULL    |       |
> | b     | tinyint(1)   | YES  |     | NULL    |       |
> | d     | double       | YES  |     | NULL    |       |
> | s     | varchar(5)   | YES  |     | NULL    |       |
> | dd    | decimal(5,2) | YES  |     | NULL    |       |
> +-------+--------------+------+-----+---------+-------+
> mysql> create table test62 as select avg(i) from test;
> mysql> desc test62;
> +-------+---------------+------+-----+---------+-------+
> | Field | Type          | Null | Key | Default | Extra |
> +-------+---------------+------+-----+---------+-------+
> | avg(i) | decimal(14,4) | YES  |     | NULL    |       |
> +-------+---------------+------+-----+---------+-------+
> 1 row in set (0.00 sec)
> {code}



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message