hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Eric Hanson (JIRA)" <>
Subject [jira] [Commented] (HIVE-5996) Query for sum of a long column of a table with only two rows produces wrong result
Date Thu, 12 Dec 2013 18:26:07 GMT


Eric Hanson commented on HIVE-5996:


I'm all for new, useful functionality and better performance for Hive. And I'm all for getting
correct results. I appreciate your contributions and your passion.

But I strongly believe changing behavior from one reasonable alternative to another in a way
that breaks backward compatibility is not the way to go. I have a lot of experience with evolving
a database (SQL Server) over a decade, and have talked to a many people who've been evolving
the product longer than that. From this experience, I can say that changing backward compatibility
(for either functionality or performance, but especially functionality) even in subtle ways
can anger customers/users. 

Any changes to semantics like this should first of all be avoided, and if they can't be avoided,
they need to be rolled out carefully, with a switch to enable backward compatibility. SQL
Server has compatibility levels and "SET options" as switches, and a defined deprecation schedule.
This is kind of process-heavy in the engineering effort, and also causes explosion of the
test matrix. So I am not recommending necessarily that Hive go there, though maybe we need
to have that discussion. I think we're better off being strict about not breaking backward
compatibility unless really needed.

So, I ask that you please close this JIRA without making a patch.

There are a couple of other areas where there is an issue of ANSI SQL compatibility (result
type of int/int and avg(int)). We could have a further discussion on those, though you know
my preference would be to leave the semantics as-is on those since I think backward compatibility
trumps ANSI SQL compatibility for those. If there is no issue of ANSI compatibility, and the
current Hive behavior is reasonable, I'd like us to leave things as they are. I don't think
there is a need to be across-the-board compatible with another system (MySQL or anything else).

Best regards,

P.S. Your specific argument that you can overflow a bigint sum, while technically accurate,
I think is not a significant user issue. I've never heard a complaint about it with SQL Server,
or PDW, our scale-out data warehouse appliance. Really big numbers, like the national debt
in pennies, fit in a bigint, just to put it in perspective. Users can cast the input to decimal
or double if they need more magnitude.

> Query for sum of a long column of a table with only two rows produces wrong result
> ----------------------------------------------------------------------------------
>                 Key: HIVE-5996
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5996.patch
> {code}
> hive> desc test2;
> OK
> l                   	bigint              	None                
> hive> select * from test2;                                 
> OK
> 6666666666666666666
> 5555555555555555555
> hive> select sum(l) from test2;
> OK
> -6224521851487329395
> {code}
> It's believed that a wrap-around error occurred. It's surprising that it happens only
with two rows. Same query in MySQL returns:
> {code}
> mysql> select sum(l) from test;
> +----------------------+
> | sum(l)               |
> +----------------------+
> | 12222222222222222221 |
> +----------------------+
> 1 row in set (0.00 sec)
> {code}
> Hive should accommodate large number of rows. Overflowing with only two rows is very

This message was sent by Atlassian JIRA

View raw message