hive-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-5996) Query for sum of a long column of a table with only two rows produces wrong result
Date Thu, 12 Dec 2013 20:31:08 GMT


Xuefu Zhang commented on HIVE-5996:

[~ehans] Thanks for sharing your thoughts and your inquiry. For your information, I'm not
trying to make MySQL as the model. My first line of consideration is SQL standard. For a functionality
if there is no SQL standard, Hive doesn't have to invent everything, thus, I do reference
MySQL for ideas, mostly because MySQL and its technical documentation are readily available.
However, this doesn't precludeme following other DB's practice. For instance, precision/scale
determination for arithmetic operations in hive is following SQL server's formula. I'm not
either anti- or pro- MySQL. Nor am I to SQL server, but I strongly believe that following
well-established practices benefits Hive than doing something in a unique, unfortunate way.
An example would be int/int in Hive.

However, a  lot of existing functionality in Hive was put into place when Hive is positioned
as a tool rather than DB, and before all necessary data types were introduced. Take int/int
again as an example, early developer probably didn't even think about SQL-compliance, and
even if he/she did, there wasn't decimal data type as a consideration. As Hive is shift to
a DB on bigdata positioning, I believe that we should start thinking in a perspective other
than performance or backward compatibility. If we restrict ourselves based on  unconscious
decisions made in the past, we may lose a lot of opportunities of doing the right things.

As I worked on decimal precision/scale support, I found a lot of problems in Hive about data
types and their conversions and promotions. In many cases, Hive is not consistent itself.
Let me ask you a question to see if you know the answer: what's the return type of 35 + '3.14',
where 35 is from int column and '3.14' from a string column? Before I made the changes, you
probably would say: wait, let me read the code first. And your answer might be different if
my question were 35/'3.14'. Now, to answer the same questions, I can give right way, and I
have a theory to tell why. In summary, it's a lot of effort to clean up the mess and inconsistency
in Hive from the beginning of my work on decimal.

Now if we use either performance or backward compatibility to shut down what we have achieved,
I don't see how Hive is shifting from a tool to a DB, and how Hive can become adopted as enterprise
grade product.

Hive is still evolving, and that's why I think we have certain luxury of breaking backward
compatibility for doing the right thing. As Ashutosh once mentioned, we don't want to be backward
compatible to a bug. Once Hive is stabilized, it becomes much harder to make backward incompatible
changes, as you know with your experience with SQL server.

I understand your concern about backward compatibility, especially your possible frustration
over vectorization breaking or redoing. On the other hand, I think we are here to help hive
become more useful. A blunt rejection without much consideration and communication doesn't
seem as helpful and constructive as it should be.

> Query for sum of a long column of a table with only two rows produces wrong result
> ----------------------------------------------------------------------------------
>                 Key: HIVE-5996
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: UDF
>    Affects Versions: 0.12.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>         Attachments: HIVE-5996.patch
> {code}
> hive> desc test2;
> OK
> l                   	bigint              	None                
> hive> select * from test2;                                 
> OK
> 6666666666666666666
> 5555555555555555555
> hive> select sum(l) from test2;
> OK
> -6224521851487329395
> {code}
> It's believed that a wrap-around error occurred. It's surprising that it happens only
with two rows. Same query in MySQL returns:
> {code}
> mysql> select sum(l) from test;
> +----------------------+
> | sum(l)               |
> +----------------------+
> | 12222222222222222221 |
> +----------------------+
> 1 row in set (0.00 sec)
> {code}
> Hive should accommodate large number of rows. Overflowing with only two rows is very

This message was sent by Atlassian JIRA

View raw message