hive-dev mailing list archives

From "Xuefu Zhang (JIRA)" <>
Subject [jira] [Commented] (HIVE-5356) Move arithmatic UDFs to generic UDF implementations
Date Wed, 11 Dec 2013 03:59:07 GMT


Xuefu Zhang commented on HIVE-5356:

1. The changes to floating point arithmetic are not backward compatible, and there is no SQL
compliance benefit for that.
The main reason is to be in line with MySQL and to simplify the implementation. It could be kept
in a backward-compatible manner.

2.2 It will not be backward compatible with some UDF implementations (I believe this is the same
as with the change in floating point return type).
The SQL standard says that division of exact types should result in an exact type. Double is non-compliant.
Changing to int type has the same issue you're referring to.
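
To illustrate why returning double from exact-type division matters, here is a minimal sketch in plain Java. It is not Hive code: the class name, the method names, and the choice of a fixed scale with HALF_UP rounding are all assumptions made for the example.

```java
import java.math.BigDecimal;
import java.math.RoundingMode;

// Hypothetical illustration only; none of these names come from Hive.
final class ExactDivisionSketch {
    // Approximate result: the int operands are promoted to binary floating point.
    static double divideAsDouble(int a, int b) {
        return (double) a / b;
    }

    // Exact-type result: a decimal quotient rounded to a caller-chosen scale.
    // The scale policy is an assumption for this sketch.
    static BigDecimal divideAsDecimal(int a, int b, int scale) {
        return BigDecimal.valueOf(a).divide(BigDecimal.valueOf(b), scale, RoundingMode.HALF_UP);
    }

    // Adds the double quotient a / b to a running total `times` times.
    static double sumOfDoubleQuotients(int a, int b, int times) {
        double total = 0.0;
        for (int i = 0; i < times; i++) {
            total += divideAsDouble(a, b);
        }
        return total;
    }

    // Adds the decimal quotient a / b to a running total `times` times.
    static BigDecimal sumOfDecimalQuotients(int a, int b, int scale, int times) {
        BigDecimal total = BigDecimal.ZERO;
        for (int i = 0; i < times; i++) {
            total = total.add(divideAsDecimal(a, b, scale));
        }
        return total;
    }

    public static void main(String[] args) {
        System.out.println(sumOfDoubleQuotients(1, 10, 10));
        System.out.println(sumOfDecimalQuotients(1, 10, 6, 10));
    }
}
```

Summing 1/10 ten times as double does not compare equal to 1.0, while the decimal sum does. An exact type still rounds quotients such as 1/3 to its scale, but the rounding is defined in decimal digits rather than in binary fractions.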

2.2 Integer arithmetic becoming NULL in some cases
First, I don't think there is any standard saying that integer operations should not emit NULL.

NULL is generated when an error occurs (such as overflow or divide by zero). Currently, emitting
NULL is one of the few error handling options that modern databases have, but it is the only one
that Hive has, though Hive isn't consistent about it. I'd argue generating a NULL value is worse than
generating bad or wrong values in error cases. To make things worse, the user is not aware of
that. (Take HIVE-5660 as an example.)
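
The two behaviors under discussion can be sketched in plain Java. This is an illustration only, not Hive's implementation: the class and method names are hypothetical, and a Java null stands in for SQL NULL.

```java
// Hypothetical illustration only; a Java null stands in for SQL NULL.
final class CheckedIntArithmetic {
    // Plain Java int addition wraps around silently on overflow.
    static int addWrapping(int a, int b) {
        return a + b;
    }

    // Returns null instead of a wrapped value when the sum overflows.
    static Integer addOrNull(int a, int b) {
        try {
            return Math.addExact(a, b);
        } catch (ArithmeticException overflow) {
            return null;
        }
    }

    // Returns null for divide by zero and for the one overflowing quotient.
    static Integer divideOrNull(int a, int b) {
        if (b == 0) {
            return null;
        }
        if (a == Integer.MIN_VALUE && b == -1) {
            return null; // the true quotient does not fit in an int
        }
        return a / b;
    }

    public static void main(String[] args) {
        System.out.println(addWrapping(Integer.MAX_VALUE, 1));
        System.out.println(addOrNull(Integer.MAX_VALUE, 1));
        System.out.println(divideOrNull(7, 0));
    }
}
```

With wrapping, the caller receives an ordinary-looking but wrong number. With the null-returning variants, the error is at least visible in the result.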

We may introduce different server modes to configure different error handling (HIVE-5438).

2.3 more than 50x performance degradation for the arithmetic operation

The 50x performance degradation came from a unit test, which doesn't necessarily represent
Hive's overall performance. Hive's performance will not be judged solely by int/int. The bigger
question is: do we need something that works and is right, or something that is fast
but wrong? Performance can be improved down the road, but functionality deviations are hard
to correct, as has been demonstrated in this discussion.

Backward compatibility is a valid concern. However, the question is whether Hive is at a point
where this has to be kept at any cost, or whether we are willing to sacrifice some of it to achieve something
that we think is right.

I have seen arguments that put implementation over functionality and performance over correctness,
which are, in my opinion, ill-constructed.

> Move arithmatic UDFs to generic UDF implementations
> ---------------------------------------------------
>                 Key: HIVE-5356
>                 URL:
>             Project: Hive
>          Issue Type: Task
>          Components: UDF
>    Affects Versions: 0.11.0
>            Reporter: Xuefu Zhang
>            Assignee: Xuefu Zhang
>             Fix For: 0.13.0
>         Attachments: HIVE-5356.1.patch, HIVE-5356.10.patch, HIVE-5356.11.patch, HIVE-5356.12.patch,
HIVE-5356.2.patch, HIVE-5356.3.patch, HIVE-5356.4.patch, HIVE-5356.5.patch, HIVE-5356.6.patch,
HIVE-5356.7.patch, HIVE-5356.8.patch, HIVE-5356.9.patch
> Currently, all of the arithmetic operators, such as add/sub/mult/div, are implemented
as old-style UDFs and java reflection is used to determine the return type TypeInfos/ObjectInspectors,
based on the return type of the evaluate() method chosen for the expression. This works fine
for types that don't have type params.
> The Hive decimal type participates in these operations just like int or double. Unlike
double or int, however, decimal has precision and scale, which cannot be determined by
just looking at the return type (decimal) of the UDF evaluate() method, even though the operands
have certain precision/scale. With the default of "decimal" without precision/scale,
(10, 0) will be the type params. This is certainly not desirable.
> To solve this problem, all of the arithmetic operators would need to be implemented as
GenericUDFs, which allow returning ObjectInspector during the initialize() method. The object
inspectors returned can carry type params, from which the "exact" return type can be determined.
> It's worth mentioning that, for user UDFs implemented in a non-generic way, if the return
type of the chosen evaluate() method is decimal, the return type actually has (10,0) as precision/scale,
which might not be desirable. This needs to be documented.
> This JIRA will cover minus, plus, divide, multiply, mod, and pmod, to limit the scope
of review. The remaining ones will be covered under HIVE-5706.
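
The difference described above can be sketched without any Hive classes. Everything below is hypothetical: DecimalType stands in for a decimal TypeInfo, OldStylePlus for a reflection-resolved UDF, and plusResultType for the kind of work a GenericUDF could do in initialize(). The precision/scale rule for plus (scale is the larger operand scale, precision is the larger integer-digit count plus scale plus one, capped at 38) is a common SQL convention assumed for the example, not something taken from the patches.

```java
import java.lang.reflect.Method;
import java.math.BigDecimal;

// Hypothetical illustration only; these are not Hive classes.
final class DecimalTypeSketch {
    // Stand-in for a decimal type carrying its type params.
    static final class DecimalType {
        final int precision;
        final int scale;

        DecimalType(int precision, int scale) {
            this.precision = precision;
            this.scale = scale;
        }

        @Override
        public String toString() {
            return "decimal(" + precision + "," + scale + ")";
        }
    }

    // Old-style shape: the return type is discovered by reflection on evaluate().
    static final class OldStylePlus {
        public BigDecimal evaluate(BigDecimal a, BigDecimal b) {
            return a.add(b);
        }
    }

    // Reflection sees only the Java class BigDecimal, so the type params
    // fall back to the default (10,0) mentioned in the description.
    static DecimalType typeFromReflection(Class<?> udfClass) {
        try {
            Method m = udfClass.getMethod("evaluate", BigDecimal.class, BigDecimal.class);
            if (m.getReturnType() == BigDecimal.class) {
                return new DecimalType(10, 0);
            }
            throw new IllegalArgumentException("evaluate() does not return a decimal");
        } catch (NoSuchMethodException e) {
            throw new IllegalStateException("no matching evaluate() method", e);
        }
    }

    // Generic-style shape: the result type is computed from the operand types.
    // The rule used here is an assumed convention, capped at precision 38.
    static DecimalType plusResultType(DecimalType a, DecimalType b) {
        int scale = Math.max(a.scale, b.scale);
        int integerDigits = Math.max(a.precision - a.scale, b.precision - b.scale);
        int precision = Math.min(integerDigits + scale + 1, 38);
        return new DecimalType(precision, scale);
    }

    public static void main(String[] args) {
        System.out.println(typeFromReflection(OldStylePlus.class));
        System.out.println(plusResultType(new DecimalType(10, 2), new DecimalType(5, 4)));
    }
}
```

The reflection path returns the same type regardless of the operands, while the second path has access to the operand type params and can derive a result type from them.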
