hive-issues mailing list archives

From "Sergey Shelukhin (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-13957) vectorized IN is inconsistent with non-vectorized (at least for decimal in (string))
Date Tue, 07 Jun 2016 19:20:21 GMT

    [ https://issues.apache.org/jira/browse/HIVE-13957?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15319171#comment-15319171 ]

Sergey Shelukhin commented on HIVE-13957:
-----------------------------------------

Can you elaborate? The problem is the difference between the approaches, not the type of the
cast per se.

Normal IN, at some point that doesn't really matter, derives the common type for the column
and constants, and casts both columns and constants to that whenever needed.
Vectorization, by contrast, always tries to convert the constants to the column type, the reason
being (I suppose) that the specializations for IN all have a particular column type in mind.
I am not actually very familiar with these and whether it would be easy to incorporate a cast;
I assume the cast of the column would need to come earlier than the specialized IN (i.e. specialized
IN should already be able to utilize values of the correct type straight out of the VRB),
which would require the vectorizer to modify the plan above the IN. Or something like that.
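
To illustrate the difference between the two approaches, here is a toy Python model, not Hive
code. The function names are hypothetical, and Python's str() stands in for whatever string
rendering the decimal-to-string cast actually uses, which is an assumption.

```python
from decimal import Decimal

def in_via_common_string_type(column, constants):
    # Toy model of the non-vectorized path for decimal IN (string):
    # the decimal column is cast to string and compared as strings.
    wanted = set(constants)
    return [str(value) in wanted for value in column]

def in_via_column_type(column, constants):
    # Toy model of the vectorized path: the string constants are
    # converted to the column type and compared as decimals.
    wanted = {Decimal(c) for c in constants}
    return [value in wanted for value in column]

column = [Decimal("1.0"), Decimal("2.50"), Decimal("3")]
constants = ["1", "2.50"]

string_result = in_via_common_string_type(column, constants)
decimal_result = in_via_column_type(column, constants)

print(string_result)   # "1.0" is not the string "1", so the first row is rejected
print(decimal_result)  # 1.0 equals 1 numerically, so the first row is accepted
```

In this model the same predicate accepts a different set of rows depending on which side gets
cast, which is the inconsistency in question.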

We could do that, however, as far as I see, it's not the solution we want, because of the
following.
First, in case of decimal-string, this issue can produce incorrect results, so we want a simple
fix for that, which the above isn't.
From a long-term perspective, I'd say we need to prohibit implicit casts in this case
(I opened a separate JIRA) AND/OR change the non-vectorized pipeline rather than the vectorized one,
because casting the decimal column to string in this case (which is what the non-vectorized IN does)
is not intuitively logical for the user and may produce unexpected results.

With the latter in mind, we /could/ fix the proximate issue in vectorized code (cast to decimal(38,38)
that ends up converting all reasonable values to null), e.g. constrain the precision and scale
to the column type (potentially +2/+1 for NOT, although the enforcement will probably convert
the values that don't fit to NULL), assuming the values are trimmed, since more should never
be needed. But that's still inconsistent with normal IN, and we should probably do it later.
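
A rough Python sketch of why the decimal(38,38) cast loses values, and of constraining to the
column type instead. The enforce() helper is hypothetical, and its rounding mode and
"does not fit -> NULL" rule are assumptions rather than Hive's actual enforcement code.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Wide working context so quantizing to 38 fractional digits does not overflow
# Python's default 28-digit precision.
_CTX = Context(prec=80)

def enforce(value, precision, scale):
    """Fit value into decimal(precision, scale); return None (NULL) if it does not fit."""
    step = Decimal((0, (1,), -scale))  # 10 ** -scale
    rounded = value.quantize(step, rounding=ROUND_HALF_UP, context=_CTX)
    integer_digits = max(rounded.adjusted() + 1, 0)
    if integer_digits > precision - scale:
        return None  # no room for the digits before the decimal point
    return rounded

# decimal(38,38) leaves no room for integer digits, so 1.5 becomes NULL.
print(enforce(Decimal("1.5"), 38, 38))
# Only values below 1 in magnitude survive that cast.
print(enforce(Decimal("0.25"), 38, 38))
# Constraining to a hypothetical column type decimal(10,2) keeps the value.
print(enforce(Decimal("1.5"), 10, 2))
```

With the constant constrained to the column's precision and scale, a constant that still does
not fit could never have matched a column value under strict equality anyway.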

Actually, come to think of it, this might also be broken for other UDFs, where constraining is
not as easy or at least is different (e.g. BETWEEN needs more than strict equality, and with
arithmetic ops, if this problem applies, the only way would be to derive the maximum values
from the value list). I can also file a separate JIRA for that...


> vectorized IN is inconsistent with non-vectorized (at least for decimal in (string))
> ------------------------------------------------------------------------------------
>
>                 Key: HIVE-13957
>                 URL: https://issues.apache.org/jira/browse/HIVE-13957
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>            Assignee: Sergey Shelukhin
>         Attachments: HIVE-13957.patch, HIVE-13957.patch
>
>
> The cast is applied to the column in regular IN, but vectorized IN applies it to the IN() list



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
