hive-issues mailing list archives

From "Thomas Poepping (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-15852) Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException
Date Wed, 08 Feb 2017 20:14:41 GMT

    [ https://issues.apache.org/jira/browse/HIVE-15852?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15858485#comment-15858485 ]

Thomas Poepping commented on HIVE-15852:
----------------------------------------

[~ashutoshc] Ashutosh, sorry it took so long to open this Jira issue. Here's a summary of
what I've found so far. Reverting HIVE-13040 is the easiest solution, but I really don't want
to do that: I think the performance gains can be large, especially in the blobstore (s3a or
azure) case, as empty file creation is far from free.

Happy to hear suggestions and start a conversation.

> Tablesampling on Tez in low-record case throws ArrayIndexOutOfBoundsException
> -----------------------------------------------------------------------------
>
>                 Key: HIVE-15852
>                 URL: https://issues.apache.org/jira/browse/HIVE-15852
>             Project: Hive
>          Issue Type: Bug
>          Components: Tez
>    Affects Versions: 2.1.1
>            Reporter: Thomas Poepping
>
> Due to HIVE-13040 ( https://issues.apache.org/jira/browse/HIVE-13040 ), which doesn't create empty files to represent empty buckets when Hive is on Tez, a couple things are broken.
> First of all, if there are empty buckets (which is possible with large datasets in the partitioned-bucketed case), tablesampling will not work if you're referencing a bucket number higher than the number of files.
> e.g. In some partition 'p', there are three rows. The table 't' is clustered into ten buckets. Even if every row hashes to a different bucket, at most three bucket files will be created. If we do select * from t tablesample (bucket x out of 10) where <selecting from p> (where x > 3), an ArrayIndexOutOfBoundsException will be thrown because Hive assumes there are only three buckets.
> Second, other applications (such as Pig) may be making assumptions about the number of files equaling the number of buckets.
> Possible fixes:
> * Revert HIVE-13040
> * Change how tablesampling is implemented to accept the possibility that number of files != number of buckets
> ** Would require coordination across projects to change assumptions
> Things to consider:
> * what performance gains are there from not creating empty files?
> * if the gains are large, are we willing to lose them? (by reverting HIVE-13040)
> * _how else can we avoid creating unnecessary files, while still maintaining invariants other applications expect?_
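
To make the failure mode in the description concrete, here is a minimal Java sketch. It is not Hive's actual sampling code: the class and method names are hypothetical, and it assumes bucket files are named with a zero-padded, zero-based bucket id followed by an underscore (for example 000004_0). The naive lookup indexes the file list directly by bucket number and fails once empty buckets have no file. The tolerant lookup reads the bucket id from the file name and treats a missing file as an empty bucket, which is roughly what the second possible fix would need.

```java
import java.util.Optional;

// Hypothetical illustration only; none of these names come from Hive.
class BucketSamplingSketch {

    // Naive lookup: assumes exactly one file per bucket, in bucket order.
    // Throws ArrayIndexOutOfBoundsException when bucket > files.length.
    static String naiveFileForBucket(String[] files, int bucket) {
        return files[bucket - 1];
    }

    // Tolerant lookup: derives the zero-based bucket id from the file name
    // prefix (assumed format "<bucketId>_<attempt>") and returns empty when
    // no file exists for the requested one-based bucket number.
    static Optional<String> fileForBucket(String[] files, int bucket) {
        for (String name : files) {
            int sep = name.indexOf('_');
            if (sep <= 0) {
                continue; // not a bucket file under the assumed naming scheme
            }
            int id = Integer.parseInt(name.substring(0, sep));
            if (id == bucket - 1) {
                return Optional.of(name);
            }
        }
        return Optional.empty();
    }

    public static void main(String[] args) {
        // Three rows landed in buckets 2, 5 and 9 of 10; the other seven
        // buckets are empty and have no file.
        String[] files = {"000001_0", "000004_0", "000008_0"};

        System.out.println(fileForBucket(files, 5));
        System.out.println(fileForBucket(files, 7));

        try {
            naiveFileForBucket(files, 7);
        } catch (ArrayIndexOutOfBoundsException e) {
            System.out.println("naive lookup failed for bucket 7");
        }
    }
}
```

Returning an empty Optional lets the caller treat a missing file as an empty bucket instead of failing, but any other application that still assumes one file per bucket would need a matching change.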



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
