hive-issues mailing list archives

From "Edward Capriolo (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-16223) deterministic file naming for bucketing in Hive
Date Wed, 15 Mar 2017 20:27:41 GMT

    [ https://issues.apache.org/jira/browse/HIVE-16223?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15926923#comment-15926923 ]

Edward Capriolo commented on HIVE-16223:
----------------------------------------

+1. It would also be good if WHERE clauses could target specific buckets more easily,
e.g. WHERE bucket=1of24 or WHERE bucketof(column)=5
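
To illustrate what a predicate like bucketof(column)=5 would have to compute, here is a minimal Python sketch. It assumes the bucket is the non-negative hash of the column value modulo the bucket count, and it assumes integer keys hash to themselves; both are simplifications for illustration, not a statement of Hive's exact hashing. The names bucket_of and rows_in_bucket are hypothetical.

```python
from typing import Dict, Iterable, List


def bucket_of(hash_code: int, num_buckets: int) -> int:
    """Map a hash code to a bucket index in [0, num_buckets).

    Assumption: the sign bit is masked off before taking the modulo,
    so negative hash codes still land in a valid bucket.
    """
    if num_buckets <= 0:
        raise ValueError("num_buckets must be positive")
    return (hash_code & 0x7FFFFFFF) % num_buckets


def rows_in_bucket(rows: Iterable[Dict[str, int]], column: str,
                   bucket: int, num_buckets: int) -> List[Dict[str, int]]:
    """Keep only rows whose column value falls into the given bucket.

    Assumption: integer values are used directly as their own hash code.
    """
    return [r for r in rows if bucket_of(r[column], num_buckets) == bucket]


rows = [{"id": 5}, {"id": 29}, {"id": 6}]
selected = rows_in_bucket(rows, "id", bucket=5, num_buckets=24)
```

With 24 buckets, ids 5 and 29 fall into bucket 5 and id 6 does not, so a query engine that knows the bucket function could read a single bucket file instead of scanning all of them.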

> deterministic file naming for bucketing in Hive
> -----------------------------------------------
>
>                 Key: HIVE-16223
>                 URL: https://issues.apache.org/jira/browse/HIVE-16223
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
>
> Bucketing in Hive is currently very fragile.
> 1) Some places determine bucket number from file name.
> 2) Some places determine bucket number from a file's "index" in a sorted list of files
> in the directory.
> 3) It is possible to import files into a bucketed table without any regard for either.
> On top of that, weird rename paths (like _copy_1), subdirectories (e.g. from Tez union,
> or just tables read with recursive input enabled), repeated inserts into the same table, etc.
> can mess with either scheme.
> Therefore I propose we include bucket index and count explicitly in the file name (e.g.
> 000003_0_bucket_3of32). It will alleviate the above, and also may simplify some pieces of
> code that try to account for missing bucket files, multiple files, etc.
> This will require changes to load table logic that is used in ctas, insert, load, import
> etc.; change in logic when getting buckets, as well as when altering table bucketing (to rename
> the files).
> Users will still be able to use old-style bucketing by specifying a non-strict config
> setting (not on by default).
> The conversion of existing tables is the biggest issue. Perhaps the existing tables can
> be "grandfathered" into the non-strict bucketing, with some warnings asking the users to convert,
> and a command to do so in alter table/analyze table.
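
A rough Python sketch of how the proposed naming could be read back, and of why the sorted-position scheme is fragile. Only the example name 000003_0_bucket_3of32 comes from the issue text. The regular expression, the 0-based index, the optional _copy_N suffix handling, and the names parse_bucket and bucket_by_sorted_position are assumptions made for illustration, not Hive code.

```python
import re
from typing import Dict, Iterable, NamedTuple, Optional

# Hypothetical pattern: "..._bucket_<index>of<count>", optionally followed
# by a "_copy_<n>" suffix from a repeated insert (an assumption).
_BUCKET_SUFFIX = re.compile(r"_bucket_(\d+)of(\d+)(?:_copy_\d+)?$")


class BucketInfo(NamedTuple):
    index: int  # assumed 0-based
    count: int


def parse_bucket(file_name: str) -> Optional[BucketInfo]:
    """Return the explicit bucket info from a file name, or None if the
    name does not carry it (for example an old-style file)."""
    match = _BUCKET_SUFFIX.search(file_name)
    if match is None:
        return None
    index, count = int(match.group(1)), int(match.group(2))
    if not 0 <= index < count:
        return None  # inconsistent name, treat as unbucketed
    return BucketInfo(index, count)


def bucket_by_sorted_position(file_names: Iterable[str]) -> Dict[str, int]:
    """Scheme 2 from the issue: bucket number = position of the file
    in the sorted directory listing."""
    return {name: i for i, name in enumerate(sorted(file_names))}


# A repeated insert leaves a _copy_1 file next to the original.
old_style = ["000000_0", "000000_0_copy_1", "000001_0"]
positions = bucket_by_sorted_position(old_style)

explicit = parse_bucket("000003_0_bucket_3of32")
```

In the old-style listing, 000001_0 ends up at sorted position 2 even though its name suggests bucket 1, which is the kind of disagreement between schemes 1) and 2) the issue describes. With the index and count embedded in the name, the extra _copy_1 file would not shift any other file's bucket.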



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
