hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <>
Subject [jira] [Commented] (HIVE-16223) deterministic file naming for bucketing in Hive
Date Wed, 15 Mar 2017 21:00:44 GMT


Eugene Koifman commented on HIVE-16223:

+1. We must have well-defined (enforced) rules about directory structure and file names.

> deterministic file naming for bucketing in Hive
> -----------------------------------------------
>                 Key: HIVE-16223
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Sergey Shelukhin
> Bucketing in Hive is currently very fragile.
> 1) Some places determine bucket number from file name.
> 2) Some places determine bucket number from a file's "index" in a sorted list of files in the directory.
> 3) It is possible to import files into a bucketed table without any regard for either.
> On top of that, weird rename paths (like _copy_1), subdirectories (e.g. from Tez union, or just tables read with recursive input enabled), repeated inserts into the same table, etc. can mess with either scheme.
> Therefore I propose we include bucket index and count explicitly in the file name (e.g. 000003_0_bucket_3of32). It will alleviate the above, and also may simplify some pieces of code that try to account for missing bucket files, multiple files, etc.
> This will require changes to load table logic that is used in ctas, insert, load, import etc.; change in logic when getting buckets, as well as when altering table bucketing (to rename the files).
> Users will still be able to use old-style bucketing by specifying a non-strict config setting (not on by default).
> The conversion of existing tables is the biggest issue. Perhaps the existing tables can be "grandfathered" into the non-strict bucketing, with some warnings asking the users to convert, and a command to do so in alter table/analyze table.
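
For illustration only, here is a minimal Python sketch of the problem and the proposal described in the quoted issue. It is not Hive code. The layout <task>_<attempt>_bucket_<index>of<count> is inferred from the single example name in the description (000003_0_bucket_3of32), the zero-based index is an assumption, and the names parse_bucket_name, bucket_from_prefix and bucket_from_position are hypothetical helpers made up for this sketch.

```python
import re
from typing import List, NamedTuple, Optional

# Assumed layout, inferred from the one example "000003_0_bucket_3of32":
# <task>_<attempt>_bucket_<index>of<count>. Not a confirmed Hive format.
_EXPLICIT_NAME = re.compile(r"(\d+)_(\d+)_bucket_(?P<index>\d+)of(?P<count>\d+)")


class BucketInfo(NamedTuple):
    index: int  # assumed to be zero-based
    count: int


def parse_bucket_name(file_name: str) -> Optional[BucketInfo]:
    """Read bucket index and count from an explicitly named file (hypothetical)."""
    m = _EXPLICIT_NAME.fullmatch(file_name)
    if m is None:
        return None
    index, count = int(m.group("index")), int(m.group("count"))
    # Reject names that are well-formed but inconsistent, e.g. "40of32".
    if count <= 0 or index >= count:
        return None
    return BucketInfo(index, count)


def bucket_from_prefix(file_name: str) -> int:
    """Scheme 1 from the description: take the bucket number from the file name."""
    return int(file_name.split("_")[0])


def bucket_from_position(file_names: List[str], target: str) -> int:
    """Scheme 2 from the description: take the file's index in the sorted listing."""
    return sorted(file_names).index(target)


# A directory after a repeated insert left a "_copy_1" file behind (made-up listing).
listing = ["000000_0", "000000_0_copy_1", "000001_0"]

# The two existing schemes now disagree about the same file.
by_name = bucket_from_prefix("000001_0")
by_position = bucket_from_position(listing, "000001_0")

# With explicit naming, index and count travel with the file itself.
explicit = parse_bucket_name("000003_0_bucket_3of32")
```

In this made-up listing the name-based scheme places 000001_0 in bucket 1 while the position-based scheme places it in bucket 2, which is the kind of disagreement the description calls fragile. A name that carries both index and count can be checked on its own, without looking at the other files in the directory.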

This message was sent by Atlassian JIRA
