hive-issues mailing list archives

From "Deepak Jaiswal (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-18350) load data should rename files consistent with insert statements
Date Thu, 01 Feb 2018 03:24:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-18350?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Deepak Jaiswal updated HIVE-18350:
----------------------------------
    Attachment: HIVE-18350.8.patch

> load data should rename files consistent with insert statements
> ---------------------------------------------------------------
>
>                 Key: HIVE-18350
>                 URL: https://issues.apache.org/jira/browse/HIVE-18350
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Deepak Jaiswal
>            Assignee: Deepak Jaiswal
>            Priority: Major
>         Attachments: HIVE-18350.1.patch, HIVE-18350.2.patch, HIVE-18350.3.patch, HIVE-18350.4.patch, HIVE-18350.5.patch, HIVE-18350.6.patch, HIVE-18350.7.patch, HIVE-18350.8.patch
>
>
> Insert statements create files with names ending in 0000_0, 0001_0, etc. Load data, however, keeps the input file name. The result is an inconsistent naming convention, which makes SMB joins difficult in some scenarios and may cause trouble for other types of queries in the future.
> We need a consistent naming convention.
> For a non-bucketed table, Hive renames all the files regardless of how the user named them.
> For a bucketed table in non-strict mode, Hive relies on the user to name each file to match its bucket, and assumes that all the data in a file belongs to the same bucket. In strict mode, loading into a bucketed table is disabled.
> This change will likely affect most of the tests that load data. Because that impact is significant, the work is divided into two subtasks for a smoother merge.
> For existing tables in a customer database, reloading bucketed tables is recommended. Otherwise, if the customer runs an SMB join and there is a bucket for which there is no split, the results may be incorrect. This is not a regression, however, as it would happen even without the patch.
> With this patch and reloaded data, the results should be correct.
> For non-bucketed tables and external tables, there is no difference in behavior, and reloading data is not needed.
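
The naming rules described in the issue can be illustrated with a small sketch. This is not Hive's implementation or the contents of the attached patch. The function names, the six-digit zero padding, the trailing `_0`, and the sorted renaming order are all assumptions made for illustration; the issue only states that insert-created file names end in 0000_0, 0001_0, etc.

```python
import re

# Hypothetical helper: build an insert-style file name.
# Assumption: six-digit zero-padded index followed by "_0".
def insert_style_name(index, width=6, attempt=0):
    return f"{index:0{width}d}_{attempt}"

# Non-bucketed table: every input file is renamed, whatever the user called it.
# Assumption: files are numbered in sorted order of their original names.
def plan_renames_non_bucketed(input_files):
    return {name: insert_style_name(i) for i, name in enumerate(sorted(input_files))}

# Bucketed table, non-strict mode: the bucket is taken from the user's file name.
# Returns None when the name does not follow the <digits>_<digits> pattern.
_BUCKET_RE = re.compile(r"^(\d+)_\d+")

def bucket_id_from_name(file_name):
    match = _BUCKET_RE.match(file_name)
    return int(match.group(1)) if match else None
```

With these assumptions, `plan_renames_non_bucketed(["b.txt", "a.txt"])` maps `a.txt` to `000000_0` and `b.txt` to `000001_0`, while `bucket_id_from_name("000003_0")` yields bucket 3 and a name such as `data.csv` yields no bucket at all.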



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)