hive-dev mailing list archives

From "Sankar Hariappan (JIRA)" <j...@apache.org>
Subject [jira] [Created] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
Date Tue, 26 Sep 2017 13:24:00 GMT
Sankar Hariappan created HIVE-17608:
---------------------------------------

             Summary: REPL LOAD should overwrite the data files if exists instead of duplicating it
                 Key: HIVE-17608
                 URL: https://issues.apache.org/jira/browse/HIVE-17608
             Project: Hive
          Issue Type: Sub-task
          Components: HiveServer2, repl
    Affects Versions: 3.0.0
            Reporter: Sankar Hariappan
            Assignee: Sankar Hariappan
             Fix For: 3.0.0


This change is needed to make replay of insert events idempotent.

Currently, MoveTask creates a file under a new name if the destination folder already contains
a file of the same name. This is wrong when the same file is present in both the bootstrap dump
and an incremental dump: by design, the duplicate file in the incremental dump should be ignored
for idempotency, but instead we eventually end up with duplicate files. It is also wrong to simply
retain the file name used in the staging folder. Suppose we get the same insert event twice: the
first time we get the file from the source table folder, and the second time we get it from cm
(change management). The staging names can differ, so we still end up with a duplicate copy.

The right solution is to keep the same file name as in the source table folder. To do that, we
can put the original filename in MoveWork. In MoveTask, if the original filename is set, we don't
generate a new name and simply overwrite the existing file. This needs to be done in both
bootstrap and incremental load.
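
The proposed rule can be illustrated with a minimal, self-contained sketch. This is not Hive code:
the class MoveSketch, the method moveFile, and the "_copy_N" suffix are hypothetical names chosen
for illustration, and java.nio.file stands in for the Hadoop FileSystem API. A null originalName
models a MoveWork without the original filename set.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.StandardCopyOption;

// Hypothetical illustration of the proposed behaviour; not the actual MoveTask.
class MoveSketch {

    /**
     * Moves src into destDir and returns the final path.
     *
     * originalName == null: never clobber; pick a fresh name on collision
     *                       (this is the behaviour that duplicates replicated data).
     * originalName != null: reuse the source-table file name and overwrite,
     *                       so replaying the same event leaves exactly one file.
     */
    static Path moveFile(Path src, Path destDir, String originalName) throws IOException {
        if (originalName != null) {
            Path target = destDir.resolve(originalName);
            return Files.move(src, target, StandardCopyOption.REPLACE_EXISTING);
        }
        String name = src.getFileName().toString();
        Path target = destDir.resolve(name);
        int n = 1;
        // Assumed naming scheme for illustration: append "_copy_N" until the name is free.
        while (Files.exists(target)) {
            target = destDir.resolve(name + "_copy_" + n++);
        }
        return Files.move(src, target);
    }
}
```

With the original name supplied, it no longer matters whether the staged file came from the
source table folder or from cm, or what it was called in staging: both replays land on the same
destination name, and the second one replaces the first.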



