hive-issues mailing list archives

From "Hive QA (JIRA)" <>
Subject [jira] [Commented] (HIVE-17608) REPL LOAD should overwrite the data files if exists instead of duplicating it
Date Wed, 04 Oct 2017 16:23:00 GMT


Hive QA commented on HIVE-17608:

Here are the results of testing the latest attachment:

{color:green}SUCCESS:{color} +1 due to 4 test(s) being added or modified.

{color:red}ERROR:{color} -1 due to 6 failed/errored test(s), 11200 tests executed
*Failed tests:*
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[optimize_nullscan] (batchId=162)
org.apache.hadoop.hive.cli.TestMiniLlapLocalCliDriver.testCliDriver[union_fast_stats] (batchId=157)
org.apache.hadoop.hive.cli.TestTezPerfCliDriver.testCliDriver[query23] (batchId=240)

Test results:
Console output:
Test logs:

Executing org.apache.hive.ptest.execution.TestCheckPhase
Executing org.apache.hive.ptest.execution.PrepPhase
Executing org.apache.hive.ptest.execution.ExecutionPhase
Executing org.apache.hive.ptest.execution.ReportingPhase
Tests exited with: TestsFailedException: 6 tests failed

This message is automatically generated.

ATTACHMENT ID: 12890319 - PreCommit-HIVE-Build

> REPL LOAD should overwrite the data files if exists instead of duplicating it
> -----------------------------------------------------------------------------
>                 Key: HIVE-17608
>                 URL:
>             Project: Hive
>          Issue Type: Sub-task
>          Components: HiveServer2, repl
>    Affects Versions: 3.0.0
>            Reporter: Sankar Hariappan
>            Assignee: Sankar Hariappan
>              Labels: DR, pull-request-available, replication
>             Fix For: 3.0.0
>         Attachments: HIVE-17608.01.patch, HIVE-17608.02.patch
> This change makes the insert event idempotent.
> Currently, MoveTask creates a new file if the destination folder already contains a file of the same name. This is wrong when the same file appears in both the bootstrap dump and the incremental dump (by design, a duplicate file in the incremental dump will be ignored for idempotency): we eventually end up with duplicate files. It is also wrong to simply retain the filename from the staging folder. Suppose we get the same insert event twice: the first time we get the file from the source table folder, and the second time we get it from cm. We still end up with a duplicate copy. The right solution is to keep the same filename as in the source table folder.
> To do that, we can put the original filename in MoveWork; in MoveTask, if the original filename is set, we do not generate a new name and simply overwrite. We need to do this in both bootstrap and incremental load.
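
The naming rule described above can be sketched roughly as follows. This is a minimal illustration, not Hive's MoveTask or MoveWork code: the class `MoveNameSketch`, its methods, the `_copy_N` suffix, and the sample file names are all hypothetical, and the destination folder is modelled as a plain set of file names.

```java
import java.util.HashSet;
import java.util.Set;

/** Illustrative sketch only; none of these names come from Hive. */
final class MoveNameSketch {

    /**
     * Picks the file name to use in the destination folder.
     *
     * @param existing     names already present in the destination folder
     * @param stagedName   name of the file in the staging folder
     * @param originalName name of the file in the source table folder, or null if not set
     */
    static String targetName(Set<String> existing, String stagedName, String originalName) {
        if (originalName != null) {
            // Replication path: reuse the source name so a replayed event
            // lands on the same file and overwrites it.
            return originalName;
        }
        // Default path: never overwrite, so derive a fresh name on collision.
        String candidate = stagedName;
        int n = 1;
        while (existing.contains(candidate)) {
            candidate = stagedName + "_copy_" + n++;
        }
        return candidate;
    }

    /** Simulates landing one file in the destination; returns the resulting file count. */
    static int load(Set<String> dest, String stagedName, String originalName) {
        dest.add(targetName(dest, stagedName, originalName));
        return dest.size();
    }

    public static void main(String[] args) {
        Set<String> dest = new HashSet<>();
        // The same insert event replayed twice, staged under different names.
        load(dest, "000000_0", "000000_0");
        load(dest, "cm_000000_0", "000000_0");
        System.out.println(dest); // a single entry: the replay did not duplicate the file
    }
}
```

With the original filename set, replaying the event leaves one file in the destination. With it unset, the second load of a colliding name gets a new name, which is the duplication the issue describes.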

This message was sent by Atlassian JIRA
