hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Peter Varga (Jira)" <>
Subject [jira] [Commented] (HIVE-23703) Major QB compaction with multiple FileSinkOperators results in data loss and one original file
Date Wed, 01 Jul 2020 07:03:00 GMT


Peter Varga commented on HIVE-23703:

[~klcopp] looks like one of the new test is flaky:

> Major QB compaction with multiple FileSinkOperators results in data loss and one original
> ----------------------------------------------------------------------------------------------
>                 Key: HIVE-23703
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Karen Coppage
>            Assignee: Karen Coppage
>            Priority: Critical
>              Labels: compaction, pull-request-available
>             Fix For: 4.0.0
>          Time Spent: 1h 20m
>  Remaining Estimate: 0h
> h4. Problems
> Example:
> {code:java}
> drop table if exists tbl2;
> create transactional table tbl2 (a int, b int) clustered by (a) into 4 buckets stored
as ORC TBLPROPERTIES('transactional'='true','transactional_properties'='default');
> insert into tbl2 values(1,2),(1,3),(1,4),(2,2),(2,3),(2,4);
> insert into tbl2 values(3,2),(3,3),(3,4),(4,2),(4,3),(4,4);
> insert into tbl2 values(5,2),(5,3),(5,4),(6,2),(6,3),(6,4);{code}
> E.g. in the example above, bucketId=0 when a=2 and a=6.
> 1. Data loss 
>  In non-acid tables, an operator's temp files are named with their task id. Because of
this snippet, temp files in the FileSinkOperator for compaction tables are identified by their
> {code:java}
> if (conf.isCompactionTable()) {
>  fsp.initializeBucketPaths(filesIdx, AcidUtils.BUCKET_PREFIX + String.format(AcidUtils.BUCKET_DIGITS,
>  isNativeTable(), isSkewedStoredAsSubDirectories);
>  } else {
>  fsp.initializeBucketPaths(filesIdx, taskId, isNativeTable(), isSkewedStoredAsSubDirectories);
>  }
> {code}
> So 2 temp files containing data with a=2 and a=6 will be named bucket_0 and not 000000_0
and 000000_1 as they would normally.
>  In FileSinkOperator.commit, when data with a=2, filename: bucket_0 is moved from _task_tmp.-ext-10002
to _tmp.-ext-10002, it overwrites the files already there with a=6 data, because it too is
named bucket_0. You can see in the logs:
> {code:java}
>  WARN [LocalJobRunner Map Task Executor #0] exec.FileSinkOperator: Target path file:.../hive/ql/target/tmp/org.apache.hadoop.hive.ql.TestTxnNoBuckets-1591107230237/warehouse/testmajorcompaction/base_0000002_v0000013/.hive-staging_hive_2020-06-02_07-15-21_771_8551447285061957908-1/_tmp.-ext-10002/bucket_00000
with a size 610 exists. Trying to delete it.
> {code}
> 2. Results in one original file
>  OrcFileMergeOperator merges the results of the FSOp into 1 file named 000000_0.
> h4. Fix
> 1. FSOp will store data as: taskid/bucketId. e.g. 0_0/bucket_0
> 2. OrcMergeFileOp, instead of merging a bunch of files into 1 file named 000000_0, will
merge all files named bucket_0 into one file named bucket_0, and so on.
> 3. MoveTask will get rid of the taskId directories if present and only move the bucket
files in them, in case OrcMergeFileOp is not run.

This message was sent by Atlassian Jira

View raw message