hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-17138) FileSinkOperator doesn't create empty files for acid path
Date Mon, 11 Sep 2017 23:21:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16162213#comment-16162213 ]

Eugene Koifman commented on HIVE-17138:
---------------------------------------

See also TestTxnNoBuckets.testUnionRemove()
{noformat}
    int[][] values = {{1,2},{3,4},{5,6},{7,8},{9,10}};
    runStatementOnDriver("insert into " + TxnCommandsBaseForTests.Table.ACIDTBL + makeValuesClause(values)); // this creates 1 delta_0000013_0000013_0000/bucket_00001
{noformat}
but before the "move" in Hive I see 2 buckets, so we drop the empty bucket 0 somewhere during the move...
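
For readers unfamiliar with the path quoted above, here is a small sketch of how such a name can be assembled. The class and method names are hypothetical and this is not Hive's own utility code; the field meanings (min transaction id, max transaction id, statement id, bucket id) and the zero-padding widths are my reading of the single path in the snippet.

```java
import java.util.Locale;

// Hypothetical helper for illustration only; not part of Hive.
final class AcidPathNames {

    // Assumed layout: delta_<minTxn, 7 digits>_<maxTxn, 7 digits>_<stmtId, 4 digits>
    static String deltaDir(long minTxn, long maxTxn, int stmtId) {
        return String.format(Locale.ROOT, "delta_%07d_%07d_%04d", minTxn, maxTxn, stmtId);
    }

    // Assumed layout: bucket_<bucketId, 5 digits>
    static String bucketFile(int bucketId) {
        return String.format(Locale.ROOT, "bucket_%05d", bucketId);
    }

    public static void main(String[] args) {
        // Rebuilds the path mentioned in the test comment.
        System.out.println(deltaDir(13, 13, 0) + "/" + bucketFile(1));
    }
}
```

With these assumptions, the missing file in the test would be `bucket_00000` inside the same delta directory.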

> FileSinkOperator doesn't create empty files for acid path
> ---------------------------------------------------------
>
>                 Key: HIVE-17138
>                 URL: https://issues.apache.org/jira/browse/HIVE-17138
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman
>
> For bucketed tables, FileSinkOperator is expected (in some cases) to produce a specific number of files even if they are empty.
> FileSinkOperator.closeOp(boolean abort) has logic to create files even if empty.
> This doesn't work properly for the Acid path.  For Insert, the OrcRecordUpdater(s) is set up in createBucketForFileIdx(), which creates the actual bucketN file (as of HIVE-14007, it does so regardless of whether the RecordUpdater sees any rows).  This causes empty (i.e. ORC metadata only) bucket files to be created for multiFileSpray=true if a particular FileSinkOperator.process() sees at least 1 row.  For example,
> {noformat}
> create table fourbuckets (a int, b int) clustered by (a) into 4 buckets stored as orc TBLPROPERTIES ('transactional'='true');
> insert into fourbuckets values(0,1),(1,1);
> with mapreduce.job.reduces = 1 or 2 
> {noformat}
> For the Update/Delete path, OrcRecordWriter is created lazily when the 1st row that needs to land there is seen.  Thus it never creates empty buckets, no matter what the value of _skipFiles_ in closeOp(boolean) is.
> Once Split Update does the split early (in the operator pipeline), only the Insert path will matter, since base and delta are the only files that split computation, etc. look at.  delete_delta is only for Acid internals, so there is never any reason to create empty files there.
> Also make sure to close RecordUpdaters in FileSinkOperator.abortWriters()
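
The difference between the two paths in the description can be shown with a toy model. This is only a sketch and not Hive code: the class and method names are hypothetical, a single sink is assumed to own all buckets (roughly the mapreduce.job.reduces = 1 case), and the bucket is taken as the key modulo the bucket count, which is a simplification.

```java
import java.util.Set;
import java.util.TreeSet;

// Toy model only; names and bucket assignment are assumptions, not Hive's API.
final class BucketFileSketch {

    // Returns the bucket ids that would have a file after one sink processes the keys.
    // eager = true  models the Insert path: the first row sets up a file for every bucket.
    // eager = false models the Update/Delete path: a file appears only when a row lands in it.
    static Set<Integer> filesCreated(int[] keys, int numBuckets, boolean eager) {
        Set<Integer> created = new TreeSet<>();
        for (int key : keys) {
            if (eager && created.isEmpty()) {
                for (int b = 0; b < numBuckets; b++) {
                    created.add(b); // may stay empty (metadata only)
                }
            }
            created.add(Math.floorMod(key, numBuckets)); // assumed bucket assignment
        }
        return created;
    }

    public static void main(String[] args) {
        int[] keys = {0, 1}; // column a of the rows (0,1),(1,1) from the example
        System.out.println("eager: " + filesCreated(keys, 4, true));   // [0, 1, 2, 3]
        System.out.println("lazy:  " + filesCreated(keys, 4, false));  // [0, 1]
        System.out.println("eager, no rows: " + filesCreated(new int[0], 4, true)); // []
    }
}
```

In this model the eager path produces all four files only because the sink saw at least one row; a sink that sees no rows produces nothing, which matches the "sees at least 1 row" condition in the description.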



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
