hive-issues mailing list archives

From "Eugene Koifman (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (HIVE-17138) FileSinkOperator doesn't create empty files for acid path
Date Thu, 20 Jul 2017 20:27:00 GMT

     [ https://issues.apache.org/jira/browse/HIVE-17138?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Eugene Koifman updated HIVE-17138:
----------------------------------
    Description: 
For bucketed tables, FileSinkOperator is expected (in some cases) to produce a specific number
of files even if they are empty.
FileSinkOperator.closeOp(boolean abort) has logic to create files even if empty.
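
As a rough illustration of what "create files even if empty" means at close time, here is a toy sketch. It is not the actual closeOp() code: EmptyFilePaddingSketch, padMissingBuckets() and the bucket_%05d file name pattern are made up for the example, and it writes zero-length placeholders, whereas an empty Acid bucket file still carries ORC metadata.

```java
import java.io.IOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Set;
import java.util.TreeSet;

/** Toy model only: not Hive code. Pads a directory with empty files at close time. */
final class EmptyFilePaddingSketch {

    /**
     * Creates an empty file for every bucket in [0, numBuckets) that has no file yet.
     * Returns the bucket ids that were padded. The file name pattern is hypothetical.
     */
    static Set<Integer> padMissingBuckets(Path dir, int numBuckets) throws IOException {
        Set<Integer> padded = new TreeSet<>();
        for (int b = 0; b < numBuckets; b++) {
            Path file = dir.resolve(String.format("bucket_%05d", b));
            if (!Files.exists(file)) {
                Files.createFile(file); // zero-length placeholder
                padded.add(b);
            }
        }
        return padded;
    }

    public static void main(String[] args) throws IOException {
        Path dir = Files.createTempDirectory("padding-sketch");
        Files.createFile(dir.resolve("bucket_00001")); // pretend bucket 1 received rows
        System.out.println("padded: " + padMissingBuckets(dir, 4)); // padded: [0, 2, 3]
    }
}
```

With 4 expected buckets and only bucket 1 written, the sketch pads buckets 0, 2 and 3 so that the directory ends up with the expected number of files.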

This doesn't work properly for the Acid path.  For Insert, the OrcRecordUpdater(s) are set up in
createBucketForFileIdx(), which creates the actual bucketN file (as of HIVE-14007, it does
so regardless of whether the RecordUpdater sees any rows).  This causes empty (i.e., ORC metadata
only) bucket files to be created for multiFileSpray=true if a particular FileSinkOperator.process()
sees at least 1 row.  For example,
{noformat}
create table fourbuckets (a int, b int) clustered by (a) into 4 buckets stored as orc TBLPROPERTIES
('transactional'='true');
insert into fourbuckets values(0,1),(1,1);
-- run with mapreduce.job.reduces = 1 or 2
{noformat}

For the Update/Delete path, OrcRecordWriter is created lazily when the first row that needs to land
there is seen.  Thus it never creates empty buckets, no matter what the value of _skipFiles_
in closeOp(boolean) is.
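
To make the difference between the two paths concrete, here is a minimal toy model. It is not Hive code: BucketFileSketch, eagerFiles() and lazyFiles() are hypothetical names, and assigning a row to bucket (key mod numBuckets) is a simplifying assumption. It models a single sink responsible for all buckets, which roughly corresponds to the one-reducer case in the example above.

```java
import java.util.Arrays;
import java.util.List;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy model only: not Hive code. Tracks which bucket "files" a single sink would create. */
final class BucketFileSketch {

    /** Eager strategy (Insert-like): once any row arrives, a file is opened for every bucket. */
    static SortedSet<Integer> eagerFiles(int numBuckets, List<Integer> keys) {
        SortedSet<Integer> created = new TreeSet<>();
        if (!keys.isEmpty()) {
            for (int b = 0; b < numBuckets; b++) {
                created.add(b);
            }
        }
        return created;
    }

    /** Lazy strategy (Update/Delete-like): a file is opened only when a bucket gets its first row. */
    static SortedSet<Integer> lazyFiles(int numBuckets, List<Integer> keys) {
        SortedSet<Integer> created = new TreeSet<>();
        for (int key : keys) {
            created.add(Math.floorMod(key, numBuckets)); // assumed bucket assignment
        }
        return created;
    }

    public static void main(String[] args) {
        List<Integer> keys = Arrays.asList(0, 1); // the "a" values from the insert above
        System.out.println("eager: " + eagerFiles(4, keys)); // eager: [0, 1, 2, 3]
        System.out.println("lazy:  " + lazyFiles(4, keys));  // lazy:  [0, 1]
    }
}
```

In this model the eager strategy leaves buckets 2 and 3 as empty files, while the lazy strategy never produces a file for a bucket that received no rows.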

Once Split Update does the split early (in the operator pipeline), only the Insert path will matter,
since base and delta are the only files that split computation, etc. look at.  delete_delta is
only for Acid internals, so there is never any reason to create empty files there.


  was:
For bucketed tables, FileSinkOperator is expected (in some cases) to produce a specific number
of files even if they are empty.
FileSinkOperator.closeOp(boolean abort) has logic to create files even if empty.

This doesn't work properly for the Acid path.  For Insert, the OrcRecordUpdater(s) are set up in
createBucketForFileIdx(), which creates the actual bucketN file (as of HIVE-14007, it does
so regardless of whether the RecordUpdate sees any rows).  This causes empty (i.e., ORC metadata
only) bucket files to be created.  For example,
{noformat}
create table fourbuckets (a int, b int) clustered by (a) into 4 buckets stored as orc TBLPROPERTIES
('transactional'='true');
insert into fourbuckets values(0,1),(1,1);
{noformat}

For the Update/Delete path, OrcRecordWriter is created lazily when the first row that needs to land
there is seen.  Thus it never creates empty buckets, no matter what the value of _skipFiles_
in closeOp(boolean) is.

Once Split Update does the split early (in the operator pipeline), only the Insert path will matter,
since base and delta are the only files that split computation, etc. look at.  delete_delta is
only for Acid internals, so there is never any reason to create empty files there.



> FileSinkOperator doesn't create empty files for acid path
> ---------------------------------------------------------
>
>                 Key: HIVE-17138
>                 URL: https://issues.apache.org/jira/browse/HIVE-17138
>             Project: Hive
>          Issue Type: Bug
>          Components: Transactions
>    Affects Versions: 2.2.0
>            Reporter: Eugene Koifman
>            Assignee: Eugene Koifman



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
