hive-issues mailing list archives

From "Peter Vary (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-22538) RS deduplication does not always enforce hive.optimize.reducededuplication.min.reducer
Date Wed, 27 Nov 2019 13:28:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-22538?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16983503#comment-16983503 ]

Peter Vary commented on HIVE-22538:
-----------------------------------

[~jcamachorodriguez]: I checked one of the failures: TestTxnCommands.testDeleteIn

The problem is with the contents of the delete_delta directories.

After the patch:
{code:java}
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00000

Processing data file file:/tmp/bucket_00000 [length: 686]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________

[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00001
Processing data file file:/tmp/bucket_00001 [length: 698]
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":1,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
{code}
Before the patch:
{code:java}
[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00000

Processing data file file:/tmp/bucket_00000 [length: 686]
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________

[root@nightly7x-unsecure-1 tmp]# hive --orcfiledump -d [..]warehouse/acidtbl/delete_delta_0000003_0000003_0000/bucket_00001
Processing data file file:/tmp/bucket_00001 [length: 698]
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":0,"currentTransaction":3,"row":null}
{"operation":2,"originalTransaction":2,"bucket":536936448,"rowId":1,"currentTransaction":3,"row":null}
________________________________________________________________________________________________________________________
{code}
*Notice the difference in the ordering of the rows!* That is what causes the test failure.
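To make the difference easier to see, here is a minimal Python sketch that checks whether the dumped delete records are in ascending order. It assumes that the expected order is by (originalTransaction, bucket, rowId), which is my reading of the dumps above and not something confirmed in this thread. The helper names are hypothetical, and the bucket decoding assumes the BucketCodec V1 bit layout (3 version bits, 1 reserved bit, 12 bucket-id bits, 4 reserved bits, 12 statement-id bits).

```python
import json


def bucket_id(encoded: int) -> int:
    """Extract the bucket id from the encoded 'bucket' field.

    Assumption: BucketCodec V1 layout, where the bucket id occupies
    the 12 bits starting at bit 16.
    """
    return (encoded >> 16) & 0xFFF


def row_key(line: str):
    """Return the assumed sort key of one dumped delete record."""
    rec = json.loads(line)
    return (rec["originalTransaction"], rec["bucket"], rec["rowId"])


def is_sorted(lines) -> bool:
    """True if the records are in ascending order of the assumed key."""
    keys = [row_key(line) for line in lines]
    return keys == sorted(keys)


# bucket_00000 records copied from the dumps above.
before_patch = [
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}',
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}',
]
after_patch = [
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":1,"currentTransaction":3,"row":null}',
    '{"operation":2,"originalTransaction":2,"bucket":536870912,"rowId":0,"currentTransaction":3,"row":null}',
]

print(is_sorted(before_patch))   # True
print(is_sorted(after_patch))    # False
print(bucket_id(536870912), bucket_id(536936448))  # 0 1
```

Under that assumption, the decoded bucket ids (0 and 1) match the bucket_00000 and bucket_00001 file names, and only the rowId order differs between the two runs.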

The table is created by this command:
{code:sql}
create table acidTbl(a int, b int) clustered by (a) into 2 buckets stored as orc TBLPROPERTIES
('transactional'='true')
{code}
My understanding is that forcing a single reducer guarantees that the rows are ordered
globally inside a bucket. With the patch, when there are multiple reducers, we might
end up concatenating the results of different reducers and thus lose the ordering. (This
last part is more of a question than a statement :))
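A toy Python sketch of the suspected effect, not of Hive's actual execution: it assumes, hypothetically, that rows of the same bucket are split across two reducers, that each reducer sorts only its own rows, and that the outputs are then appended one after the other. The function names and the row distribution are invented for illustration.

```python
def single_reducer(rows):
    """One reducer sees every row of the bucket, so its sort is global."""
    return sorted(rows)


def concatenated_reducers(partitions):
    """Each reducer sorts only its own partition; the outputs are then
    appended in reducer order (hypothetical model of the suspected behavior)."""
    output = []
    for partition in partitions:
        output.extend(sorted(partition))
    return output


# Keys are (originalTransaction, bucket, rowId), taken from bucket_00000.
rows = [(2, 536870912, 1), (2, 536870912, 0)]

# Hypothetical split: rowId 1 lands on the first reducer, rowId 0 on the second.
partitions = [[(2, 536870912, 1)], [(2, 536870912, 0)]]

print(single_reducer(rows))
# [(2, 536870912, 0), (2, 536870912, 1)]
print(concatenated_reducers(partitions))
# [(2, 536870912, 1), (2, 536870912, 0)]
```

With this particular split, the concatenated output happens to reproduce the row order seen after the patch, but that alone does not prove this is what happens inside Hive.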

Any other ideas on how to set the RS traits to force the desired behavior?

Thanks,

Peter

> RS deduplication does not always enforce hive.optimize.reducededuplication.min.reducer
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-22538
>                 URL: https://issues.apache.org/jira/browse/HIVE-22538
>             Project: Hive
>          Issue Type: Bug
>          Components: Physical Optimizer
>            Reporter: Jesus Camacho Rodriguez
>            Assignee: Jesus Camacho Rodriguez
>            Priority: Major
>              Labels: pull-request-available
>         Attachments: HIVE-22538.patch
>
>          Time Spent: 10m
>  Remaining Estimate: 0h
>
> For transactional tables, that property might be overridden to 1, which can lead to merging
> the final aggregation into a single stage (hence leading to performance degradation). For instance,
> when autogather column stats is enabled, this can happen for the following query:
> {code}
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> EXPLAIN
> CREATE TABLE x STORED AS ORC TBLPROPERTIES('transactional'='true') AS
> SELECT * FROM SRC x CLUSTER BY x.key;
> {code}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
