hive-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sungwoo (Jira)" <j...@apache.org>
Subject [jira] [Commented] (HIVE-23114) Insert overwrite with dynamic partitioning is not working correctly with direct insert
Date Tue, 07 Apr 2020 22:25:00 GMT

    [ https://issues.apache.org/jira/browse/HIVE-23114?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17077612#comment-17077612
] 

Sungwoo commented on HIVE-23114:
--------------------------------

Hello [~kuczoram],

I tried this combination: HIVE-21164 (ffee30) + HIVE-22832 + HIVE-23114.2.patch. I tested
with 100GB dataset, and the ORC database is now loaded okay. You can see that the size of
catalog_returns is 911MB.

{code:sh}
$ hdfs dfs -du -h /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/
7.6 K    22.9 K   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/call_center
722.2 K  2.1 M    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_page
911.1 M  2.7 G    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_returns
9.0 G    27.0 G   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/catalog_sales
73.9 M   221.8 M  /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/customer
12.4 M   37.3 M   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/customer_address
137.1 K  411.3 K  /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/customer_demographics
348.5 K  1.0 M    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/date_dim
1.5 K    4.4 K    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/household_demographics
881      2.6 K    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/income_band
846.0 M  2.5 G    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/inventory
13.5 M   40.5 M   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/item
30.6 K   91.9 K   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/promotion
1.4 K    4.2 K    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/reason
1.8 K    5.3 K    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/ship_mode
27.4 K   82.2 K   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/store
1.4 G    4.2 G    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/store_returns
13.1 G   39.3 G   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/store_sales
125.1 K  375.4 K  /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/time_dim
2.9 K    8.7 K    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/warehouse
22.7 K   68.1 K   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_page
437.8 M  1.3 G    /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_returns
4.8 G    14.4 G   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_sales
6.3 K    18.8 K   /tmp/hivemr3/warehouse/tpcds_bin_partitioned_orc_100.db/web_site
{code}

Let me report the result of running TPC-DS queries later. (There is a problem with my setup
for testing.)

> Insert overwrite with dynamic partitioning is not working correctly with direct insert
> --------------------------------------------------------------------------------------
>
>                 Key: HIVE-23114
>                 URL: https://issues.apache.org/jira/browse/HIVE-23114
>             Project: Hive
>          Issue Type: Bug
>            Reporter: Marta Kuczora
>            Assignee: Marta Kuczora
>            Priority: Major
>         Attachments: HIVE-23114.1.patch, HIVE-23114.2.patch
>
>
> This is a follow-up Jira for the [conversation|https://issues.apache.org/jira/browse/HIVE-21164?focusedCommentId=17059280&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-17059280]
in HIVE-21164
>  Doing an insert overwrite from a multi-insert statement with dynamic partitioning will
give wrong results for ACID tables when 'hive.acid.direct.insert.enabled' is true or for insert-only
tables.
> Reproduction:
> {noformat}
> set hive.acid.direct.insert.enabled=true;
> set hive.support.concurrency=true;
> set hive.txn.manager=org.apache.hadoop.hive.ql.lockmgr.DbTxnManager;
> set hive.vectorized.execution.enabled=false;
> set hive.stats.autogather=false;
> create external table multiinsert_test_text (a int, b int, c int) stored as textfile;
> insert into multiinsert_test_text values (1111, 11, 1111), (2222, 22, 1111), (3333, 33,
2222), (4444, 44, NULL), (5555, 55, NULL);
> create table multiinsert_test_acid (a int, b int) partitioned by (c int) stored as orc
tblproperties('transactional'='true');
> create table multiinsert_test_mm (a int, b int) partitioned by (c int) stored as orc
tblproperties('transactional'='true', 'transactional_properties'='insert_only');
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_acid partition (c)
> select
>  a.a,
>  a.b,
>  a.c
>  where a.c is not null
> insert overwrite table multiinsert_test_acid partition (c)
> select
>  a.a,
>  a.b,
>  a.c
> where a.c is null;
> select * from multiinsert_test_acid;
> from multiinsert_test_text a
> insert overwrite table multiinsert_test_mm partition (c)
> select
>  a.a,
>  a.b,
>  a.c
>  where a.c is not null
> insert overwrite table multiinsert_test_mm partition (c)
> select
>  a.a,
>  a.b,
>  a.c
> where a.c is null;
> select * from multiinsert_test_mm;
> {noformat}
> The result of these steps can be different, it depends on the execution order of the
FileSinkOperators of the insert overwrite statements. It can happen that an error occurs due
to manifest file collision, it can happen that no error occurs but the result will be incorrect.
>  Running the same insert query with an external table of with and ACID table with 'hive.acid.direct.insert.enabled=false'
will give the follwing result:
> {noformat}
> 1111    11      1111
> 2222    22      1111
> 3333    33      2222
> 4444    44      NULL
> 5555    55      NULL
> {noformat}



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message