hive-issues mailing list archives

From "JithendhiraKumar (Jira)" <>
Subject [jira] [Commented] (HIVE-22098) Data loss occurs when multiple tables are join with different bucket_version
Date Tue, 11 Feb 2020 13:40:00 GMT


JithendhiraKumar commented on HIVE-22098:

[~luguangming] Was there any progress on the patch being made available in master?

> Data loss occurs when multiple tables are join with different bucket_version
> ----------------------------------------------------------------------------
>                 Key: HIVE-22098
>                 URL:
>             Project: Hive
>          Issue Type: Bug
>          Components: Operators
>    Affects Versions: 3.1.0
>            Reporter: LuGuangMing
>            Assignee: LuGuangMing
>            Priority: Major
>         Attachments: HIVE-22098.1.patch, image-2019-08-12-18-45-15-771.png, join_test.sql,
table_a_data.orc, table_b_data.orc, table_c_data.orc
> When tables with different bucketVersion values are joined and the number of reducers is
greater than 2, the result can easily lose data.
> *Scenario 1*: Three tables are joined. The intermediate result of joining table_a with
table_b is recorded as tmp_a_b. When tmp_a_b is then joined with the third table, the third
table has bucket_version=2 (the default for tables created since hive-3.0.0), while the temporary
data tmp_a_b is initialized with bucketVersion=-1, so its ReduceSinkOperator takes part in the
join with bucketVersion=-1. In the init method, the hash algorithm for the join column is selected
according to bucketVersion. If bucketVersion = 2 and the operation is not an ACID operation, the
new hash algorithm is used. Otherwise, the old hash algorithm is used. Because the two sides use
different hash algorithms, the same key is assigned to different partitions. At the reducer
stage, rows with the same key cannot be paired, resulting in data loss.
> *Scenario 2*: Create two test tables: create table table_bucketversion_1(col_1 string,
col_2 string) TBLPROPERTIES ('bucketing_version'='1'); create table table_bucketversion_2(col_1 string,
col_2 string) TBLPROPERTIES ('bucketing_version'='2');
> When table_bucketversion_1 is joined with table_bucketversion_2, part of the result data
is lost because the bucketVersion values differ.
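
The mechanism described in both scenarios can be illustrated with a small simulation. This is
only a sketch under stated assumptions, not Hive code: old_hash and new_hash are hypothetical
stand-ins for the old and new hash algorithms (a simple polynomial hash and CRC32, neither taken
from Hive), and reduce_side_join is a toy model of a reduce-side join with 3 reducers. It shows
that when the two join inputs are partitioned with different hash functions, rows with the same
key can land on different reducers and silently drop out of the result.

```python
import zlib

NUM_REDUCERS = 3  # more than 2 reducers, as in the report


def old_hash(key: str) -> int:
    """Hypothetical stand-in for the old hash: a simple polynomial hash."""
    h = 0
    for ch in key:
        h = (31 * h + ord(ch)) & 0x7FFFFFFF
    return h


def new_hash(key: str) -> int:
    """Hypothetical stand-in for the new hash (CRC32, purely illustrative)."""
    return zlib.crc32(key.encode("utf-8"))


def partition(rows, hash_fn):
    """Route each (key, value) row to reducer hash_fn(key) % NUM_REDUCERS."""
    reducers = [[] for _ in range(NUM_REDUCERS)]
    for key, value in rows:
        reducers[hash_fn(key) % NUM_REDUCERS].append((key, value))
    return reducers


def reduce_side_join(left, right, left_hash, right_hash):
    """Inner join: rows pair up only if they meet in the same reducer."""
    out = []
    left_parts = partition(left, left_hash)
    right_parts = partition(right, right_hash)
    for l_rows, r_rows in zip(left_parts, right_parts):
        for lk, lv in l_rows:
            for rk, rv in r_rows:
                if lk == rk:
                    out.append((lk, lv, rv))
    return sorted(out)


keys = [f"k{i}" for i in range(20)]
left = [(k, "L") for k in keys]
right = [(k, "R") for k in keys]

# Both sides use the same hash: every key finds its partner.
consistent = reduce_side_join(left, right, new_hash, new_hash)
# The sides use different hashes: only keys that happen to agree survive.
mixed = reduce_side_join(left, right, old_hash, new_hash)

print("rows with consistent hashing:", len(consistent))
print("rows with mixed hashing:     ", len(mixed))
```

With a consistent hash every key is matched. With mixed hashes only the keys for which both
functions happen to pick the same reducer appear in the output, which mirrors the partial
result described above.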

This message was sent by Atlassian Jira