hudi-dev mailing list archives

From 耿筱喻 <gengxiaoyu1...@gmail.com>
Subject Re: [DISCUSS] Hash Index for HUDI
Date Fri, 04 Jun 2021 10:05:32 GMT
Thank you for your questions.

For the first question, we recommend expanding the number of buckets by a multiple of the
current count. Rehashing can then be combined with clustering to redistribute the data without
shuffling. For example, 2 buckets expand to 4 by splitting the 1st bucket and rehashing its
data into two smaller buckets: the 1st and the 3rd. Details have been added to the RFC.
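
To make the split concrete, here is a minimal sketch. It is not the RFC's implementation: it
assumes the bucket id is computed as hash(key) mod the number of buckets, uses crc32 as a
stand-in hash, and the names bucket_id and split_bucket are made up for illustration. Under
that assumption, doubling the count sends every record of bucket b to either b or
b + old_count, so each bucket can be split on its own.

```python
import zlib


def bucket_id(key: str, num_buckets: int) -> int:
    # crc32 is only a stand-in hash (assumption); it is deterministic across runs.
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def split_bucket(records, old_bucket, old_num_buckets):
    """Rehash the records of one bucket after the bucket count is doubled."""
    new_num_buckets = old_num_buckets * 2
    # With modulo bucketing, only these two targets are possible.
    result = {old_bucket: [], old_bucket + old_num_buckets: []}
    for key in records:
        result[bucket_id(key, new_num_buckets)].append(key)
    return result


keys = [f"key-{i}" for i in range(1000)]

# Start with 2 buckets (0-based ids 0 and 1, i.e. the "1st" and "2nd").
old = {0: [], 1: []}
for k in keys:
    old[bucket_id(k, 2)].append(k)

# Expand to 4: the 1st bucket (id 0) splits into the 1st and 3rd (ids 0 and 2).
new_first = split_bucket(old[0], 0, 2)
```

Records in the 2nd bucket are not touched by this split, which is why no global shuffle is
needed in this sketch.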

For the second one, data skew when writing to Hudi with the hash index can be solved by using
multiple file groups per bucket, as mentioned in the RFC. For data processing engines like
Spark, data skew in table joins can be solved by splitting the skewed partition into smaller
units and distributing them to different tasks; this works in some scenarios that have a
fixed SQL pattern. Besides, a data skew solution needs more effort to stay compatible with
the bucket join rule. However, the long tail in reads and writes caused by data skew in SQL
queries is hard to solve.
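
As a rough illustration of splitting a skewed partition into smaller units, the sketch below
spreads the records of a known hot key over several tasks. It is only an assumption-laden
example, not Spark's or Hudi's mechanism: assign_tasks, hot_keys and fanout are hypothetical
names, and crc32 again stands in for the real hash.

```python
import zlib
from collections import Counter


def assign_tasks(keys, num_tasks, hot_keys, fanout):
    """Return one task id per record; hot keys are spread over `fanout` tasks."""
    seen = Counter()  # how many records of each hot key were assigned so far
    tasks = []
    for key in keys:
        base = zlib.crc32(key.encode("utf-8")) % num_tasks
        if key in hot_keys:
            # Round-robin offset splits the hot key into `fanout` smaller units.
            offset = seen[key] % fanout
            seen[key] += 1
            tasks.append((base + offset) % num_tasks)
        else:
            tasks.append(base)
    return tasks


keys = ["hot"] * 8 + ["a", "b"]
tasks = assign_tasks(keys, num_tasks=4, hot_keys={"hot"}, fanout=4)

# Load of the hot key per task after splitting.
load = Counter(tasks[:8])
```

Because records with the same key no longer land in a single task, the other side of a join
has to be sent to all of those tasks as well, which is one reason this kind of splitting does
not fit the bucket join rule directly.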

Regards,
Shawy

> On Jun 3, 2021, at 10:47, Danny Chan <danny0405@apache.org> wrote:
> 
> Thanks for the new feature, very promising ~
> 
> Some confusion about the *Scalability* and *Data Skew* part:
> 
> How do we expand the number of existing buckets? Say we had 100
> buckets before but have 120 buckets now, what is the algorithm?
> 
> About the data skew, did you mean there is no good solution to this
> problem now?
> 
> Best,
> Danny Chan
> 
> 耿筱喻 <gengxiaoyu1996@gmail.com> wrote on Wed, Jun 2, 2021 at 10:42 PM:
> 
>> Hi,
>> Currently, the Hudi index implementation is pluggable and provides two
>> options: bloom filter and HBase. When a Hudi table becomes large, the
>> performance of the bloom filter degrades drastically due to the increase
>> in false positive probability.
>> 
>> Hash index is an efficient, lightweight approach to address the
>> performance issue. It is used in Hive, where it is called Bucket: it
>> clusters records whose keys have the same hash value under a single hash
>> function. This pre-distribution can accelerate SQL queries in some
>> scenarios. Besides, Bucket in Hive offers efficient sampling.
>> 
>> I have written an RFC for this:
>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index.
>> 
>> Feel free to discuss in this thread; suggestions are welcome.
>> 
>> Regards,
>> Shawy

