hudi-dev mailing list archives

From Shawy Geng <>
Subject Re: [DISCUSS] Hash Index for HUDI
Date Wed, 16 Jun 2021 15:54:42 GMT
Thank you for your questions and advice.

Unlike RFC-08, this proposal does not introduce an HFile to store the mapping between a record
and its location. Having one file group per bucket is one of the options. For that option,
using the bucket id as the file group id is a great idea. This part of the RFC has been
updated.
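
A minimal sketch of the one-file-group-per-bucket option and of expansion by a multiple, under stated assumptions: the hash (CRC32), the ID layout, and the names bucket_id, file_group_id and split_targets are hypothetical and are not taken from the RFC or from the Hudi code base.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket with a stable hash modulo the bucket count."""
    # CRC32 is an assumption made for this sketch; the RFC may pick another hash.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_group_id(partition_path: str, record_key: str, num_buckets: int) -> str:
    """Hypothetical file group ID: a partition-path hash plus the bucket id."""
    # The partition hash keeps IDs distinct across partitions of the same table.
    partition_hash = zlib.crc32(partition_path.encode("utf-8"))
    return f"{partition_hash:08x}-{bucket_id(record_key, num_buckets):08d}"


def split_targets(old_bucket: int, old_num_buckets: int, factor: int) -> list:
    """Buckets that can receive records of old_bucket after growing by `factor`."""
    # If h % n == b, then h % (factor * n) is one of b, b + n, b + 2n, ...
    return [old_bucket + i * old_num_buckets for i in range(factor)]


if __name__ == "__main__":
    key = "uuid-0001"
    print(file_group_id("2021/06/16", key, 4))
    # The new bucket of a key is always among the split targets of its old bucket.
    print(bucket_id(key, 4) in split_targets(bucket_id(key, 2), 2, 2))
```

With zero-based ids, split_targets(0, 2, 2) returns [0, 2]: the 1st of 2 buckets splits into the 1st and 3rd of 4, so each old bucket can be rewritten on its own without a shuffle across buckets.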


> On June 4, 2021, at 21:24, Vinoth Chandar <> wrote:
> Thanks for opening the RFC! At first glance, it seemed similar to RFC-08,
> but the proposal seems to be adding a bucket id to each file group ID?
> If I may suggest, we should call this BucketedIndex?
> Instead of changing the existing file name, can we simply assign the
> filegroupID as the hash mod value? i.e., just make the fileGroupIDs 0 to
> numBuckets-1 (with some hash value of the partition path as well, for
> uniqueness across the table)?
> This way it is a localized change, not a major change in how we name
> files/objects?
> I will review the RFC more carefully, early next week.
> Thanks
> Vinoth
> On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻 <> wrote:
>> Thank you for your questions.
>> For the first question, expanding the number of buckets by a multiple is
>> recommended. Rehashing can be combined with clustering to redistribute the
>> data without shuffling. For example, 2 buckets expand to 4 by splitting the 1st
>> bucket and rehashing its data into two smaller buckets: the 1st and 3rd buckets.
>> Details have been added to the RFC.
>> For the second one, data skew when writing to Hudi with the hash index can be
>> solved by using multiple file groups per bucket, as mentioned in the RFC. For a
>> data processing engine like Spark, data skew in table joins can be solved
>> by splitting the skewed partition into smaller units and distributing them
>> to different tasks for execution; this works in some scenarios that have a
>> fixed SQL pattern. Besides, a data skew solution needs more effort to be
>> compatible with the bucket join rule. However, the long tail in reads and
>> writes caused by data skew in SQL queries is hard to solve.
>> Regards,
>> Shawy
>>> On June 3, 2021, at 10:47, Danny Chan <> wrote:
>>> Thanks for the new feature, very promising ~
>>> Some confusion about the *Scalability* and *Data Skew* part:
>>> How do we expand the number of existing buckets? Say we had 100
>>> buckets before but have 120 buckets now; what is the algorithm?
>>> About the data skew, did you mean there is no good solution to this
>>> problem now?
>>> Best,
>>> Danny Chan
>>> On Wed, Jun 2, 2021 at 10:42 PM 耿筱喻 <> wrote:
>>>> Hi,
>>>> Currently, the Hudi index implementation is pluggable and provides two
>>>> options: bloom filter and HBase. When a Hudi table becomes large, the
>>>> performance of the bloom filter degrades drastically due to the increase in
>>>> false positive probability.
>>>> A hash index is an efficient, lightweight approach to address this
>>>> performance issue. Hive uses it under the name Bucket: it clusters the
>>>> records whose keys have the same hash value under a single hash function.
>>>> This pre-distribution can accelerate SQL queries in some scenarios.
>>>> Besides, bucketing in Hive enables efficient sampling.
>>>> I have written an RFC for this.
>>>> Feel free to discuss in this thread; suggestions are welcome.
>>>> Regards,
>>>> Shawy
