hudi-dev mailing list archives

From Shawy Geng <>
Subject Re: [DISCUSS] Hash Index for HUDI
Date Wed, 16 Jun 2021 16:01:58 GMT
Combining the bucket index and bloom filter is a great idea. There is no conflict between the two
in implementation, and the bloom filter info can still be stored in the file for faster record positioning.
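The combination described here can be sketched in a few lines: the hash routes a key to exactly one bucket, and the bloom filter stored in each file is then consulted to skip files within that bucket. This is a hypothetical illustration, not Hudi's actual code; all names are assumed, and the bloom filter is stood in for by a plain set (which, like a bloom filter, has no false negatives).

```python
# Sketch of bucket pruning followed by a per-file bloom check.
# All names are illustrative, not Hudi's actual API.

def bucket_id(key_hash: int, num_buckets: int) -> int:
    # The hash index maps a record key to exactly one bucket.
    return key_hash % num_buckets

def candidate_files(key, key_hash, files, num_buckets):
    # 1) Bucket pruning: only files in the key's bucket can contain it.
    bucket = bucket_id(key_hash, num_buckets)
    in_bucket = [f for f in files if f["bucket"] == bucket]
    # 2) Bloom pruning: the filter stored in each file rules out files
    #    that definitely do not contain the key.
    return [f for f in in_bucket if key in f["bloom"]]
```

Because the bucket check is a single modulo, it prunes before any filter is even loaded; the bloom filter only has to discriminate among the few files inside one bucket.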


> On Jun 9, 2021, at 16:23, Thiru Malai <> wrote:
> Hi,
> This feature seems promising. If we are planning to assign the file group ID as the hash
> mod value, then we can leverage this change in the Bloom Index as well by pruning the files
> based on the hash mod value before the min/max record_key pruning. The exploded RDD will then
> be comparatively smaller, which will eventually reduce the shuffle size in the "Compute all
> comparisons needed between records and files" stage.
> Can we add this hash-based indexing approach to the bloom-filter-based approach as well?
> On 2021/06/07 03:26:34, Danny Chan <> wrote: 
>>> number of buckets expanded by multiple is recommended
>> The condition is too harsh, and the bucket number would grow
>> exponentially.
>>> with hash index can be solved by using multiple file groups per bucket as
>> mentioned in the RFC
>> The relation between file groups and buckets would be too complicated; we
>> should avoid that. It also requires the query engine to be aware of the
>> bucketing rules, which is not that transparent and is not a common query
>> optimization.
>> Best,
>> Danny Chan
>> On Fri, Jun 4, 2021 at 6:06 PM, 耿筱喻 <> wrote:
>>> Thank you for your questions.
>>> For the first question, the number of buckets expanded by mutiple is
>>> recommended. Combine rehashing and clustering to re-distribute the data
>>> without shuffling. For example, 2 buckets expands to 4 by splitting the 1st
>>> bucket and rehashing data in it to two small buckets: 1st and 3st bucket.
>>> Details have been supplied to the RFC.
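The doubling scheme described above can be illustrated with a short sketch (hypothetical code, not taken from the RFC; all names are assumed). A record with hash `h` in bucket `h % n` lands, after doubling, in either `h % 2n == b` or `b + n`, so only the split bucket's data needs rewriting:

```python
# Illustrative sketch of doubling-based bucket expansion.

def split_bucket(records, old_bucket: int, old_num_buckets: int):
    """Rehash one old bucket's records into its two children.

    records is a list of (key_hash, record) pairs, all satisfying
    key_hash % old_num_buckets == old_bucket. After the bucket count
    doubles, each record moves to either the same bucket index or to
    old_bucket + old_num_buckets; no other bucket is touched.
    """
    new_num = old_num_buckets * 2
    children = {old_bucket: [], old_bucket + old_num_buckets: []}
    for key_hash, record in records:
        children[key_hash % new_num].append((key_hash, record))
    return children
```

With 2 buckets growing to 4, bucket 1 splits into buckets 1 and 3, matching the example in the reply: no shuffle of the other bucket is needed.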
>>> For the second one, data skew when writing to Hudi with the hash index can be
>>> solved by using multiple file groups per bucket, as mentioned in the RFC. For
>>> a data processing engine like Spark, data skew during table joins can be solved
>>> by splitting the skewed partition into smaller units and distributing them
>>> to different tasks to execute; this works in scenarios with a fixed SQL
>>> pattern. Besides, the data skew solution needs more effort to be
>>> compatible with the bucket join rule. However, the read and write long tail
>>> caused by data skew in SQL queries is hard to solve.
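The multiple-file-groups idea mentioned above could look roughly like this (a hypothetical sketch; the RFC's actual layout and naming may differ): a hot bucket keeps several file groups, and a record is pinned to one of them by a second hash, so upserts for the same key always land in the same file group.

```python
import hashlib

def file_group_in_bucket(record_key: str, bucket: int, groups_per_bucket: int) -> str:
    # A second, independent hash spreads a hot bucket's records over
    # several file groups while keeping each key's placement stable,
    # which upserts require. The naming scheme is made up for illustration.
    digest = hashlib.md5(record_key.encode("utf-8")).hexdigest()
    slot = int(digest, 16) % groups_per_bucket
    return f"bucket-{bucket:05d}-fg-{slot}"
```

The write path stays deterministic per key, while read-side pruning still only needs the bucket id; the extra file groups are invisible to the bucket-routing rule.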
>>> Regards,
>>> Shawy
>>>> On Jun 3, 2021, at 10:47, Danny Chan <> wrote:
>>>> Thanks for the new feature, very promising ~
>>>> Some confusion about the *Scalability* and *Data Skew* part:
>>>> How do we expand the number of existing buckets? Say we had 100
>>>> buckets before but 120 buckets now; what is the algorithm?
>>>> About the data skew, did you mean there is no good solution to this
>>>> problem now?
>>>> Best,
>>>> Danny Chan
>>>> On Wed, Jun 2, 2021 at 10:42 PM, 耿筱喻 <> wrote:
>>>>> Hi,
>>>>> Currently, Hudi's index implementation is pluggable and provides two
>>>>> options: bloom filter and HBase. When a Hudi table becomes large, the
>>>>> performance of the bloom filter degrades drastically due to the increase
>>>>> in the false positive probability.
>>>>> A hash index is an efficient, lightweight approach to address this
>>>>> performance issue. It is used in Hive, where it is called Bucket: it
>>>>> clusters records whose keys have the same hash value under a unique hash
>>>>> function. This pre-distribution can accelerate SQL queries in some
>>>>> scenarios. Besides, Bucket in Hive offers efficient sampling.
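The Hive-style bucketing described above reduces to one deterministic rule, sketched below in illustrative Python (not Hive's or Hudi's implementation): records with equal key hashes always land in the same bucket, which is also why reading a single bucket yields a hash-based sample of the table.

```python
from collections import defaultdict

def cluster_into_buckets(records, num_buckets: int):
    # Deterministically cluster (key, value) records by
    # hash(key) mod num_buckets -- the rule behind Hive's Bucket
    # and the proposed Hudi hash index.
    buckets = defaultdict(list)
    for key, value in records:
        buckets[hash(key) % num_buckets].append((key, value))
    return buckets
```

Because the mapping is a pure function of the key, two tables bucketed the same way can be joined bucket-by-bucket without a shuffle.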
>>>>> I have created an RFC for this.
>>>>> Feel free to discuss under this thread; suggestions are welcome.
>>>>> Regards,
>>>>> Shawy
