hudi-dev mailing list archives

From Danny Chan <danny0...@apache.org>
Subject Re: [DISCUSS] Hash Index for HUDI
Date Thu, 03 Jun 2021 02:47:52 GMT
Thanks for the new feature, very promising ~

I have some questions about the *Scalability* and *Data Skew* parts:

How do we expand the number of existing buckets? Say we had 100 buckets
before but have 120 buckets now: what is the algorithm?
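
To make the question concrete, here is a minimal sketch of why the bucket count matters. It assumes a record's bucket is its key hash modulo the number of buckets; that rule, the CRC32 hash, and the function names are assumptions for illustration, not something taken from the RFC.

```python
import zlib


def bucket_of_hash(hash_value: int, num_buckets: int) -> int:
    """Assumed bucketing rule: hash value modulo the bucket count."""
    return hash_value % num_buckets


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket using a stable hash (CRC32, chosen only for illustration)."""
    return bucket_of_hash(zlib.crc32(record_key.encode("utf-8")), num_buckets)


def count_moved(hash_values, old_buckets: int, new_buckets: int) -> int:
    """Count hash values whose bucket changes when the bucket count changes."""
    return sum(
        1
        for h in hash_values
        if bucket_of_hash(h, old_buckets) != bucket_of_hash(h, new_buckets)
    )


# Every residue modulo lcm(100, 120) = 600 appears once, so this covers all cases.
moved = count_moved(range(600), old_buckets=100, new_buckets=120)
total = 600
```

Under this assumed rule, only hash values below 100 modulo 600 keep their bucket, so about five in six records would have to move when going from 100 to 120 buckets. That is why the expansion algorithm matters.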

About the data skew, did you mean that there is currently no good solution
to this problem?

Best,
Danny Chan

耿筱喻 <gengxiaoyu1996@gmail.com> wrote on Wed, Jun 2, 2021 at 10:42 PM:

> Hi,
> Currently, Hudi's index implementation is pluggable and provides two
> options: bloom filter and HBase. When a Hudi table becomes large, the
> performance of the bloom filter degrades drastically due to the increase
> in false positive probability.
>
> A hash index is an efficient, lightweight approach to address this
> performance issue. Hive uses it under the name Bucket: records whose keys
> have the same hash value under a single hash function are clustered
> together. This pre-distribution can accelerate SQL queries in some
> scenarios. Bucket in Hive also offers efficient sampling.
>
> I have written an RFC for this:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index.
>
> Feel free to discuss it in this thread; suggestions are welcome.
>
> Regards,
> Shawy
