hudi-dev mailing list archives

From 耿筱喻 <gengxiaoyu1...@gmail.com>
Subject [DISCUSS] Hash Index for HUDI
Date Wed, 02 Jun 2021 14:42:25 GMT
Hi, 
Currently, Hudi's index implementation is pluggable and provides two options: bloom filter and
HBase. When a Hudi table becomes large, the performance of the bloom filter index degrades
drastically due to the increase in false positive probability.
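
To illustrate the false positive growth mentioned above, here is a minimal sketch using the standard approximation for a fixed-size bloom filter, p = (1 - e^(-k*n/m))^k. The function name and the sizes used are hypothetical and chosen only for illustration; they are not Hudi defaults or Hudi code.

```python
import math


def bloom_false_positive_rate(num_entries: int, num_bits: int, num_hashes: int) -> float:
    """Standard approximation of the false positive rate of a bloom filter.

    num_entries: number of keys inserted (n)
    num_bits:    size of the bit array (m)
    num_hashes:  number of hash functions (k)
    """
    return (1.0 - math.exp(-num_hashes * num_entries / num_bits)) ** num_hashes


if __name__ == "__main__":
    # Illustrative sizes only (assumptions, not Hudi settings).
    num_bits = 1_000_000
    num_hashes = 5
    for num_entries in (10_000, 100_000, 1_000_000):
        rate = bloom_false_positive_rate(num_entries, num_bits, num_hashes)
        print(f"entries={num_entries:>9}  false positive rate={rate:.6f}")
```

With the filter size held fixed, the rate rises as more keys are inserted, so more lookups have to fall through to an actual key check.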

A hash index is an efficient, lightweight approach to address this performance issue. Hive uses
the same idea under the name Bucket: records whose keys have the same hash value under a single
hash function are clustered together. This pre-distribution can accelerate SQL queries in some
scenarios. Bucketing in Hive also enables efficient sampling.
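
The bucketing idea above can be sketched roughly as follows. This is only an illustration of hashing a record key to a fixed number of buckets; it is not the design in the RFC. The bucket count, the choice of CRC32 as the hash, and the function names are all assumptions made for the example.

```python
import zlib
from collections import defaultdict
from typing import Dict, Iterable, List

NUM_BUCKETS = 4  # hypothetical bucket count, fixed up front


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket using a deterministic hash (CRC32 here)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def assign_buckets(keys: Iterable[str], num_buckets: int = NUM_BUCKETS) -> Dict[int, List[str]]:
    """Group record keys by bucket id, so equal hash values land together."""
    buckets: Dict[int, List[str]] = defaultdict(list)
    for key in keys:
        buckets[bucket_id(key, num_buckets)].append(key)
    return dict(buckets)


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(10)]
    for bucket, members in sorted(assign_buckets(keys).items()):
        print(bucket, members)
```

Because the bucket is computed directly from the key, locating a record needs no filter probe: the same key always maps to the same bucket, so there are no false positives to resolve.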

I have written an RFC for this: https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index

Feel free to discuss in this thread; suggestions are welcome.

Regards,
Shawy