hudi-dev mailing list archives

From 耿筱喻 <>
Subject [DISCUSS] Hash Index for HUDI
Date Wed, 02 Jun 2021 14:42:25 GMT
Currently, the Hudi index implementation is pluggable and provides two options: bloom filter and
HBase. When a Hudi table becomes large, the performance of the bloom filter index degrades
drastically due to the increase in false positive probability.
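
To illustrate why the false positive probability grows, here is a minimal sketch using the
textbook approximation p = (1 - e^(-k*n/m))^k for a bloom filter with m bits, k hash functions,
and n inserted keys. The filter size and hash count below are hypothetical values chosen for
illustration, not Hudi defaults, and the sketch assumes the filter is not resized as keys are
added.

```python
import math


def bloom_false_positive_rate(num_keys, num_bits, num_hashes):
    """Textbook approximation: p = (1 - e^(-k*n/m))^k."""
    return (1.0 - math.exp(-num_hashes * num_keys / num_bits)) ** num_hashes


# Hypothetical sizing for illustration only (not Hudi's configuration).
NUM_BITS = 1_000_000
NUM_HASHES = 7

# With the filter size fixed, the rate rises as more keys are inserted.
for num_keys in (50_000, 100_000, 200_000, 400_000):
    rate = bloom_false_positive_rate(num_keys, NUM_BITS, NUM_HASHES)
    print(f"{num_keys:>7} keys -> false positive rate ~ {rate:.4f}")
```

Every false positive forces a lookup that turns out to be unnecessary, so a higher rate
translates directly into wasted work.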

A hash index is an efficient, lightweight approach to addressing this performance issue. Hive
uses the same idea under the name Bucket: records whose keys have the same hash value under
a single hash function are clustered together. This pre-distribution can accelerate SQL queries
in some scenarios. In addition, bucketing in Hive enables efficient sampling.
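
The bucketing idea can be sketched as follows. This is only an illustration under stated
assumptions, not the design in the RFC and not Hive's actual hash function: the bucket count,
the use of CRC32, and the helper names bucket_id, build_buckets, and sample_bucket are all
hypothetical.

```python
import zlib

NUM_BUCKETS = 8  # hypothetical bucket count, fixed up front


def bucket_id(record_key, num_buckets=NUM_BUCKETS):
    """Map a record key to a bucket using one fixed hash function.

    CRC32 is used here because it is stable across processes, unlike
    Python's built-in hash() for strings. It is an illustrative choice.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def build_buckets(record_keys, num_buckets=NUM_BUCKETS):
    """Cluster keys so that all keys with the same bucket id sit together."""
    buckets = {b: [] for b in range(num_buckets)}
    for key in record_keys:
        buckets[bucket_id(key, num_buckets)].append(key)
    return buckets


def sample_bucket(buckets, bucket):
    """Sampling by bucket: read one bucket instead of scanning everything."""
    return list(buckets[bucket])


keys = [f"user-{i}" for i in range(1000)]
buckets = build_buckets(keys)

# Locating a key needs only the hash: no probabilistic filter, no false positives.
target = "user-42"
assert target in buckets[bucket_id(target)]

# One bucket serves as a sample of the data set.
sample = sample_bucket(buckets, 0)
```

Because the bucket is computed from the key alone, a lookup goes straight to one bucket, and
its cost does not depend on how many records the table already holds.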

I have written an RFC for this.

Feel free to discuss it in this thread; suggestions are welcome.
