hudi-dev mailing list archives

From Gary Li <yanjia.gary...@gmail.com>
Subject Re: [DISCUSS] Hash Index for HUDI
Date Thu, 03 Jun 2021 02:15:32 GMT
+1. A hash index is very efficient for CDC data with random updates, and it
is also friendly to streaming ingestion. Looking forward to this feature!

Best,
Gary

On Thu, Jun 3, 2021 at 1:51 AM Satish Kotha <satishkotha@uber.com.invalid>
wrote:

> +1. You may want to read this thread as well:
> <http://mail-archives.apache.org/mod_mbox/hudi-dev/202102.mbox/%3CCADNkrp7MHbNH_s2Svyo%2B56xJu-v7knzgE9sed8MrWAXQC3LQCw%40mail.gmail.com%3E>
> There are minor differences between these threads, but the high-level
> idea is similar.
>
> On Wed, Jun 2, 2021 at 7:42 AM 耿筱喻 <gengxiaoyu1996@gmail.com> wrote:
>
> > Hi,
> > Currently, the Hudi index implementation is pluggable and provides two
> > options: Bloom filter and HBase. When a Hudi table becomes large, the
> > performance of the Bloom filter index degrades drastically because the
> > false positive probability increases.
> >
> > A hash index is an efficient, lightweight approach to addressing this
> > performance issue. Hive uses the same idea under the name "bucket": it
> > clusters records whose keys have the same hash value under a given hash
> > function. This pre-distribution can accelerate SQL queries in some
> > scenarios. Bucketing in Hive also supports efficient sampling.
> >
> > I have written an RFC for this:
> > https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >
> > Feel free to discuss it in this thread; suggestions are welcome.
> >
> > Regards,
> > Shawy
>
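
For readers unfamiliar with bucketing, the idea in the quoted proposal can be sketched roughly as follows. This is only an illustration, not the design in RFC-29 and not Hudi or Hive code: the bucket count, the choice of CRC32 as the hash function, and the names `NUM_BUCKETS`, `bucket_id` and `assign_buckets` are all assumptions made for the example.

```python
import zlib

# Hypothetical bucket count, fixed per table for this sketch.
NUM_BUCKETS = 8


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket by hashing it.

    CRC32 is used here only because it is stable across processes
    (Python's built-in hash() is salted per run); it is an assumption,
    not the hash function any real system is known to use.
    """
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def assign_buckets(keys, num_buckets: int = NUM_BUCKETS):
    """Group record keys by bucket id, mimicking the pre-distribution step."""
    buckets = {}
    for key in keys:
        buckets.setdefault(bucket_id(key, num_buckets), []).append(key)
    return buckets


if __name__ == "__main__":
    layout = assign_buckets([f"user-{i}" for i in range(20)])
    for bucket, members in sorted(layout.items()):
        print(bucket, members)
```

Because the bucket is a pure function of the key, locating an updated record needs no filter probe or external lookup: the writer recomputes `bucket_id` and goes straight to that bucket, which is why the approach suits random updates.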
