hudi-dev mailing list archives

From Danny Chan <danny0...@apache.org>
Subject Re: [DISCUSS] Hash Index for HUDI
Date Mon, 07 Jun 2021 03:26:34 GMT
> number of buckets expanded by multiple is recommended
That condition is too strict: it would force the bucket count to grow
exponentially.

> with hash index can be solved by using multiple file groups per bucket as
mentioned in the RFC
The mapping between file groups and buckets would be too complicated; we
should avoid that. It also requires the query engine to be aware of the
bucketing rules, which is not transparent and is not a common query
optimization.

Best,
Danny Chan

耿筱喻 <gengxiaoyu1996@gmail.com> wrote on Fri, Jun 4, 2021 at 6:06 PM:

> Thank you for your questions.
>
> For the first question, expanding the number of buckets by a multiple is
> recommended. Rehashing is combined with clustering to redistribute the data
> without shuffling. For example, 2 buckets expand to 4 by splitting the 1st
> bucket and rehashing its data into two smaller buckets: the 1st and the 3rd.
> Details have been added to the RFC.
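
The doubling scheme described above can be sketched in a few lines. This is a
toy illustration of the modulo-rehashing idea, not Hudi code; the hash
function and helper names are my own assumptions:

```python
# Sketch of bucket doubling with modulo hashing (illustrative, not Hudi code).
# With hash(k) % n, a record in bucket i (under n buckets) can only land in
# bucket i or bucket i + n after doubling to 2n buckets, so only the split
# bucket's data needs rewriting -- no global shuffle.

def bucket_id(key: str, num_buckets: int) -> int:
    # Stable toy hash; Hudi would use its own hash function.
    return sum(key.encode()) % num_buckets

def split_bucket(records, old_bucket, old_n):
    """Rehash one bucket's records after doubling old_n -> 2 * old_n."""
    new_n = 2 * old_n
    out = {old_bucket: [], old_bucket + old_n: []}
    for key in records:
        out[bucket_id(key, new_n)].append(key)
    return out

keys = [k for k in ("a", "bb", "ccc", "dddd", "eeeee") if bucket_id(k, 2) == 0]
parts = split_bucket(keys, old_bucket=0, old_n=2)
# Every key stays in bucket 0 or moves to bucket 2; none lands elsewhere.
assert set(parts) == {0, 2}
assert sorted(k for ks in parts.values() for k in ks) == sorted(keys)
```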
>
> For the second one, data skew when writing to Hudi with the hash index can
> be solved by using multiple file groups per bucket, as mentioned in the RFC.
> For a data processing engine like Spark, data skew during table joins can be
> solved by splitting the skewed partition into smaller units and distributing
> them to different tasks; this works in scenarios with a fixed SQL pattern.
> Besides, a data skew solution needs more effort to be compatible with the
> bucket join rule. However, the read and write long tail caused by data skew
> in SQL queries is hard to solve.
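
One way to read "multiple file groups per bucket" is that a hot bucket fans
its writes out across several file groups while lookups still probe only that
bucket. The sketch below is my own toy interpretation (the round-robin writer
and all names are assumptions, not the RFC's design):

```python
import itertools

# Toy sketch: a skewed bucket spreads writes over several file groups in
# round-robin, while a key lookup checks every file group of its bucket.
# FILE_GROUPS_PER_BUCKET and the writer cycle are illustrative assumptions.

NUM_BUCKETS = 4
FILE_GROUPS_PER_BUCKET = {
    b: [f"bucket-{b}-fg-{i}" for i in range(3 if b == 0 else 1)]
    for b in range(NUM_BUCKETS)
}
_writers = {b: itertools.cycle(fgs) for b, fgs in FILE_GROUPS_PER_BUCKET.items()}

def bucket_of(key: str) -> int:
    # Stable toy hash; a real index would use its own hash function.
    return sum(key.encode()) % NUM_BUCKETS

def file_group_for_write(key: str) -> str:
    # Hot bucket 0 rotates inserts over 3 file groups; others use 1.
    return next(_writers[bucket_of(key)])

def file_groups_for_lookup(key: str):
    # A key lookup must check every file group of its bucket.
    return FILE_GROUPS_PER_BUCKET[bucket_of(key)]
```

Note the trade-off: lookups in a fanned-out bucket touch all of its file
groups, which is the extra bucket-to-file-group complexity discussed earlier
in this thread.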
>
> Regards,
> Shawy
>
> > On Jun 3, 2021, at 10:47, Danny Chan <danny0405@apache.org> wrote:
> >
> > Thanks for the new feature, very promising ~
> >
> > Some confusion about the *Scalability* and *Data Skew* part:
> >
> > How do we expand the number of existing buckets? Say we have 100
> > buckets before but need 120 now; what is the algorithm?
> >
> > About the data skew, did you mean there is no good solution to this
> > problem right now?
> >
> > Best,
> > Danny Chan
> >
> > 耿筱喻 <gengxiaoyu1996@gmail.com> wrote on Wed, Jun 2, 2021 at 10:42 PM:
> >
> >> Hi,
> >> Currently, the Hudi index implementation is pluggable and provides two
> >> options: bloom filter and HBase. When a Hudi table becomes large, the
> >> performance of the bloom filter degrades drastically due to the increase
> >> in the false positive probability.
> >>
> >> A hash index is an efficient, lightweight approach to address this
> >> performance issue. Hive uses it as the Bucket feature, which clusters
> >> records whose keys have the same hash value under a fixed hash function.
> >> This pre-distribution can accelerate SQL queries in some scenarios.
> >> Besides, Bucket in Hive offers efficient sampling.
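
The Hive-style bucketing described above can be sketched briefly. This is a
toy stand-in (CRC32 replaces Hive's own column hash) showing why an equi-join
on the bucket key only needs to match same-numbered buckets:

```python
import zlib
from collections import defaultdict

# Toy sketch of Hive-style bucketing: keys with equal hash modulo the bucket
# count land in the same bucket, so a join on the bucket key only matches
# bucket i of one table against bucket i of the other.

NUM_BUCKETS = 4

def bucket_of(key: str) -> int:
    # Stable stand-in hash; Hive uses its own hash of the bucket column.
    return zlib.crc32(key.encode()) % NUM_BUCKETS

def bucketize(keys):
    buckets = defaultdict(list)
    for k in keys:
        buckets[bucket_of(k)].append(k)
    return buckets

left = bucketize(["u1", "u2", "u3", "u4"])
right = bucketize(["u2", "u4", "u5"])
# The join scans only within same-numbered buckets:
matches = [k for b in range(NUM_BUCKETS)
           for k in left.get(b, []) if k in right.get(b, [])]
assert sorted(matches) == ["u2", "u4"]
```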
> >>
> >> I have created an RFC for this:
> >> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>
> >> Feel free to discuss under this thread; suggestions are welcome.
> >>
> >> Regards,
> >> Shawy
>
>
