hudi-dev mailing list archives

From Vinoth Chandar <vin...@apache.org>
Subject Re: [DISCUSS] Hash Index for HUDI
Date Wed, 30 Jun 2021 22:02:04 GMT
I see that we already have a PR up. Will catch up on it and provide some
initial comments.
Thanks!

On Wed, Jun 16, 2021 at 9:02 AM Shawy Geng <gengxiaoyu1996@gmail.com> wrote:

> Combining the bucket index and the bloom filter is a great idea. There is
> no conflict between the two in implementation, and the bloom filter info
> can still be stored in the file to locate records faster.
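>
> A minimal, illustrative sketch of such a combined lookup, assuming
> bucketId = hash(recordKey) % numBuckets and one filter per file; the class
> and method names are hypothetical, not Hudi's actual API, and a HashSet
> stands in for the real bloom filter:
>
> import java.util.ArrayList;
> import java.util.HashSet;
> import java.util.List;
> import java.util.Set;
>
> // Hypothetical sketch: prune to one bucket first, then consult only that
> // file's (bloom-filter-like) key filter instead of every file's.
> public class BucketPlusBloomSketch {
>
>     static final int NUM_BUCKETS = 4;
>     // One file per bucket in this toy example; each "file" keeps a key filter.
>     static final List<Set<String>> FILE_KEY_FILTERS = new ArrayList<>();
>     static {
>         for (int i = 0; i < NUM_BUCKETS; i++) {
>             FILE_KEY_FILTERS.add(new HashSet<>());
>         }
>     }
>
>     static int bucketId(String recordKey) {
>         return (recordKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
>     }
>
>     static void write(String recordKey) {
>         FILE_KEY_FILTERS.get(bucketId(recordKey)).add(recordKey);
>     }
>
>     static boolean mightContain(String recordKey) {
>         // Step 1: hashing the key locates the single candidate bucket/file.
>         // Step 2: only that file's filter is checked.
>         return FILE_KEY_FILTERS.get(bucketId(recordKey)).contains(recordKey);
>     }
>
>     public static void main(String[] args) {
>         write("key-1");
>         System.out.println(mightContain("key-1")); // true
>         System.out.println(mightContain("key-2")); // false
>     }
> }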
>
> Best,
> Shawy
>
> > On Jun 9, 2021, at 16:23, Thiru Malai <thiru.drake@gmail.com> wrote:
> >
> > Hi,
> >
> > > This feature seems promising. If we are planning to assign the
> > > filegroupID as the hash mod value, then we can leverage this change in
> > > the Bloom Index as well, by pruning files based on the hash mod value
> > > before the min/max record_key pruning. That way the exploded RDD will
> > > be comparatively smaller, which will eventually reduce the shuffle size
> > > in the "Compute all comparisons needed between records and files" stage.
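> > >
> > > A rough sketch of that two-stage pruning, assuming the file group
> > > carries its hash mod value; the class and field names are illustrative,
> > > not Hudi's actual Bloom Index internals:
> > >
> > > import java.util.ArrayList;
> > > import java.util.List;
> > >
> > > public class BloomIndexBucketPruningSketch {
> > >
> > >     static final int NUM_BUCKETS = 16;
> > >
> > >     static class CandidateFile {
> > >         final int bucketId;   // hash mod value the file group was assigned
> > >         final String minKey;  // min record key from the file footer
> > >         final String maxKey;  // max record key from the file footer
> > >         CandidateFile(int bucketId, String minKey, String maxKey) {
> > >             this.bucketId = bucketId;
> > >             this.minKey = minKey;
> > >             this.maxKey = maxKey;
> > >         }
> > >     }
> > >
> > >     static int bucketId(String recordKey) {
> > >         return (recordKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
> > >     }
> > >
> > >     // Stage 1: drop files from other buckets, so the exploded
> > >     // (record, file) pairs shrink before any shuffle.
> > >     // Stage 2: the usual min/max record-key range pruning.
> > >     static List<CandidateFile> candidateFiles(String recordKey, List<CandidateFile> files) {
> > >         int bucket = bucketId(recordKey);
> > >         List<CandidateFile> pruned = new ArrayList<>();
> > >         for (CandidateFile f : files) {
> > >             if (f.bucketId == bucket
> > >                     && f.minKey.compareTo(recordKey) <= 0
> > >                     && f.maxKey.compareTo(recordKey) >= 0) {
> > >                 pruned.add(f);
> > >             }
> > >         }
> > >         return pruned;
> > >     }
> > > }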
> >
> > > Can we add this hash-based pruning to the Bloom Filter based approach
> > > as well?
> >
> > On 2021/06/07 03:26:34, Danny Chan <danny0405@apache.org> wrote:
> >>> number of buckets expanded by multiple is recommended
> >> That condition is too harsh: the bucket number would grow exponentially.
> >>
> >>> with hash index can be solved by using multiple file groups per bucket
> >>> as mentioned in the RFC
> >> The relation between file groups and buckets would become too
> >> complicated; we should avoid that. It also requires the query engine to
> >> be aware of the bucketing rules, which is not transparent and is not a
> >> common query optimization.
> >>
> >> Best,
> >> Danny Chan
> >>
> >> On Fri, Jun 4, 2021 at 6:06 PM 耿筱喻 <gengxiaoyu1996@gmail.com> wrote:
> >>
> >>> Thank you for your questions.
> >>>
> >>> For the first question, expanding the number of buckets by a multiple
> >>> is recommended. Rehashing is combined with clustering to re-distribute
> >>> the data without a full shuffle. For example, 2 buckets expand to 4 by
> >>> splitting the 1st bucket and rehashing its data into two smaller
> >>> buckets: the 1st and the 3rd. Details have been added to the RFC.
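> >>>
> >>> A small sketch of that remap (0-indexed buckets here, and the
> >>> hash/modulo formula is an assumption for illustration, not necessarily
> >>> what the RFC specifies):
> >>>
> >>> public class BucketDoublingSketch {
> >>>
> >>>     static int bucketId(String recordKey, int numBuckets) {
> >>>         return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
> >>>     }
> >>>
> >>>     // When buckets double from oldN to 2 * oldN, a record in bucket i can
> >>>     // only move to bucket i or bucket i + oldN, so only the split bucket's
> >>>     // file group has to be rewritten -- no table-wide shuffle.
> >>>     static int newBucketAfterDoubling(String recordKey, int oldNumBuckets) {
> >>>         int oldBucket = bucketId(recordKey, oldNumBuckets);
> >>>         int newBucket = bucketId(recordKey, 2 * oldNumBuckets);
> >>>         assert newBucket == oldBucket || newBucket == oldBucket + oldNumBuckets;
> >>>         return newBucket;
> >>>     }
> >>>
> >>>     public static void main(String[] args) {
> >>>         // Going from 2 to 4 buckets: keys from old bucket 0 land in 0 or 2.
> >>>         for (String key : new String[]{"uuid-1", "uuid-2", "uuid-3", "uuid-4"}) {
> >>>             System.out.println(key + ": " + bucketId(key, 2)
> >>>                     + " -> " + newBucketAfterDoubling(key, 2));
> >>>         }
> >>>     }
> >>> }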
> >>>
> >>> For the second one, data skew when writing to Hudi with the hash index
> >>> can be mitigated by using multiple file groups per bucket, as mentioned
> >>> in the RFC. For a data processing engine like Spark, skew during table
> >>> joins can be handled by splitting the skewed partition into smaller
> >>> units and distributing them to different tasks, which works in some
> >>> scenarios with a fixed SQL pattern. Besides, such a skew solution needs
> >>> more effort to stay compatible with the bucket-join rule. However, the
> >>> read and write long tail caused by data skew in SQL queries is hard to
> >>> solve.
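> >>>
> >>> A rough sketch of routing within a hot bucket, purely illustrative: the
> >>> routing formula, constants, and names below are hypothetical, not
> >>> Hudi's actual API:
> >>>
> >>> import java.util.UUID;
> >>>
> >>> public class MultiFileGroupBucketSketch {
> >>>
> >>>     static final int NUM_BUCKETS = 8;
> >>>     static final int FILE_GROUPS_PER_BUCKET = 4; // extra groups absorb hot buckets
> >>>
> >>>     // Stage 1: the bucket is still fixed by the record key.
> >>>     static int bucketId(String recordKey) {
> >>>         return (recordKey.hashCode() & Integer.MAX_VALUE) % NUM_BUCKETS;
> >>>     }
> >>>
> >>>     // Stage 2: inside the bucket, a secondary hash spreads records over
> >>>     // several file groups so a single file group does not grow unbounded.
> >>>     static String fileGroupId(String recordKey) {
> >>>         int bucket = bucketId(recordKey);
> >>>         int slot = (recordKey.hashCode() >>> 16) % FILE_GROUPS_PER_BUCKET;
> >>>         return String.format("bucket-%03d-fg-%02d", bucket, slot);
> >>>     }
> >>>
> >>>     public static void main(String[] args) {
> >>>         for (int i = 0; i < 5; i++) {
> >>>             String key = UUID.randomUUID().toString();
> >>>             System.out.println(key + " -> " + fileGroupId(key));
> >>>         }
> >>>     }
> >>> }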
> >>>
> >>> Regards,
> >>> Shawy
> >>>
> >>>> On Jun 3, 2021, at 10:47, Danny Chan <danny0405@apache.org> wrote:
> >>>>
> >>>> Thanks for the new feature, very promising ~
> >>>>
> >>>> Some confusion about the *Scalability* and *Data Skew* part:
> >>>>
> >>>> How do we expand the number of existing buckets? Say we had 100
> >>>> buckets before but 120 buckets now, what is the algorithm?
> >>>>
> >>>> About the data skew, did you mean there is no good solution to this
> >>>> problem right now?
> >>>>
> >>>> Best,
> >>>> Danny Chan
> >>>>
> >>>> On Wed, Jun 2, 2021 at 10:42 PM 耿筱喻 <gengxiaoyu1996@gmail.com> wrote:
> >>>>
> >>>>> Hi,
> >>>>> Currently, the Hudi index implementation is pluggable and provides
> >>>>> two options: bloom filter and HBase. When a Hudi table becomes large,
> >>>>> the performance of the bloom filter degrades drastically due to the
> >>>>> increase in false-positive probability.
> >>>>>
> >>>>> A hash index is an efficient, lightweight approach to address this
> >>>>> performance issue. Hive uses it in the form of buckets, which cluster
> >>>>> the records whose keys have the same hash value under a single hash
> >>>>> function. This pre-distribution can accelerate SQL queries in some
> >>>>> scenarios. Besides, bucketing in Hive offers efficient sampling.
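> >>>>>
> >>>>> A minimal illustration of the bucketing idea (toy code, not from the
> >>>>> RFC): records whose keys hash to the same value modulo numBuckets
> >>>>> always land together, so a point lookup only touches one bucket:
> >>>>>
> >>>>> import java.util.ArrayList;
> >>>>> import java.util.Arrays;
> >>>>> import java.util.HashMap;
> >>>>> import java.util.List;
> >>>>> import java.util.Map;
> >>>>>
> >>>>> public class HashBucketingSketch {
> >>>>>
> >>>>>     static int bucketOf(String recordKey, int numBuckets) {
> >>>>>         return (recordKey.hashCode() & Integer.MAX_VALUE) % numBuckets;
> >>>>>     }
> >>>>>
> >>>>>     public static void main(String[] args) {
> >>>>>         int numBuckets = 4;
> >>>>>         Map<Integer, List<String>> buckets = new HashMap<>();
> >>>>>         for (String key : Arrays.asList("a", "b", "c", "d", "e", "f")) {
> >>>>>             buckets.computeIfAbsent(bucketOf(key, numBuckets), b -> new ArrayList<>()).add(key);
> >>>>>         }
> >>>>>         // Tagging a record with its bucket id locates the target file
> >>>>>         // group directly, with no per-file bloom filter checks.
> >>>>>         System.out.println("lookup 'c' -> bucket " + bucketOf("c", numBuckets));
> >>>>>         System.out.println(buckets);
> >>>>>     }
> >>>>> }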
> >>>>>
> >>>>> I have made an RFC for this:
> >>>>>
> >>>>> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>>>>
> >>>>> Feel free to discuss under this thread and suggestions are welcomed.
> >>>>>
> >>>>> Regards,
> >>>>> Shawy
> >>>
> >>>
> >>
>
>
