MIME-Version: 1.0
References: <4888B31D-FFD9-42FB-A050-99556936A48C@gmail.com>
 <CADXAPZE7L0VJJ0gMvmwb7L=AMTmzNegHNYg5oqeJ8mMcnEOH5g@mail.gmail.com> <F415E170-A328-4171-8919-C41119A01012@gmail.com>
In-Reply-To: <F415E170-A328-4171-8919-C41119A01012@gmail.com>
From: Vinoth Chandar <vinoth@apache.org>
Date: Fri, 4 Jun 2021 06:24:04 -0700
X-Gmail-Original-Message-ID: <CAKw-+5Q1y3qbZ4pN=uUB6htxJURFoyC-1Yg_v-tciCdoNw2N_w@mail.gmail.com>
Message-ID: <CAKw-+5Q1y3qbZ4pN=uUB6htxJURFoyC-1Yg_v-tciCdoNw2N_w@mail.gmail.com>
Subject: Re: [DISCUSS] Hash Index for HUDI
To: dev <dev@hudi.apache.org>
Content-Type: multipart/alternative; boundary="000000000000bb0e4705c3f09b60"

--000000000000bb0e4705c3f09b60
Content-Type: text/plain; charset="UTF-8"
Content-Transfer-Encoding: quoted-printable

Thanks for opening the RFC! At first glance, it seemed similar to RFC-08,
but the proposal seems to be adding a bucket id to each file group ID?
If I may suggest, we should call this BucketedIndex?

Instead of changing the existing file name, can we simply assign the
fileGroupID as the hash mod value? I.e., just make the fileGroupIDs 0 to
numBuckets-1 (plus some hash of the partition path, for uniqueness across
the table)?
That way, this is a localized change, not a major change in how we name
files/objects.
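
A rough sketch of the idea, in Python for brevity. The function names, the
use of CRC32 as the hash, and the ID layout are all assumptions made for
illustration; none of them are existing Hudi APIs or part of the RFC.

```python
import zlib


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a bucket in [0, num_buckets - 1]."""
    # CRC32 is used only because it is deterministic across processes;
    # the actual hash function is an open choice (assumption).
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def file_group_id(partition_path: str, record_key: str, num_buckets: int) -> str:
    """Hypothetical file group ID of the form <partition hash>-<bucket id>."""
    # The partition hash prefix makes the ID differ between partitions,
    # while the suffix is simply the hash mod value.
    partition_hash = zlib.crc32(partition_path.encode("utf-8"))
    return f"{partition_hash:08x}-{bucket_id(record_key, num_buckets):04d}"


if __name__ == "__main__":
    print(file_group_id("2021/06/04", "uuid-1", 8))
```

With this layout, locating the file group for a key is a pure function of
the key and the partition path, so no file renaming is involved.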

I will review the RFC more carefully, early next week.

Thanks
Vinoth


On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻 <gengxiaoyu1996@gmail.com> wrote:

> Thank you for your questions.
>
> For the first question, we recommend expanding the number of buckets by a
> multiple of the current count. Rehashing can then be combined with
> clustering to redistribute the data without shuffling. For example, 2
> buckets expand to 4 by splitting the 1st bucket and rehashing its data
> into two smaller buckets: the 1st and 3rd buckets.
> Details have been added to the RFC.
>
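
The split described in the quoted paragraph can be illustrated with a
minimal Python sketch. It assumes bucket = hash(key) mod numBuckets and
0-based bucket indexes (so the "1st" bucket is index 0); the function
names and the CRC32 hash are hypothetical and not taken from the RFC.

```python
import zlib


def bucket_of(key: str, num_buckets: int) -> int:
    """Bucket index in [0, num_buckets - 1] (CRC32 is an assumed hash)."""
    return zlib.crc32(key.encode("utf-8")) % num_buckets


def split_bucket(keys, old_bucket: int, old_num_buckets: int, factor: int = 2):
    """Rehash the keys of one old bucket after growing the bucket count.

    Because the new count is a multiple of the old one, every key lands in
    old_bucket + i * old_num_buckets for some i in [0, factor), so only the
    data of the bucket being split has to be rewritten.
    """
    new_num_buckets = old_num_buckets * factor
    result = {}
    for key in keys:
        if bucket_of(key, old_num_buckets) != old_bucket:
            raise ValueError(f"{key!r} does not belong to bucket {old_bucket}")
        result.setdefault(bucket_of(key, new_num_buckets), []).append(key)
    return result


if __name__ == "__main__":
    keys = [f"key-{i}" for i in range(20)]
    first_bucket = [k for k in keys if bucket_of(k, 2) == 0]
    # Keys from index 0 can only end up at index 0 or index 2.
    print(split_bucket(first_bucket, old_bucket=0, old_num_buckets=2))
```

Going from 2 to 4 buckets, index 0 (the "1st" bucket) splits into indexes 0
and 2 (the "1st" and "3rd"), which matches the example in the quoted text.
Growing by a non-multiple, such as 100 to 120, would not have this
property, since keys from one old bucket could land in any new bucket.
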
> For the second question, data skew when writing to Hudi with the hash
> index can be handled by using multiple file groups per bucket, as
> mentioned in the RFC. For data processing engines like Spark, data skew
> in table joins can be handled by splitting the skewed partition into
> smaller units and distributing them to different tasks; this works in
> some scenarios that have a fixed SQL pattern. Besides, the data skew
> solution needs more effort to be compatible with the bucket join rule.
> However, the long-tail reads and writes caused by data skew in SQL
> queries are hard to solve.
>
> Regards,
> Shawy
>
> > On Jun 3, 2021, at 10:47, Danny Chan <danny0405@apache.org> wrote:
> >
> > Thanks for the new feature, very promising ~
> >
> > Some confusion about the *Scalability* and *Data Skew* part:
> >
> > How do we expand the number of existing buckets? Say we had 100
> > buckets before but have 120 buckets now; what is the algorithm?
> >
> > About the data skew, did you mean there is no good solution to this
> > problem right now?
> >
> > Best,
> > Danny Chan
> >
> > On Wed, Jun 2, 2021 at 10:42 PM 耿筱喻 <gengxiaoyu1996@gmail.com> wrote:
> >
> >> Hi,
> >> Currently, the Hudi index implementation is pluggable and provides two
> >> options: bloom filter and HBase. When a Hudi table becomes large, the
> >> performance of the bloom filter degrades drastically due to the
> >> increase in false positive probability.
> >>
> >> A hash index is an efficient, lightweight approach to address this
> >> performance issue. It is used in Hive, where it is called Bucket, and
> >> it clusters the records whose keys have the same hash value under a
> >> single hash function. This pre-distribution can accelerate SQL queries
> >> in some scenarios. Besides, Bucket in Hive offers efficient sampling.
> >>
> >> I have written an RFC for this:
> >> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>
> >> Feel free to discuss in this thread; suggestions are welcome.
> >>
> >> Regards,
> >> Shawy
>
>

--000000000000bb0e4705c3f09b60--