Reply-To: dev@hudi.apache.org
Delivered-To: mailing list dev@hudi.apache.org
References: <4888B31D-FFD9-42FB-A050-99556936A48C@gmail.com>
From: Vinoth Chandar
Date: Fri, 4 Jun 2021 06:24:04 -0700
Subject: Re: [DISCUSS] Hash Index for HUDI
To: dev

Thanks for opening the RFC! At first glance it seemed similar to RFC-08,
but the proposal seems to be adding a bucket ID to each file group ID?
If I may suggest, we should call this BucketedIndex.

Instead of changing the existing file names, can we simply assign the
file group ID as the hash mod value? I.e., just make the fileGroupIDs
0 to numBuckets-1 (with some hash value of the partition path as well, for
uniqueness across the table)? This way it is a localized change, not a
major change in how we name files/objects.

I will review the RFC more carefully early next week.

Thanks
Vinoth

On Fri, Jun 4, 2021 at 3:05 AM 耿筱喻 wrote:

> Thank you for your questions.
>
> For the first question, expanding the number of buckets by a multiple is
> recommended. Combine rehashing and clustering to redistribute the data
> without shuffling. For example, 2 buckets expand to 4 by splitting the 1st
> bucket and rehashing its data into two smaller buckets: the 1st and 3rd
> buckets. Details have been added to the RFC.
>
> For the second one, data skew when writing to Hudi with the hash index can
> be solved by using multiple file groups per bucket, as mentioned in the RFC.
> For a data processing engine like Spark, data skew in table joins can be
> solved by splitting the skewed partition into smaller units and distributing
> them to different tasks to execute; this works in some scenarios that have a
> fixed SQL pattern. Besides, a data skew solution needs more effort to be
> compatible with the bucket join rule. However, the long tail in reads and
> writes caused by data skew in SQL queries is hard to solve.
>
> Regards,
> Shawy
>
>
> On Jun 3, 2021, at 10:47, Danny Chan wrote:
> >
> > Thanks for the new feature, very promising ~
> >
> > Some confusion about the *Scalability* and *Data Skew* parts:
> >
> > How do we expand the number of existing buckets? Say we had 100
> > buckets before but have 120 buckets now; what is the algorithm?
> >
> > About the data skew, did you mean there is no good solution to this
> > problem now?
> >
> > Best,
> > Danny Chan
> >
> > 耿筱喻 wrote on Wed, Jun 2, 2021 at 10:42 PM:
> >
> >> Hi,
> >> Currently, the Hudi index implementation is pluggable and provides two
> >> options: bloom filter and HBase. When a Hudi table becomes large, the
> >> performance of the bloom filter degrades drastically due to the increase
> >> in false positive probability.
> >>
> >> A hash index is an efficient, lightweight approach to address this
> >> performance issue. Hive uses it under the name Bucket: it clusters the
> >> records whose keys have the same hash value under a unique hash function.
> >> This pre-distribution can accelerate SQL queries in some scenarios.
> >> Besides, Bucket in Hive offers efficient sampling.
> >>
> >> I have written an RFC for this:
> >> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
> >>
> >> Feel free to discuss in this thread; suggestions are welcome.
> >>
> >> Regards,
> >> Shawy
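
Illustration of the bucket-splitting idea discussed in this thread. This is a minimal sketch under stated assumptions, not code from the RFC or from Hudi: the function names, the use of CRC32 as the hash, 0-based bucket numbering, and the file group ID string format are all hypothetical. It shows why expanding by a multiple keeps the move local: a record in old bucket b can only land in b, b + n, b + 2n, ... after the bucket count n is multiplied.

```python
import zlib
from typing import List


def bucket_id(record_key: str, num_buckets: int) -> int:
    """Map a record key to a 0-based bucket (CRC32 is an assumed hash)."""
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def split_targets(old_bucket: int, old_num_buckets: int, factor: int) -> List[int]:
    """Buckets that can receive records of old_bucket after the bucket
    count is multiplied by factor: h % (factor * n) is congruent to h % n
    modulo n, so only old_bucket + i * n are possible."""
    return [old_bucket + i * old_num_buckets for i in range(factor)]


def file_group_id(partition_path: str, bucket: int) -> str:
    """Hypothetical file group ID: a hash of the partition path for
    uniqueness across the table, followed by the bucket number."""
    prefix = zlib.crc32(partition_path.encode("utf-8"))
    return f"{prefix:08x}-{bucket}"


# Expanding 2 buckets to 4: every record stays in its old bucket
# or moves to exactly one new sibling bucket.
keys = [f"key-{i}" for i in range(1000)]
for key in keys:
    old = bucket_id(key, 2)
    new = bucket_id(key, 4)
    assert new in split_targets(old, 2, 2)
```

With 0-based numbering, old bucket 0 splits into buckets 0 and 2, which corresponds to the "1st and 3rd" buckets in the 1-based wording above. For an expansion that is not a multiple, such as 100 to 120 buckets, the same modular argument gives each old bucket six possible destination buckets, so the data cannot be redistributed by splitting one bucket at a time.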