From: Satish Kotha
Date: Wed, 2 Jun 2021 10:51:22 -0700
Subject: Re: [DISCUSS] Hash Index for HUDI
To: dev@hudi.apache.org

+1. You may want to read this thread as well. There are minor differences between the two threads, but the high-level idea is similar.

On Wed, Jun 2, 2021 at 7:42 AM 耿筱喻 wrote:

> Hi,
>
> Currently, the Hudi index implementation is pluggable and provides two
> options: bloom filter and hbase. When a Hudi table becomes large, the
> performance of the bloom filter index degrades drastically because the
> false positive probability increases.
>
> A hash index is an efficient, lightweight approach to address this
> performance issue. Hive uses the same idea under the name Bucket: records
> whose keys have the same hash value, under a single hash function, are
> clustered together. This pre-distribution can speed up SQL queries in
> some scenarios. Besides, Bucket in Hive also enables efficient sampling.
>
> I have written an RFC for this:
> https://cwiki.apache.org/confluence/display/HUDI/RFC+-+29%3A+Hash+Index
>
> Feel free to discuss in this thread; suggestions are welcome.
>
> Regards,
> Shawy
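
For readers new to the bucketing idea in the quoted proposal, here is a minimal sketch in Python. It is not Hudi's or Hive's implementation and does not follow the RFC's design. The bucket count, the choice of CRC32 as the hash, and the names `bucket_id`, `assign_buckets`, and `locate` are all assumptions made for illustration. It only shows the general technique: a deterministic hash of the record key picks a bucket, so a lookup inspects one bucket instead of probing a filter for every file.

```python
import zlib

# Hypothetical fixed number of buckets; a real table would configure this.
NUM_BUCKETS = 8


def bucket_id(record_key: str, num_buckets: int = NUM_BUCKETS) -> int:
    """Map a record key to a bucket using a deterministic hash."""
    # zlib.crc32 is stable across processes, unlike Python's built-in hash()
    # for strings, which is randomized per interpreter run.
    return zlib.crc32(record_key.encode("utf-8")) % num_buckets


def assign_buckets(record_keys, num_buckets: int = NUM_BUCKETS) -> dict:
    """Group record keys by bucket id (the 'pre-distribution' step)."""
    buckets = {}
    for key in record_keys:
        buckets.setdefault(bucket_id(key, num_buckets), []).append(key)
    return buckets


def locate(record_key: str, buckets: dict, num_buckets: int = NUM_BUCKETS) -> bool:
    """Check whether a key exists by examining only its own bucket."""
    return record_key in buckets.get(bucket_id(record_key, num_buckets), [])


if __name__ == "__main__":
    keys = [f"user-{i}" for i in range(20)]
    table = assign_buckets(keys)
    for bid in sorted(table):
        print(bid, table[bid])
    print(locate("user-7", table), locate("user-999", table))
```

Because the bucket is computed from the key alone, there are no false positives at the bucket-selection step. The trade-off in this sketch is that the bucket count is fixed up front.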