From: Jialun Liu
Date: Sat, 5 Jun 2021 21:28:37 -0700
Subject: Re: Could Hudi Data lake support low latency, high throughput random reads?
To: felix.jose@philips.com.invalid
Cc: "dev@hudi.apache.org"

Hey Felix,

Thanks for your reply!

I briefly looked into Presto. It looks like it is designed to support
highly concurrent SQL queries over big data. The official docs suggest it
can process queries in sub-seconds to minutes.

https://prestodb.io/
"Presto is targeted at analysts who expect response times ranging from
sub-second to minutes."

However, the docs seem to suggest that it is meant for analysts running
offline queries, and that it is not designed to be used as an OLTP
database.
https://prestodb.io/docs/current/overview/use-cases.html

I am wondering whether it is technically possible today to use a data lake
to support millisecond-latency, high-throughput random reads at all. Am I
just not thinking in the right direction? Maybe it is just not sane to
serve an online request-response service with a data lake as the backend?

Best regards,
Bill

On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix wrote:

> Hi Bill,
>
> Did you try using Presto (from EMR) to query HUDI tables on S3? It
> could support real-time queries. You also have to partition your data
> properly to minimize the amount of data each query has to scan/process.
>
> Regards,
> Felix K Jose
>
> From: Jialun Liu
> Date: Saturday, June 5, 2021 at 3:53 PM
> To: dev@hudi.apache.org
> Subject: Could Hudi Data lake support low latency, high throughput random
> reads?
>
> Hey guys,
>
> I am not sure if this is the right forum for this question; if you know
> where it should be directed, I would appreciate your help!
>
> The question is: "Could Hudi Data lake support low latency, high
> throughput random reads?"
>
> I am considering building a data lake that produces auxiliary information
> for my main service table. For example, say my main service is S3 and I
> want to produce the S3 object pull count as the auxiliary information. I
> am going to use Apache Hudi and EMR to process the S3 access log to
> produce the pull count. What I don't know is whether a data lake can
> support low-latency, high-throughput random reads for an online
> request-response type of service. That way I could serve this information
> to customers in real time.
>
> I could write the auxiliary information, the pull count, back to the main
> service table, but I personally don't think that is a sustainable
> architecture. It would be hard to do independent and agile development if
> I continue to add more derived attributes to the main table.
>
> Any help would be appreciated!
>
> Best regards,
> Bill
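
To make the pull-count example and the partitioning advice in this thread a bit more concrete, here is a minimal Python sketch. It is not Hudi or Presto code and does not reflect how either system stores or prunes data. The record shape, the sample keys, and the choice to partition by the top-level key prefix are all assumptions made for illustration, and every function name is hypothetical. It only shows the general idea: derive the count from access-log records, group the results by partition, and have a point lookup read just the one partition that can contain the key.

```python
from collections import Counter, defaultdict

# Hypothetical, heavily simplified access-log records: (operation, object_key).
# Real S3 access logs carry many more fields; this shape is an assumption.
ACCESS_LOG = [
    ("REST.GET.OBJECT", "photos/a.jpg"),
    ("REST.GET.OBJECT", "photos/a.jpg"),
    ("REST.PUT.OBJECT", "photos/b.jpg"),
    ("REST.GET.OBJECT", "docs/readme.txt"),
    ("REST.GET.OBJECT", "photos/b.jpg"),
]


def pull_counts(records):
    """Count GET requests per object key, ignoring all other operations."""
    return Counter(key for op, key in records if op == "REST.GET.OBJECT")


def partition_of(key):
    """Partition by the top-level prefix of the key (an illustrative choice)."""
    return key.split("/", 1)[0]


def build_partitions(counts):
    """Group the per-object counts by partition."""
    partitions = defaultdict(dict)
    for key, count in counts.items():
        partitions[partition_of(key)][key] = count
    return dict(partitions)


def lookup(partitions, key):
    """Read only the partition that can contain `key`.

    Returns (pull_count, rows_in_the_partition_that_was_read).
    """
    rows = partitions.get(partition_of(key), {})
    return rows.get(key, 0), len(rows)


partitions = build_partitions(pull_counts(ACCESS_LOG))
count, rows_read = lookup(partitions, "photos/a.jpg")
print(f"pull count = {count}, rows read = {rows_read}")
```

In this toy data set the lookup for `photos/a.jpg` touches only the `photos` partition rather than every row. Whether a real query engine over a data lake can turn that kind of pruning into millisecond latencies is exactly the open question of this thread, and the sketch does not answer it.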