hudi-dev mailing list archives

From Gary Li <gar...@apache.org>
Subject Re: Could Hudi Data lake support low latency, high throughput random reads?
Date Sun, 06 Jun 2021 09:19:01 GMT
Hi Bill,

Data lakes are typically used for offline analytics workloads with latency
on the order of minutes. A data lake (at least Hudi) is not a good fit, for
now, for the online request-response service you described.

Best,
Gary

On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu <liujialun10@gmail.com> wrote:

> Hey Felix,
>
> Thanks for your reply!
>
> I briefly looked into Presto. It appears to be designed for highly
> concurrent SQL queries over big data, and the official docs suggest it
> can process queries in sub-second to minutes.
> https://prestodb.io/
> "Presto is targeted at analysts who expect response times ranging from
> sub-second to minutes."
>
> However, the doc seems to suggest that it is supposed to be used by
> analysts running offline queries, and it is not designed to be used as an
> OLTP database.
> https://prestodb.io/docs/current/overview/use-cases.html
>
> I am wondering if it is technically possible to use a data lake to support
> millisecond-latency, high-throughput random reads at all today? Am I just
> not thinking in the right direction? Maybe it is just not sane to serve an
> online request-response service using a data lake as the backend?
>
> Best regards,
> Bill
>
> On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> <felix.jose@philips.com.invalid> wrote:
>
> > Hi Bill,
> >
> > Did you try using Presto (from EMR) to query Hudi tables on S3? It
> > could support real-time queries. You will have to partition your data
> > properly to minimize the amount of data each query has to scan/process.
> >
> > Regards,
> > Felix K Jose
> > From: Jialun Liu <liujialun10@gmail.com>
> > Date: Saturday, June 5, 2021 at 3:53 PM
> > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > Subject: Could Hudi Data lake support low latency, high throughput random
> > reads?
> >
> >
> > Hey guys,
> >
> > I am not sure if this is the right forum for this question. If you know
> > where it should be directed, I would appreciate your help!
> >
> > The question is: "Could Hudi Data lake support low latency, high
> > throughput random reads?"
> >
> > I am considering building a data lake that produces auxiliary information
> > for my main service table. For example, say my main service is S3 and I
> > want to produce the S3 object pull count as the auxiliary information. I
> > am going to use Apache Hudi and EMR to process the S3 access logs to
> > produce the pull count. What I don't know is whether a data lake can
> > support low-latency, high-throughput random reads for an online
> > request-response type of service. That way I could serve this information
> > to customers in real time.
> >
> > I could write the auxiliary information, the pull count, back to the main
> > service table, but I personally don't think that is a sustainable
> > architecture. It would be hard to do independent and agile development if
> > I continue to add more derived attributes to the main table.
> >
> > Any help would be appreciated!
> >
> > Best regards,
> > Bill
> >
> >
>
