hudi-dev mailing list archives

From Jialun Liu <liujialu...@gmail.com>
Subject Re: Could Hudi Data lake support low latency, high throughput random reads?
Date Mon, 07 Jun 2021 23:03:38 GMT
Hey Gary,

Thanks for your reply!

It's kinda sad that we are not able to serve the insights to commercial
customers in real time.

Are there any best practices or design patterns for getting around this
problem, so that an online service can get low latency, high throughput
random reads?

Best regards,
Bill

On Sun, Jun 6, 2021 at 2:19 AM Gary Li <garyli@apache.org> wrote:

> Hi Bill,
>
> Data lakes are used for offline analytics workloads with latency in the
> minutes. A data lake (at least Hudi) is not, for now, a fit for the online
> request-response service you described.
>
> Best,
> Gary
>
> On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu <liujialun10@gmail.com> wrote:
>
> > Hey Felix,
> >
> > Thanks for your reply!
> >
> > I briefly looked into Presto; it appears to be designed to support highly
> > concurrent big data SQL queries. The official docs suggest it can process
> > queries in sub-seconds to minutes.
> > https://prestodb.io/
> > "Presto is targeted at analysts who expect response times ranging from
> > sub-second to minutes."
> >
> > However, the doc seems to suggest that it is supposed to be used by
> > analysts running offline queries, and it is not designed to be used as an
> > OLTP database.
> > https://prestodb.io/docs/current/overview/use-cases.html
> >
> > I am wondering whether it is technically possible today to use a data lake
> > to support millisecond latency, high throughput random reads at all. Am I
> > just not thinking in the right direction? Maybe it is just not sane to
> > serve an online request-response service with a data lake as the backend?
> >
> > Best regards,
> > Bill
> >
> > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> > <felix.jose@philips.com.invalid> wrote:
> >
> > > Hi Bill,
> > >
> > > Did you try using Presto (from EMR) to query Hudi tables on S3? It could
> > > support real time queries. You would have to partition your data
> > > properly to minimize the amount of data each query has to scan/process.
> > >
> > > Regards,
> > > Felix K Jose
> > > From: Jialun Liu <liujialun10@gmail.com>
> > > Date: Saturday, June 5, 2021 at 3:53 PM
> > > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > > Subject: Could Hudi Data lake support low latency, high throughput
> > > random reads?
> > >
> > >
> > > Hey guys,
> > >
> > > I am not sure if this is the right forum for this question; if you know
> > > where it should be directed, I would appreciate your help!
> > >
> > > The question is: "Could Hudi Data lake support low latency, high
> > > throughput random reads?"
> > >
> > > I am considering building a data lake that produces auxiliary
> > > information for my main service table. For example, say my main service
> > > is S3 and I want to produce the S3 object pull count as the auxiliary
> > > information. I am going to use Apache Hudi and EMR to process the S3
> > > access log to produce the pull count. What I don't know is whether a data
> > > lake can support low latency, high throughput random reads for an online
> > > request-response type of service. That would let me serve this
> > > information to customers in real time.
> > >
> > > I could write the auxiliary information, pull count, back to the main
> > > service table, but I personally don't think it is a sustainable
> > > architecture. It would be hard to do independent and agile development
> > > if I continue to add more derived attributes to the main table.
> > >
> > > Any help would be appreciated!
> > >
> > > Best regards,
> > > Bill
> > >
> >
>
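
The pull-count aggregation described in the original message could be
sketched roughly as below. This is only a minimal illustration: it assumes
a simplified, hypothetical log record of the form
"<operation> <key> <http_status>", and the function name pull_counts is
made up for the example. Real S3 access logs carry many more fields and
need a proper parser, and the thread does not say how the actual EMR/Hudi
job would be written.

```python
from collections import Counter


def pull_counts(log_lines):
    """Count successful object reads per key.

    Assumes a simplified, hypothetical record layout of
    "<operation> <key> <http_status>" on each line.
    """
    counts = Counter()
    for line in log_lines:
        parts = line.split()
        if len(parts) != 3:
            continue  # skip records that do not match the assumed layout
        operation, key, status = parts
        # Only successful reads count as a "pull" in this sketch.
        if operation == "GET" and status == "200":
            counts[key] += 1
    return counts


# Hypothetical sample records, not real log data.
sample = [
    "GET photos/cat.jpg 200",
    "GET photos/cat.jpg 200",
    "PUT photos/dog.jpg 200",
    "GET photos/dog.jpg 404",
    "GET docs/readme.txt 200",
]
counts = pull_counts(sample)
print(counts["photos/cat.jpg"])
```

In the setup discussed in the thread, an aggregation like this would run as
a batch job over the access logs, with the per-key counts written to a table
kept apart from the main service table.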
