hudi-dev mailing list archives

From Vinoth Chandar <vin...@apache.org>
Subject Re: Could Hudi Data lake support low latency, high throughput random reads?
Date Wed, 23 Jun 2021 22:50:55 GMT
>>>>Maybe it is just not sane to serve online request-response service
using Data lake as backend?
In general, data lakes have not evolved beyond analytics and ML at this
point, i.e. they are optimized for large batch scans.

Not to say this is impossible, but I am skeptical that it will ever be as
low-latency as your regular OLTP database.
Object store random reads are going to cost on the order of ~100ms, like
reading from a highly loaded hard drive.

Hudi does support an HFile format, which is more optimized for random reads.
We use it to store and serve table metadata.
So that path is worth pursuing, if you have the appetite for trying to
change the norm here. :)
There is probably some work to do here to scale it for large amounts of
data.
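For concreteness, switching a Hudi table's base files to HFile is done
through write configuration. A minimal sketch in PySpark terms, assuming a
Spark environment with the Hudi bundle on the classpath; the table name and
record key field below are hypothetical placeholders, and you should verify
the `hoodie.table.base.file.format` key against the docs for your Hudi
version:

```python
# Illustrative Hudi write options for an HFile-backed table.
# "hoodie.table.base.file.format" selects the base file format;
# the table name and record key here are placeholders for this sketch.
hudi_options = {
    "hoodie.table.name": "pull_counts",                     # hypothetical table
    "hoodie.datasource.write.recordkey.field": "object_key",
    "hoodie.datasource.write.table.type": "MERGE_ON_READ",
    "hoodie.table.base.file.format": "HFILE",               # random-read-friendly
}

# With a SparkSession and a DataFrame `df` in hand, the write would be:
# df.write.format("hudi").options(**hudi_options).mode("append").save(table_path)
```

HFile keeps records sorted by key and supports indexed point lookups, which
is why it suits key-based random reads better than a scan-oriented columnar
format like Parquet.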

Hope that helps.

Thanks
Vinoth

On Mon, Jun 7, 2021 at 4:04 PM Jialun Liu <liujialun10@gmail.com> wrote:

> Hey Gary,
>
> Thanks for your reply!
>
> This is kinda sad that we are not able to serve the insights to commercial
> customers in real time.
>
> Do we have any best practices/ design patterns to get around the problem in
> order to support online service for low latency, high throughput random
> reads by any chance?
>
> Best regards,
> Bill
>
> On Sun, Jun 6, 2021 at 2:19 AM Gary Li <garyli@apache.org> wrote:
>
> > Hi Bill,
> >
> > Data lakes are used for offline analytics workloads with minutes of
> > latency. For now, a data lake (at least Hudi) doesn't fit the online
> > request-response service you described.
> >
> > Best,
> > Gary
> >
> > On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu <liujialun10@gmail.com>
> wrote:
> >
> > > Hey Felix,
> > >
> > > Thanks for your reply!
> > >
> > > I briefly researched Presto; it looks like it is designed to support
> > > highly concurrent big data SQL queries. The official doc suggests it
> > > can process queries in sub-seconds to minutes.
> > > https://prestodb.io/
> > > "Presto is targeted at analysts who expect response times ranging from
> > > sub-second to minutes."
> > >
> > > However, the doc seems to suggest that it is supposed to be used by
> > > analysts running offline queries, and it is not designed to be used
> > > as an OLTP database.
> > > https://prestodb.io/docs/current/overview/use-cases.html
> > >
> > > I am wondering if it is technically possible today to use a data lake
> > > to support milliseconds-latency, high-throughput random reads at all.
> > > Am I just not thinking in the right direction? Maybe it is just not
> > > sane to serve an online request-response service using a data lake as
> > > the backend?
> > >
> > > Best regards,
> > > Bill
> > >
> > > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> > > <felix.jose@philips.com.invalid> wrote:
> > >
> > > > Hi Bill,
> > > >
> > > > Did you try using Presto (from EMR) to query Hudi tables on S3? It
> > > > could support real-time queries. You also have to partition your
> > > > data properly to minimize the amount of data each query has to
> > > > scan/process.
> > > >
> > > > Regards,
> > > > Felix K Jose
> > > > From: Jialun Liu <liujialun10@gmail.com>
> > > > Date: Saturday, June 5, 2021 at 3:53 PM
> > > > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > > > Subject: Could Hudi Data lake support low latency, high throughput
> > > > random reads?
> > > >
> > > >
> > > > Hey guys,
> > > >
> > > > I am not sure if this is the right forum for this question; if you
> > > > know where it should be directed, I'd appreciate your help!
> > > >
> > > > The question is that "Could Hudi Data lake support low latency, high
> > > > throughput random reads?".
> > > >
> > > > I am considering building a data lake that produces auxiliary
> > > > information for my main service table. For example, say my main
> > > > service is S3 and I want to produce the S3 object pull count as the
> > > > auxiliary information. I am going to use Apache Hudi and EMR to
> > > > process the S3 access logs to produce the pull count. What I don't
> > > > know is whether a data lake can support low-latency, high-throughput
> > > > random reads for an online request-response type of service. That
> > > > way I could serve this information to customers in real time.
> > > >
> > > > I could write the auxiliary information (pull count) back to the
> > > > main service table, but I personally don't think that is a
> > > > sustainable architecture. It would be hard to do independent and
> > > > agile development if I continue to add more derived attributes to
> > > > the main table.
> > > >
> > > > Any help would be appreciated!
> > > >
> > > > Best regards,
> > > > Bill
> > > >
> > >
> >
>
