hudi-dev mailing list archives

From Jialun Liu <liujialu...@gmail.com>
Subject Re: Could Hudi Data lake support low latency, high throughput random reads?
Date Sun, 06 Jun 2021 04:28:37 GMT
Hey Felix,

Thanks for your reply!

I briefly researched Presto; it looks like it is designed to support highly
concurrent big data SQL queries. The official doc suggests it can process
queries in times ranging from sub-seconds to minutes.
https://prestodb.io/
"Presto is targeted at analysts who expect response times ranging from
sub-second to minutes."

However, the doc also seems to suggest that it is meant for analysts
running offline analytic queries, and that it is not designed to be used as
an OLTP database.
https://prestodb.io/docs/current/overview/use-cases.html

I am wondering whether it is technically possible today to use a data lake
to support millisecond-latency, high-throughput random reads at all. Am I
just not thinking in the right direction? Maybe it is simply not sane to
serve an online request-response service with a data lake as the backend?

Best regards,
Bill

On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
<felix.jose@philips.com.invalid> wrote:

> Hi Bill,
>
> Did you try using Presto (from EMR) to query HUDI tables on S3? It can
> support real-time queries. You also have to partition your data properly
> to minimize the amount of data each query has to scan/process.
>
> Regards,
> Felix K Jose
> From: Jialun Liu <liujialun10@gmail.com>
> Date: Saturday, June 5, 2021 at 3:53 PM
> To: dev@hudi.apache.org <dev@hudi.apache.org>
> Subject: Could Hudi Data lake support low latency, high throughput random
> reads?
>
>
> Hey guys,
>
> I am not sure if this is the right forum for this question; if you know
> where it should be directed, I would appreciate your help!
>
> The question is that "Could Hudi Data lake support low latency, high
> throughput random reads?".
>
> I am considering building a data lake that produces auxiliary information
> for my main service table. For example, say my main service is S3, and I
> want to produce the S3 object pull count as auxiliary information. I am
> going to use Apache Hudi and EMR to process the S3 access logs to produce
> the pull count. What I don't know is whether a data lake can support low
> latency, high throughput random reads for an online request-response type
> of service, so that I could serve this information to customers in real
> time.
>
> I could write the auxiliary information (the pull count) back to the main
> service table, but I don't think that is a sustainable architecture: it
> would be hard to do independent, agile development if I keep adding more
> derived attributes to the main table.
>
> Any help would be appreciated!
>
> Best regards,
> Bill
>
>
