hudi-dev mailing list archives

From Jialun Liu <liujialu...@gmail.com>
Subject Re: Could Hudi Data lake support low latency, high throughput random reads?
Date Sun, 27 Jun 2021 23:03:04 GMT
Thanks a lot Vinoth!

Best regards,
Bill

On Sat, Jun 26, 2021 at 9:24 PM Vinoth Chandar <vinoth@apache.org> wrote:

> Yes. That's a working approach.
>
> One thing I would like to suggest is using Hudi’s incremental queries to
> update DynamoDB, as opposed to periodically exporting the full table.
> Depending on how much of your target DynamoDB table changes between
> loads, it can save you cost and time.
>
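To make the incremental-pull suggestion concrete, here is a minimal pure-Python sketch of the idea. This is not the actual Hudi API (in Spark you would read with `hoodie.datasource.query.type=incremental` and a begin instant time); the records, keys, and commit times below are made up for illustration:

```python
# Sketch: incremental pull vs. full export, simulated in pure Python.
# Hudi tags each row with a commit instant; an incremental query returns
# only rows committed after a checkpoint instant, so a periodic sync job
# upserts a small delta into DynamoDB instead of rewriting every row.

def incremental_pull(records, last_instant):
    """Return only records committed after the checkpoint instant."""
    return [r for r in records if r["_hoodie_commit_time"] > last_instant]

# Hypothetical table state (3 rows, each stamped with its commit instant).
table = [
    {"key": "a", "pulls": 10, "_hoodie_commit_time": "20210601000000"},
    {"key": "b", "pulls": 7,  "_hoodie_commit_time": "20210610000000"},
    {"key": "c", "pulls": 3,  "_hoodie_commit_time": "20210620000000"},
]

# A full export would rewrite all 3 rows into DynamoDB every run; the
# incremental query touches only what changed since the last checkpoint.
delta = incremental_pull(table, last_instant="20210605000000")
print([r["key"] for r in delta])  # only "b" and "c" changed
```

The saving grows with table size: if 1% of rows change between loads, the sync writes 1% of the rows.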
> On Sat, Jun 26, 2021 at 5:43 PM Jialun Liu <liujialun10@gmail.com> wrote:
>
> > Hey Vinoth,
> >
> > Thanks for your reply!
> >
> > I am actually looking in a different direction atm: basically, write
> > the transformed data into an OLTP database, e.g. DynamoDB; any data
> > that needs to support low-latency, high-throughput reads would be
> > exported periodically.
> >
> > Not sure if this is the right pattern; I'd appreciate it if you can
> > point me to any similar architecture that I could study.
> >
> > Best regards,
> > Bill
> >
> > On Wed, Jun 23, 2021 at 3:51 PM Vinoth Chandar <vinoth@apache.org> wrote:
> >
> > > >>>> Maybe it is just not sane to serve online request-response
> > > service using Data lake as backend?
> > > In general, data lakes have not evolved beyond analytics and ML at
> > > this point, i.e. they are optimized for large batch scans.
> > >
> > > Not to say that this cannot be possible, but I am skeptical that it
> > > will ever be as low-latency as your regular OLTP database. Object
> > > store random reads are definitely going to cost ~100ms, like reading
> > > from a highly loaded hard drive.
> > >
> > > Hudi does support an HFile format, which is more optimized for random
> > > reads. We use it to store and serve table metadata. So that path is
> > > worth pursuing, if you have the appetite for trying to change the
> > > norm here. :) There is probably some work to do here to scale it for
> > > large amounts of data.
> > >
> > > Hope that helps.
> > >
> > > Thanks
> > > Vinoth
> > >
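For context on why an HFile-style layout suits random reads, here is a toy sketch of the concept (sorted keys, a small block index, binary search to one block). This illustrates the file-format idea only; it is not Hudi's or HBase's actual implementation, and the keys and block size are made up:

```python
import bisect

# HFile idea in miniature: rows are stored sorted by key in fixed-size
# blocks, and an in-memory index of each block's first key lets a reader
# seek straight to one block instead of scanning the whole file.

BLOCK_SIZE = 2  # tiny blocks, purely for illustration

def build_hfile(records):
    """Sort records by key, split into blocks, and build a block index."""
    rows = sorted(records.items())
    blocks = [rows[i:i + BLOCK_SIZE] for i in range(0, len(rows), BLOCK_SIZE)]
    index = [b[0][0] for b in blocks]  # first key of each block
    return blocks, index

def point_lookup(blocks, index, key):
    """Binary-search the block index, then scan only a single block."""
    i = bisect.bisect_right(index, key) - 1
    if i < 0:
        return None
    for k, v in blocks[i]:
        if k == key:
            return v
    return None

blocks, index = build_hfile({"a": 1, "c": 3, "e": 5, "g": 7, "i": 9})
print(point_lookup(blocks, index, "e"))  # touches one block, not the file
```

A plain columnar scan format would have to read much more data to answer the same single-key question, which is the gap Vinoth is pointing at.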
> > > On Mon, Jun 7, 2021 at 4:04 PM Jialun Liu <liujialun10@gmail.com> wrote:
> > >
> > > > Hey Gary,
> > > >
> > > > Thanks for your reply!
> > > >
> > > > It is kind of sad that we are not able to serve the insights to
> > > > commercial customers in real time.
> > > >
> > > > Do we have any best practices or design patterns to get around the
> > > > problem, in order to support online service for low-latency,
> > > > high-throughput random reads, by any chance?
> > > >
> > > > Best regards,
> > > > Bill
> > > >
> > > > On Sun, Jun 6, 2021 at 2:19 AM Gary Li <garyli@apache.org> wrote:
> > > >
> > > > > Hi Bill,
> > > > >
> > > > > Data lakes are used for offline analytics workloads with minutes
> > > > > of latency. A data lake (at least Hudi, for now) doesn't fit the
> > > > > online request-response service you described.
> > > > >
> > > > > Best,
> > > > > Gary
> > > > >
> > > > On Sun, Jun 6, 2021 at 12:29 PM Jialun Liu <liujialun10@gmail.com> wrote:
> > > > >
> > > > > > Hey Felix,
> > > > > >
> > > > > > Thanks for your reply!
> > > > > >
> > > > > > I briefly researched Presto; it looks like it is designed to
> > > > > > support highly concurrent big-data SQL queries. The official doc
> > > > > > suggests it could process queries in sub-seconds to minutes.
> > > > > > https://prestodb.io/
> > > > > > "Presto is targeted at analysts who expect response times ranging
> > > > > > from sub-second to minutes."
> > > > > >
> > > > > > However, the doc seems to suggest that it is supposed to be used
> > > > > > by analysts running offline queries, and it is not designed to be
> > > > > > used as an OLTP database.
> > > > > > https://prestodb.io/docs/current/overview/use-cases.html
> > > > > >
> > > > > > I am wondering if it is technically possible at all today to use
> > > > > > a data lake to support millisecond-latency, high-throughput
> > > > > > random reads. Am I just not thinking in the right direction?
> > > > > > Maybe it is just not sane to serve an online request-response
> > > > > > service using a data lake as the backend?
> > > > > >
> > > > > > Best regards,
> > > > > > Bill
> > > > > >
> > > > > > On Sat, Jun 5, 2021 at 1:33 PM Kizhakkel Jose, Felix
> > > > > > <felix.jose@philips.com.invalid> wrote:
> > > > > >
> > > > > > > Hi Bill,
> > > > > > >
> > > > > > > Did you try using Presto (from EMR) to query HUDI tables on S3?
> > > > > > > It could support real-time queries. And you have to partition
> > > > > > > your data properly to minimize the amount of data each query
> > > > > > > has to scan/process.
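A toy illustration of the partition-pruning point (pure Python, not Presto or Hudi itself; the partition layout, object names, and counts are hypothetical):

```python
# With data laid out by a partition key (here, date), a query that filters
# on that key reads only the matching partitions instead of the whole
# table. Hudi on S3 stores files under partition paths roughly like
# s3://bucket/table/2021-06-05/..., which is what makes this pruning work.

partitions = {
    "2021-06-05": [("obj1", 4), ("obj2", 9)],
    "2021-06-06": [("obj1", 2)],
    "2021-06-07": [("obj3", 5)],
}

def query_pull_counts(day):
    """Partition pruning: touch only the partition the filter selects."""
    return dict(partitions.get(day, []))

print(query_pull_counts("2021-06-06"))  # reads 1 partition out of 3
```

A query without a partition-key filter would have to scan every partition, which is why unpartitioned (or badly partitioned) tables cannot hit the low latencies being discussed here.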
> > > > > > >
> > > > > > > Regards,
> > > > > > > Felix K Jose
> > > > > > > From: Jialun Liu <liujialun10@gmail.com>
> > > > > > > Date: Saturday, June 5, 2021 at 3:53 PM
> > > > > > > To: dev@hudi.apache.org <dev@hudi.apache.org>
> > > > > > > Subject: Could Hudi Data lake support low latency, high
> > > > > > > throughput random reads?
> > > > > > >
> > > > > > > Hey guys,
> > > > > > >
> > > > > > > I am not sure if this is the right forum for this question;
> > > > > > > if you know where this should be directed, I'd appreciate your
> > > > > > > help!
> > > > > > >
> > > > > > > The question is: "Could a Hudi data lake support low-latency,
> > > > > > > high-throughput random reads?"
> > > > > > >
> > > > > > > I am considering building a data lake that produces auxiliary
> > > > > > > information for my main service table. For example, say my
> > > > > > > main service is S3 and I want to produce the S3 object pull
> > > > > > > count as the auxiliary information. I am going to use Apache
> > > > > > > Hudi and EMR to process the S3 access logs to produce the pull
> > > > > > > count. Now, what I don't know is: can a data lake support
> > > > > > > low-latency, high-throughput random reads for an online
> > > > > > > request-response type of service? This way I could serve this
> > > > > > > information to customers in real time.
> > > > > > >
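To sketch the batch side of the pipeline described above (a pure-Python stand-in for the Spark/EMR job; the log format here is a made-up simplification of S3 server access logs, and the result would be upserted into a Hudi table in the real pipeline):

```python
from collections import Counter

# Hypothetical access-log lines: "<date> <verb> <object-key> <status>".
access_log = [
    "2021-06-05 GET mybucket/report.csv 200",
    "2021-06-05 GET mybucket/report.csv 200",
    "2021-06-05 GET mybucket/image.png 200",
    "2021-06-05 PUT mybucket/report.csv 200",  # a write, not a pull
]

def pull_counts(lines):
    """Count successful GETs per object key."""
    counts = Counter()
    for line in lines:
        _, verb, key, status = line.split()
        if verb == "GET" and status == "200":
            counts[key] += 1
    return dict(counts)

print(pull_counts(access_log))
```

The open question in the thread is not this batch aggregation (which data lakes handle well) but whether the resulting counts can then be served back at OLTP latencies.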
> > > > > > > I could write the auxiliary information, the pull count, back
> > > > > > > to the main service table, but I personally don't think that
> > > > > > > is a sustainable architecture. It would be hard to do
> > > > > > > independent and agile development if I continue to add more
> > > > > > > derived attributes to the main table.
> > > > > > >
> > > > > > > Any help would be appreciated!
> > > > > > >
> > > > > > > Best regards,
> > > > > > > Bill
> > > > > > >
> > > > > >
> > > > >
> > > >
> > >
> >
>
