hudi-dev mailing list archives

From Vinoth Chandar <vin...@apache.org>
Subject Re: [DISCUSS] Hudi is the data lake platform
Date Mon, 02 Aug 2021 16:39:54 GMT
Thanks! Will work on it this week.
Also redoing some images based on feedback.

On Fri, Jul 30, 2021 at 2:06 AM vino yang <yanghua1127@gmail.com> wrote:

> +1
>
> On Fri, Jul 30, 2021 at 1:47 AM Pratyaksh Sharma <pratyaksh13@gmail.com> wrote:
>
> > Guess we should rebrand Hudi on README.md file as well -
> > https://github.com/apache/hudi#readme?
> >
> > This page still mentions the following -
> >
> > "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
> > Incrementals. Hudi manages the storage of large analytical datasets on
> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
> >
> > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar <vinoth@apache.org> wrote:
> >
> >> Thanks Vino! Got a bunch of emoticons on the PR as well.
> >>
> >> Will land this Monday, giving it more time over the weekend as well.
> >>
> >>
> >> On Wed, Jul 21, 2021 at 7:36 PM vino yang <yanghua1127@gmail.com> wrote:
> >>
> >> > Thanks vc
> >> >
> >> > Very good blog, in-depth and forward-looking. Learned!
> >> >
> >> > Best,
> >> > Vino
> >> >
> >> > On Thu, Jul 22, 2021 at 3:58 AM Vinoth Chandar <vinoth@apache.org> wrote:
> >> >
> >> > > Expanding to users@ as well.
> >> > >
> >> > > Hi all,
> >> > >
> >> > > Since this discussion, I started to pen down a coherent strategy and
> >> > > convey these ideas via a blog post.
> >> > > I have also done my own research, talked to (ex)colleagues I respect
> >> > > to get their take and refine it.
> >> > >
> >> > > Here's a blog that hopefully explains this vision.
> >> > >
> >> > > https://github.com/apache/hudi/pull/3322
> >> > >
> >> > > Look forward to your feedback on the PR. We are hoping to land this
> >> > > early next week, if everyone is aligned.
> >> > >
> >> > > Thanks
> >> > > Vinoth
> >> > >
> >> > > On Wed, Apr 21, 2021 at 9:01 PM wei li <lw309637554@gmail.com> wrote:
> >> > >
> >> > > > +1, cannot agree more.
> >> > > > *aux metadata* and the metadata table can give Hudi large
> >> > > > performance optimizations on the query end.
> >> > > > We can continue developing these.
> >> > > > A cache service may be a necessary component in a cloud-native
> >> > > > environment.
> >> > > >
> >> > > > On 2021/04/13 05:29:55, Vinoth Chandar <vinoth@apache.org> wrote:
> >> > > > > Hello all,
> >> > > > >
> >> > > > > Reading one more article today, positioning Hudi as just a table
> >> > > > > format, made me wonder if we have done enough justice in explaining
> >> > > > > what we have built together here.
> >> > > > > I tend to think of Hudi as the data lake platform, which has the
> >> > > > > following components, of which one is a table format and one is a
> >> > > > > transactional storage layer.
> >> > > > > But the whole stack we have is definitely worth more than the sum
> >> > > > > of all the parts IMO (speaking from my own experience from the past
> >> > > > > 10+ years of open source software dev).
> >> > > > >
> >> > > > > Here's what we have built so far.
> >> > > > >
> >> > > > > a) *table format* : something that stores table schema, a metadata
> >> > > > > table that stores file listing today, and being extended to store
> >> > > > > column ranges and more in the future (RFC-27)
> >> > > > > b) *aux metadata* : bloom filters, external record level indexes
> >> > > > > today, bitmaps/interval trees and other advanced on-disk data
> >> > > > > structures tomorrow
> >> > > > > c) *concurrency control* : we always supported MVCC based log based
> >> > > > > concurrency (serialize writes into a time ordered log), and we now
> >> > > > > also have OCC for batch merge workloads with 0.8.0. We will have
> >> > > > > multi-table and fully non-blocking writers soon (see future work
> >> > > > > section of RFC-22)
> >> > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> >> > > > > Hudi, but we support primary/unique key constraints and we could
> >> > > > > add foreign keys as an extension, once our transactions can span
> >> > > > > tables.
> >> > > > > e) *table services*: a hudi pipeline today is self-managing - sizes
> >> > > > > files, cleans, compacts, clusters data, bootstraps existing data -
> >> > > > > all these actions working off each other without blocking one
> >> > > > > another. (for most parts).
> >> > > > > f) *data services*: we also have higher level functionality with
> >> > > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar
> >> > > > > is coming, ...and more), incremental ETL support, de-duplication,
> >> > > > > commit callbacks, pre-commit validations are coming, error tables
> >> > > > > have been proposed. I could also envision us building towards
> >> > > > > streaming egress, data monitoring.
> >> > > > >
> >> > > > > I also think we should build the following (subject to separate
> >> > > > > DISCUSS/RFCs)
> >> > > > >
> >> > > > > g) *caching service*: Hudi specific caching service that can hold
> >> > > > > mutable data and serve oft-queried data across engines.
> >> > > > > h) *timeline metaserver:* We already run a metaserver in spark
> >> > > > > writer/drivers, backed by rocksDB & even Hudi's metadata table.
> >> > > > > Let's turn it into a scalable, sharded metastore, that all engines
> >> > > > > can use to obtain any metadata.
> >> > > > >
> >> > > > > To this end, I propose we rebrand to "*Data Lake Platform*" as
> >> > > > > opposed to "ingests & manages storage of large analytical datasets
> >> > > > > over DFS (hdfs or cloud stores)." and convey the scope of our
> >> > > > > vision, given we have already been building towards that. It would
> >> > > > > also provide new contributors a good lens to look at the project
> >> > > > > from.
> >> > > > >
> >> > > > > (This is very similar to, e.g., the evolution of Kafka from a
> >> > > > > pub-sub system to an event streaming platform, with the addition of
> >> > > > > MirrorMaker/Connect etc.)
> >> > > > >
> >> > > > > Please share your thoughts!
> >> > > > >
> >> > > > > Thanks
> >> > > > > Vinoth
> >> > > > >
> >> > > >
> >> > >
> >> >
> >>
> >
>
