hudi-dev mailing list archives

From Vinoth Chandar <vin...@apache.org>
Subject Re: [DISCUSS] Hudi is the data lake platform
Date Sat, 24 Jul 2021 01:01:35 GMT
Thanks Vino! Got a bunch of emoticons on the PR as well.

Will land this Monday, giving it more time over the weekend as well.


On Wed, Jul 21, 2021 at 7:36 PM vino yang <yanghua1127@gmail.com> wrote:

> Thanks vc
>
> Very good blog, in-depth and forward-looking. Learned!
>
> Best,
> Vino
>
> Vinoth Chandar <vinoth@apache.org> wrote on Thu, Jul 22, 2021 at 3:58 AM:
>
> > Expanding to users@ as well.
> >
> > Hi all,
> >
> > Since this discussion, I started to pen down a coherent strategy and convey
> > these ideas via a blog post.
> > I have also done my own research, talked to (ex)colleagues I respect to get
> > their take and refine it.
> >
> > Here's a blog that hopefully explains this vision.
> >
> > https://github.com/apache/hudi/pull/3322
> >
> > Look forward to your feedback on the PR. We are hoping to land this early
> > next week, if everyone is aligned.
> >
> > Thanks
> > Vinoth
> >
> > On Wed, Apr 21, 2021 at 9:01 PM wei li <lw309637554@gmail.com> wrote:
> >
> > > +1, cannot agree more.
> > > *aux metadata* and the metadata table can give Hudi large performance
> > > optimizations on the query side.
> > > We can continue to develop these.
> > > A cache service may be a necessary component in a cloud-native environment.
> > >
> > > On 2021/04/13 05:29:55, Vinoth Chandar <vinoth@apache.org> wrote:
> > > > Hello all,
> > > >
> > > > Reading one more article today, positioning Hudi as just a table format,
> > > > made me wonder if we have done enough justice in explaining what we have
> > > > built together here.
> > > > I tend to think of Hudi as the data lake platform, which has the following
> > > > components, of which one is a table format and one is a transactional
> > > > storage layer.
> > > > But the whole stack we have is definitely worth more than the sum of all
> > > > the parts IMO (speaking from my own experience from the past 10+ years of
> > > > open source software dev).
> > > >
> > > > Here's what we have built so far.
> > > >
> > > > a) *table format* : something that stores table schema, a metadata table
> > > > that stores file listing today, and being extended to store column ranges
> > > > and more in the future (RFC-27)
> > > > b) *aux metadata* : bloom filters, external record level indexes today,
> > > > bitmaps/interval trees and other advanced on-disk data structures tomorrow
> > > > c) *concurrency control* : we always supported MVCC based log based
> > > > concurrency (serialize writes into a time ordered log), and we now also
> > > > have OCC for batch merge workloads with 0.8.0. We will have multi-table and
> > > > fully non-blocking writers soon (see future work section of RFC-22)
> > > > d) *updates/deletes* : this is the bread-and-butter use-case for Hudi, but
> > > > we support primary/unique key constraints and we could add foreign keys as
> > > > an extension, once our transactions can span tables.
> > > > e) *table services*: a hudi pipeline today is self-managing - sizes files,
> > > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > > actions working off each other without blocking one another (for the most
> > > > part).
> > > > f) *data services*: we also have higher level functionality with
> > > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > > callbacks, pre-commit validations are coming, error tables have been
> > > > proposed. I could also envision us building towards streaming egress, data
> > > > monitoring.
> > > >
> > > > I also think we should build the following (subject to separate
> > > > DISCUSS/RFCs)
> > > >
> > > > g) *caching service*: Hudi specific caching service that can hold mutable
> > > > data and serve oft-queried data across engines.
> > > > h) *timeline metaserver*: We already run a metaserver in spark
> > > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's turn
> > > > it into a scalable, sharded metastore, that all engines can use to obtain
> > > > any metadata.
> > > >
> > > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to
> > > > "ingests & manages storage of large analytical datasets over DFS (hdfs or
> > > > cloud stores)." and convey the scope of our vision,
> > > > given we have already been building towards that. It would also provide new
> > > > contributors a good lens to look at the project from.
> > > >
> > > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > > > system to an event streaming platform, with the addition of
> > > > MirrorMaker/Connect etc.)
> > > >
> > > > Please share your thoughts!
> > > >
> > > > Thanks
> > > > Vinoth
> > > >
> > >
> >
>
