hudi-dev mailing list archives

From Pratyaksh Sharma <pratyaks...@gmail.com>
Subject Re: [DISCUSS] Hudi is the data lake platform
Date Thu, 29 Jul 2021 17:47:17 GMT
Guess we should rebrand Hudi in the README.md file as well -
https://github.com/apache/hudi#readme?

This page still mentions the following -

"Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
Incrementals. Hudi manages the storage of large analytical datasets on DFS
(Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."

On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar <vinoth@apache.org> wrote:

> Thanks Vino! Got a bunch of emoticons on the PR as well.
>
> Will land this Monday, giving it more time over the weekend as well.
>
>
> On Wed, Jul 21, 2021 at 7:36 PM vino yang <yanghua1127@gmail.com> wrote:
>
> > Thanks vc
> >
> > Very good blog - in-depth and forward-looking. I learned a lot!
> >
> > Best,
> > Vino
> >
> > Vinoth Chandar <vinoth@apache.org> wrote on Thu, Jul 22, 2021 at 3:58 AM:
> >
> > > Expanding to users@ as well.
> > >
> > > Hi all,
> > >
> > > Since this discussion, I have started to pen down a coherent strategy
> > > and convey these ideas via a blog post.
> > > I have also done my own research and talked to (ex-)colleagues I
> > > respect, to get their take and refine the ideas.
> > >
> > > Here's a blog that hopefully explains this vision.
> > >
> > > https://github.com/apache/hudi/pull/3322
> > >
> > > Looking forward to your feedback on the PR. We are hoping to land this
> > > early next week, if everyone is aligned.
> > >
> > > Thanks
> > > Vinoth
> > >
> > > On Wed, Apr 21, 2021 at 9:01 PM wei li <lw309637554@gmail.com> wrote:
> > >
> > > > +1, could not agree more.
> > > > The *aux metadata* and metadata table can give Hudi large performance
> > > > optimizations on the query side, and can be developed continuously.
> > > > A caching service may be a necessary component in cloud-native
> > > > environments.
> > > >
> > > > On 2021/04/13 05:29:55, Vinoth Chandar <vinoth@apache.org> wrote:
> > > > > Hello all,
> > > > >
> > > > > Reading one more article today positioning Hudi as just a table
> > > > > format made me wonder if we have done enough justice in explaining
> > > > > what we have built together here.
> > > > > I tend to think of Hudi as the data lake platform, which has the
> > > > > following components - of which one is a table format, and one is a
> > > > > transactional storage layer. But the whole stack we have is
> > > > > definitely worth more than the sum of its parts IMO (speaking from
> > > > > my own experience over the past 10+ years of open source software
> > > > > development).
> > > > >
> > > > > Here's what we have built so far.
> > > > >
> > > > > a) *table format* : something that stores the table schema, plus a
> > > > > metadata table that stores file listings today and is being
> > > > > extended to store column ranges and more in the future (RFC-27)
> > > > > b) *aux metadata* : bloom filters and external record-level
> > > > > indexes today; bitmaps/interval trees and other advanced on-disk
> > > > > data structures tomorrow
> > > > > c) *concurrency control* : we have always supported MVCC-based,
> > > > > log-based concurrency (serializing writes into a time-ordered
> > > > > log), and as of 0.8.0 we also have OCC for batch merge workloads.
> > > > > We will have multi-table and fully non-blocking writers soon (see
> > > > > the future work section of RFC-22)
> > > > > d) *updates/deletes* : this is the bread-and-butter use-case for
> > > > > Hudi; we support primary/unique key constraints, and we could add
> > > > > foreign keys as an extension once our transactions can span
> > > > > tables.
> > > > > e) *table services* : a Hudi pipeline today is self-managing - it
> > > > > sizes files, cleans, compacts, clusters data, and bootstraps
> > > > > existing data, with all these actions working off each other
> > > > > without blocking one another (for the most part).
> > > > > f) *data services* : we also have higher-level functionality with
> > > > > deltastreamer sources (a scalable DFS listing source, Kafka,
> > > > > Pulsar is coming, ...and more), incremental ETL support,
> > > > > de-duplication, and commit callbacks; pre-commit validations are
> > > > > coming, and error tables have been proposed. I could also envision
> > > > > us building towards streaming egress and data monitoring.
> > > > >
> > > > > I also think we should build the following (subject to separate
> > > > > DISCUSS/RFCs)
> > > > >
> > > > > g) *caching service* : a Hudi-specific caching service that can
> > > > > hold mutable data and serve oft-queried data across engines.
> > > > > h) *timeline metaserver* : we already run a metaserver in Spark
> > > > > writers/drivers, backed by RocksDB and even Hudi's metadata table.
> > > > > Let's turn it into a scalable, sharded metastore that all engines
> > > > > can use to obtain any metadata.
> > > > >
> > > > > To this end, I propose we rebrand to "*Data Lake Platform*", as
> > > > > opposed to "ingests & manages storage of large analytical datasets
> > > > > over DFS (hdfs or cloud stores)", and convey the scope of our
> > > > > vision, given we have already been building towards that. It would
> > > > > also give new contributors a good lens through which to view the
> > > > > project.
> > > > >
> > > > > (This is very similar to, for example, the evolution of Kafka
> > > > > from a pub-sub system to an event streaming platform, with the
> > > > > addition of MirrorMaker/Connect etc.)
> > > > >
> > > > > Please share your thoughts!
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > >
> > >
> >
>
