From: Vinoth Chandar
Date: Wed, 4 Aug 2021 10:36:05 -0700
Subject: Re: [DISCUSS] Hudi is the data lake platform
To: dev@hudi.apache.org
Cc: users@hudi.apache.org

Folks,

I have been digesting some feedback on what we show on the home page itself. While the blog explains the vision, it might be good to bubble up the sub-areas that are most relevant to our users today: transactions, updates, deletes. So I have raised a PR moving things around. Now we lead with "Hudi brings transactions, record-level updates/deletes and change streams to data lakes", then explain the platform at the next level of detail.

https://github.com/apache/hudi/pull/3406

On Mon, Aug 2, 2021 at 9:39 AM Vinoth Chandar wrote:

> Thanks! Will work on it this week.
> Also redoing some images based on feedback.
>
> On Fri, Jul 30, 2021 at 2:06 AM vino yang wrote:
>
>> +1
>>
>> Pratyaksh Sharma wrote on Fri, Jul 30, 2021 at 1:47 AM:
>>
>> > Guess we should rebrand Hudi in the README.md file as well -
>> > https://github.com/apache/hudi#readme
>> >
>> > That page still says the following:
>> >
>> > "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
>> > Incrementals. Hudi manages the storage of large analytical datasets on
>> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
>> >
>> > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar wrote:
>> >
>> >> Thanks Vino! Got a bunch of emoticons on the PR as well.
>> >>
>> >> Will land this Monday, giving it more time over the weekend as well.
>> >>
>> >> On Wed, Jul 21, 2021 at 7:36 PM vino yang wrote:
>> >>
>> >> > Thanks vc
>> >> >
>> >> > Very good blog, in-depth and forward-looking. Learned!
>> >> >
>> >> > Best,
>> >> > Vino
>> >> >
>> >> > Vinoth Chandar wrote on Thu, Jul 22, 2021 at 3:58 AM:
>> >> >
>> >> > > Expanding to users@ as well.
>> >> > >
>> >> > > Hi all,
>> >> > >
>> >> > > Since this discussion, I started to pen down a coherent strategy
>> >> > > and convey these ideas via a blog post.
>> >> > > I have also done my own research and talked to (ex-)colleagues I
>> >> > > respect to get their take and refine it.
>> >> > >
>> >> > > Here's a blog that hopefully explains this vision.
>> >> > >
>> >> > > https://github.com/apache/hudi/pull/3322
>> >> > >
>> >> > > Look forward to your feedback on the PR. We are hoping to land
>> >> > > this early next week, if everyone is aligned.
>> >> > >
>> >> > > Thanks
>> >> > > Vinoth
>> >> > >
>> >> > > On Wed, Apr 21, 2021 at 9:01 PM wei li wrote:
>> >> > >
>> >> > > > +1, cannot agree more.
>> >> > > > *aux metadata* and the metadata table can give Hudi large
>> >> > > > performance optimizations on the query end, and can keep being
>> >> > > > developed. A cache service may be a necessary component in a
>> >> > > > cloud-native environment.
>> >> > > >
>> >> > > > On 2021/04/13 05:29:55, Vinoth Chandar wrote:
>> >> > > > > Hello all,
>> >> > > > >
>> >> > > > > Reading one more article today, positioning Hudi as just a
>> >> > > > > table format, made me wonder if we have done enough justice
>> >> > > > > in explaining what we have built together here.
>> >> > > > > I tend to think of Hudi as the data lake platform, which has
>> >> > > > > the following components, of which one is a table format and
>> >> > > > > one is a transactional storage layer.
>> >> > > > > But the whole stack we have is definitely worth more than
>> >> > > > > the sum of all the parts, IMO (speaking from my own
>> >> > > > > experience from the past 10+ years of open source software
>> >> > > > > dev).
>> >> > > > >
>> >> > > > > Here's what we have built so far.
>> >> > > > >
>> >> > > > > a) *table format*: something that stores the table schema,
>> >> > > > > plus a metadata table that stores file listings today, and is
>> >> > > > > being extended to store column ranges and more in the future
>> >> > > > > (RFC-27)
>> >> > > > > b) *aux metadata*: bloom filters and external record-level
>> >> > > > > indexes today; bitmaps/interval trees and other advanced
>> >> > > > > on-disk data structures tomorrow
>> >> > > > > c) *concurrency control*: we have always supported MVCC,
>> >> > > > > log-based concurrency (serializing writes into a time-ordered
>> >> > > > > log), and with 0.8.0 we now also have OCC for batch merge
>> >> > > > > workloads. We will have multi-table and fully non-blocking
>> >> > > > > writers soon (see the future work section of RFC-22)
>> >> > > > > d) *updates/deletes*: this is the bread-and-butter use case
>> >> > > > > for Hudi. We support primary/unique key constraints, and we
>> >> > > > > could add foreign keys as an extension once our transactions
>> >> > > > > can span tables.
>> >> > > > > e) *table services*: a Hudi pipeline today is self-managing -
>> >> > > > > it sizes files, cleans, compacts, clusters data, and
>> >> > > > > bootstraps existing data, with all these actions working off
>> >> > > > > each other without blocking one another (for the most part).
>> >> > > > > f) *data services*: we also have higher-level functionality,
>> >> > > > > with deltastreamer sources (a scalable DFS listing source,
>> >> > > > > Kafka, Pulsar is coming, ...and more), incremental ETL
>> >> > > > > support, de-duplication, and commit callbacks; pre-commit
>> >> > > > > validations are coming, and error tables have been proposed.
>> >> > > > > I could also envision us building towards streaming egress
>> >> > > > > and data monitoring.
>> >> > > > >
>> >> > > > > I also think we should build the following (subject to
>> >> > > > > separate DISCUSS threads/RFCs):
>> >> > > > >
>> >> > > > > g) *caching service*: a Hudi-specific caching service that
>> >> > > > > can hold mutable data and serve oft-queried data across
>> >> > > > > engines.
>> >> > > > > h) *timeline metaserver*: we already run a metaserver in
>> >> > > > > Spark writers/drivers, backed by RocksDB and even Hudi's
>> >> > > > > metadata table. Let's turn it into a scalable, sharded
>> >> > > > > metastore that all engines can use to obtain any metadata.
>> >> > > > >
>> >> > > > > To this end, I propose we rebrand to "*Data Lake Platform*",
>> >> > > > > as opposed to "ingests & manages storage of large analytical
>> >> > > > > datasets over DFS (hdfs or cloud stores)", and convey the
>> >> > > > > scope of our vision, given we have already been building
>> >> > > > > towards that. It would also give new contributors a good lens
>> >> > > > > to look at the project through.
>> >> > > > >
>> >> > > > > (This is very similar to, e.g., the evolution of Kafka from a
>> >> > > > > pub-sub system to an event streaming platform, with the
>> >> > > > > addition of MirrorMaker/Connect etc.)
>> >> > > > >
>> >> > > > > Please share your thoughts!
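[Editor's note] The log-based MVCC idea in point (c) of the proposal - writers serializing commits into a time-ordered log, readers consuming consistent snapshots - can be sketched as a toy model. This is an illustrative simplification, not Hudi's actual timeline code; all names and structures below are made up:

```python
from dataclasses import dataclass, field

@dataclass
class Timeline:
    """Toy commit timeline: an append-only, time-ordered log of commits."""
    commits: list = field(default_factory=list)  # list of (ts, {key: value})

    def commit(self, ts, updates):
        # Writes are serialized: each commit's timestamp must be strictly
        # later than the previous one, so the log stays time-ordered.
        assert not self.commits or ts > self.commits[-1][0], "log must be time-ordered"
        self.commits.append((ts, dict(updates)))

    def snapshot(self, as_of):
        # MVCC-style read: replay every commit up to and including `as_of`
        # to materialize the table state a reader at that time would see.
        state = {}
        for ts, updates in self.commits:
            if ts > as_of:
                break
            state.update(updates)
        return state

tl = Timeline()
tl.commit(1, {"k1": "a"})
tl.commit(2, {"k1": "b", "k2": "x"})
```

A reader pinned to commit time 1 keeps seeing `{"k1": "a"}` even after commit 2 lands, which is the sense in which readers and the single serialized writer never block each other.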
>> >> > > > >
>> >> > > > > Thanks
>> >> > > > > Vinoth
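[Editor's note] Point (b) in the thread above lists bloom filters as auxiliary metadata. The way a per-file bloom filter lets an upsert skip data files that cannot contain a record key can be shown with a toy sketch; this is a simplified illustration rather than Hudi's implementation, and the file names and keys are hypothetical:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash probes into a fixed-size bit array.
    Never gives false negatives; false positives are possible but rare."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _probes(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._probes(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._probes(key))

# Hypothetical layout: one filter per data file, built from the record
# keys stored in that file.
files = {
    "file-1.parquet": ["uuid-001", "uuid-002"],
    "file-2.parquet": ["uuid-003", "uuid-004"],
}
file_filters = {}
for file_name, keys in files.items():
    bf = BloomFilter()
    for key in keys:
        bf.add(key)
    file_filters[file_name] = bf

def candidate_files(record_key):
    """Files an upsert must actually read for this key; every other
    file can be skipped without touching its data."""
    return [f for f, bf in file_filters.items() if bf.might_contain(record_key)]
```

Because a Bloom filter has no false negatives, `candidate_files` is guaranteed to include the file that really holds the key; it may occasionally include an extra file, which reading that file's keys then rules out.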