From: Vinoth Chandar
Date: Wed, 4 Aug 2021 10:36:05 -0700
Subject: Re: [DISCUSS] Hudi is the data lake platform
To: dev@hudi.apache.org
Cc: users@hudi.apache.org

Folks,

I have been digesting some feedback on what we show on the home page itself. While the blog explains the vision, it might be good to bubble up the sub-areas that are most relevant to our users today: transactions, updates, deletes. So I have raised a PR moving things around. Now we lead with "Hudi brings transactions, record-level updates/deletes and change streams to data lakes", then explain the platform at the next level of detail.

https://github.com/apache/hudi/pull/3406

On Mon, Aug 2, 2021 at 9:39 AM Vinoth Chandar wrote:

> Thanks! Will work on it this week.
> Also redoing some images based on feedback.
>
> On Fri, Jul 30, 2021 at 2:06 AM vino yang wrote:
>
>> +1
>>
>> Pratyaksh Sharma wrote on Fri, Jul 30, 2021 at 1:47 AM:
>>
>> > Guess we should rebrand Hudi in the README.md file as well -
>> > https://github.com/apache/hudi#readme
>> >
>> > That page still says the following:
>> >
>> > "Apache Hudi (pronounced Hoodie) stands for Hadoop Upserts Deletes and
>> > Incrementals. Hudi manages the storage of large analytical datasets on
>> > DFS (Cloud stores, HDFS or any Hadoop FileSystem compatible storage)."
>> >
>> > On Sat, Jul 24, 2021 at 6:31 AM Vinoth Chandar wrote:
>> >
>> >> Thanks Vino! Got a bunch of emoticons on the PR as well.
>> >>
>> >> Will land this Monday, giving it more time over the weekend as well.
>> >>
>> >> On Wed, Jul 21, 2021 at 7:36 PM vino yang wrote:
>> >>
>> >> > Thanks vc
>> >> >
>> >> > Very good blog, in-depth and forward-looking. Learned!
>> >> >
>> >> > Best,
>> >> > Vino
>> >> >
>> >> > Vinoth Chandar wrote on Thu, Jul 22, 2021 at 3:58 AM:
>> >> >
>> >> > > Expanding to users@ as well.
>> >> > >
>> >> > > Hi all,
>> >> > >
>> >> > > Since this discussion, I started to pen down a coherent strategy
>> >> > > and convey these ideas via a blog post.
>> >> > > I have also done my own research and talked to (ex-)colleagues I
>> >> > > respect to get their take and refine it.
>> >> > >
>> >> > > Here's a blog that hopefully explains this vision.
>> >> > >
>> >> > > https://github.com/apache/hudi/pull/3322
>> >> > >
>> >> > > Look forward to your feedback on the PR. We are hoping to land
>> >> > > this early next week, if everyone is aligned.
>> >> > >
>> >> > > Thanks
>> >> > > Vinoth
>> >> > >
>> >> > > On Wed, Apr 21, 2021 at 9:01 PM wei li wrote:
>> >> > >
>> >> > > > +1, cannot agree more.
>> >> > > > *aux metadata* and the metadata table can give Hudi large
>> >> > > > performance optimizations on the query end, and can keep being
>> >> > > > developed. A cache service may be a necessary component in a
>> >> > > > cloud-native environment.
>> >> > > >
>> >> > > > On 2021/04/13 05:29:55, Vinoth Chandar wrote:
>> >> > > > > Hello all,
>> >> > > > >
>> >> > > > > Reading one more article today, positioning Hudi as just a
>> >> > > > > table format, made me wonder if we have done enough justice
>> >> > > > > in explaining what we have built together here.
>> >> > > > > I tend to think of Hudi as the data lake platform, which has
>> >> > > > > the following components, of which one is a table format and
>> >> > > > > one is a transactional storage layer.
>> >> > > > > But the whole stack we have is definitely worth more than
>> >> > > > > the sum of all the parts, IMO (speaking from my own
>> >> > > > > experience from the past 10+ years of open source software
>> >> > > > > dev).
>> >> > > > >
>> >> > > > > Here's what we have built so far.
>> >> > > > >
>> >> > > > > a) *table format*: something that stores the table schema,
>> >> > > > > plus a metadata table that stores file listings today, and is
>> >> > > > > being extended to store column ranges and more in the future
>> >> > > > > (RFC-27)
>> >> > > > > b) *aux metadata*: bloom filters and external record-level
>> >> > > > > indexes today; bitmaps/interval trees and other advanced
>> >> > > > > on-disk data structures tomorrow
>> >> > > > > c) *concurrency control*: we have always supported MVCC,
>> >> > > > > log-based concurrency (serializing writes into a time-ordered
>> >> > > > > log), and with 0.8.0 we now also have OCC for batch merge
>> >> > > > > workloads. We will have multi-table and fully non-blocking
>> >> > > > > writers soon (see the future work section of RFC-22)
>> >> > > > > d) *updates/deletes*: this is the bread-and-butter use case
>> >> > > > > for Hudi. We support primary/unique key constraints, and we
>> >> > > > > could add foreign keys as an extension once our transactions
>> >> > > > > can span tables.
>> >> > > > > e) *table services*: a Hudi pipeline today is self-managing -
>> >> > > > > it sizes files, cleans, compacts, clusters data, and
>> >> > > > > bootstraps existing data, with all these actions working off
>> >> > > > > each other without blocking one another (for the most part).
>> >> > > > > f) *data services*: we also have higher-level functionality,
>> >> > > > > with deltastreamer sources (a scalable DFS listing source,
>> >> > > > > Kafka, Pulsar is coming, ...and more), incremental ETL
>> >> > > > > support, de-duplication, and commit callbacks; pre-commit
>> >> > > > > validations are coming, and error tables have been proposed.
>> >> > > > > I could also envision us building towards streaming egress
>> >> > > > > and data monitoring.
>> >> > > > >
>> >> > > > > I also think we should build the following (subject to
>> >> > > > > separate DISCUSS threads/RFCs):
>> >> > > > >
>> >> > > > > g) *caching service*: a Hudi-specific caching service that
>> >> > > > > can hold mutable data and serve oft-queried data across
>> >> > > > > engines.
>> >> > > > > h) *timeline metaserver*: we already run a metaserver in
>> >> > > > > Spark writers/drivers, backed by RocksDB and even Hudi's
>> >> > > > > metadata table. Let's turn it into a scalable, sharded
>> >> > > > > metastore that all engines can use to obtain any metadata.
>> >> > > > >
>> >> > > > > To this end, I propose we rebrand to "*Data Lake Platform*",
>> >> > > > > as opposed to "ingests & manages storage of large analytical
>> >> > > > > datasets over DFS (hdfs or cloud stores)", and convey the
>> >> > > > > scope of our vision, given we have already been building
>> >> > > > > towards that. It would also give new contributors a good lens
>> >> > > > > to look at the project through.
>> >> > > > >
>> >> > > > > (This is very similar to, e.g., the evolution of Kafka from a
>> >> > > > > pub-sub system to an event streaming platform, with the
>> >> > > > > addition of MirrorMaker/Connect etc.)
>> >> > > > >
>> >> > > > > Please share your thoughts!
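[Editor's note] The log-based MVCC idea in point (c) of the proposal - writers serializing commits into a time-ordered log, readers consuming consistent snapshots - can be sketched as a toy model. This is an illustrative simplification, not Hudi's actual timeline code; all names and structures below are made up:

```python
from dataclasses import dataclass, field

@dataclass
class Timeline:
    """Toy commit timeline: an append-only, time-ordered log of commits."""
    commits: list = field(default_factory=list)  # list of (ts, {key: value})

    def commit(self, ts, updates):
        # Writes are serialized: each commit's timestamp must be strictly
        # later than the previous one, so the log stays time-ordered.
        assert not self.commits or ts > self.commits[-1][0], "log must be time-ordered"
        self.commits.append((ts, dict(updates)))

    def snapshot(self, as_of):
        # MVCC-style read: replay every commit up to and including `as_of`
        # to materialize the table state a reader at that time would see.
        state = {}
        for ts, updates in self.commits:
            if ts > as_of:
                break
            state.update(updates)
        return state

tl = Timeline()
tl.commit(1, {"k1": "a"})
tl.commit(2, {"k1": "b", "k2": "x"})
```

A reader pinned to commit time 1 keeps seeing `{"k1": "a"}` even after commit 2 lands, which is the sense in which readers and the single serialized writer never block each other.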
>> >> > > > >
>> >> > > > > Thanks
>> >> > > > > Vinoth
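[Editor's note] Point (b) in the thread above lists bloom filters as auxiliary metadata. The way a per-file bloom filter lets an upsert skip data files that cannot contain a record key can be shown with a toy sketch; this is a simplified illustration rather than Hudi's implementation, and the file names and keys are hypothetical:

```python
import hashlib

class BloomFilter:
    """Toy Bloom filter: k hash probes into a fixed-size bit array.
    Never gives false negatives; false positives are possible but rare."""

    def __init__(self, size_bits=1024, num_hashes=3):
        self.size = size_bits
        self.num_hashes = num_hashes
        self.bits = [False] * size_bits

    def _probes(self, key):
        # Derive k bit positions from salted SHA-256 digests of the key.
        for salt in range(self.num_hashes):
            digest = hashlib.sha256(f"{salt}:{key}".encode()).hexdigest()
            yield int(digest, 16) % self.size

    def add(self, key):
        for pos in self._probes(key):
            self.bits[pos] = True

    def might_contain(self, key):
        return all(self.bits[pos] for pos in self._probes(key))

# Hypothetical layout: one filter per data file, built from the record
# keys stored in that file.
files = {
    "file-1.parquet": ["uuid-001", "uuid-002"],
    "file-2.parquet": ["uuid-003", "uuid-004"],
}
file_filters = {}
for file_name, keys in files.items():
    bf = BloomFilter()
    for key in keys:
        bf.add(key)
    file_filters[file_name] = bf

def candidate_files(record_key):
    """Files an upsert must actually read for this key; every other
    file can be skipped without touching its data."""
    return [f for f, bf in file_filters.items() if bf.might_contain(record_key)]
```

Because a Bloom filter has no false negatives, `candidate_files` is guaranteed to include the file that really holds the key; it may occasionally include an extra file, which reading that file's keys then rules out.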