From: vino yang
Date: Thu, 22 Jul 2021 10:36:30 +0800
Subject: Re: [DISCUSS] Hudi is the data lake platform
To: dev
Cc: users@hudi.apache.org

Thanks vc

Very good blog, in-depth and forward-looking. Learned!

Best,
Vino

Vinoth Chandar wrote on Thu, Jul 22, 2021 at 3:58 AM:

> Expanding to users@ as well.
>
> Hi all,
>
> Since this discussion, I started to pen down a coherent strategy and convey
> these ideas via a blog post.
> I have also done my own research, talked to (ex)colleagues I respect to get
> their take and refine it.
>
> Here's a blog that hopefully explains this vision.
>
> https://github.com/apache/hudi/pull/3322
>
> Look forward to your feedback on the PR. We are hoping to land this early
> next week, if everyone is aligned.
>
> Thanks
> Vinoth
>
> On Wed, Apr 21, 2021 at 9:01 PM wei li wrote:
>
> > +1, cannot agree more.
> > *aux metadata* and metatable can give Hudi large performance
> > optimizations on the query end.
> > This can continue to be developed.
> > A cache service may be a necessary component in a cloud-native environment.
> >
> > On 2021/04/13 05:29:55, Vinoth Chandar wrote:
> > > Hello all,
> > >
> > > Reading one more article today, positioning Hudi as just a table format,
> > > made me wonder if we have done enough justice in explaining what we have
> > > built together here.
> > > I tend to think of Hudi as the data lake platform, which has the following
> > > components, of which one is a table format and one is a transactional
> > > storage layer.
> > > But the whole stack we have is definitely worth more than the sum of all
> > > the parts IMO (speaking from my own experience from the past 10+ years of
> > > open source software dev).
> > >
> > > Here's what we have built so far.
> > >
> > > a) *table format*: something that stores table schema, a metadata table
> > > that stores file listings today and is being extended to store column ranges
> > > and more in the future (RFC-27)
> > > b) *aux metadata*: bloom filters, external record level indexes today,
> > > bitmaps/interval trees and other advanced on-disk data structures tomorrow
> > > c) *concurrency control*: we always supported MVCC-based, log-based
> > > concurrency (serialize writes into a time-ordered log), and we now also
> > > have OCC for batch merge workloads with 0.8.0. We will have multi-table and
> > > fully non-blocking writers soon (see future work section of RFC-22)
> > > d) *updates/deletes*: this is the bread-and-butter use-case for Hudi, but
> > > we support primary/unique key constraints and we could add foreign keys as
> > > an extension, once our transactions can span tables.
> > > e) *table services*: a hudi pipeline today is self-managing - sizes files,
> > > cleans, compacts, clusters data, bootstraps existing data - all these
> > > actions working off each other without blocking one another (for the most
> > > part).
> > > f) *data services*: we also have higher level functionality with
> > > deltastreamer sources (scalable DFS listing source, Kafka, Pulsar is
> > > coming, ...and more), incremental ETL support, de-duplication, commit
> > > callbacks, pre-commit validations are coming, error tables have been
> > > proposed. I could also envision us building towards streaming egress, data
> > > monitoring.
> > >
> > > I also think we should build the following (subject to separate
> > > DISCUSS/RFCs)
> > >
> > > g) *caching service*: Hudi-specific caching service that can hold mutable
> > > data and serve oft-queried data across engines.
> > > h) *timeline metaserver*: We already run a metaserver in spark
> > > writer/drivers, backed by rocksDB & even Hudi's metadata table. Let's turn
> > > it into a scalable, sharded metastore that all engines can use to obtain
> > > any metadata.
> > >
> > > To this end, I propose we rebrand to "*Data Lake Platform*" as opposed to
> > > "ingests & manages storage of large analytical datasets over DFS (hdfs or
> > > cloud stores)." and convey the scope of our vision,
> > > given we have already been building towards that. It would also provide new
> > > contributors a good lens to look at the project from.
> > >
> > > (This is very similar to, e.g., the evolution of Kafka from a pub-sub
> > > system to an event streaming platform, with the addition of
> > > MirrorMaker/Connect etc.)
> > >
> > > Please share your thoughts!
> > >
> > > Thanks
> > > Vinoth
> > >