From: Jaimin Shah
Date: Fri, 14 Jun 2019 20:34:34 +0530
Subject: Re: KEEP_LATEST_COMMIT vs KEEP_LATEST_VERSION
To: "dev@hudi.apache.org"

Hi,

I am also in favour of retaining the KEEP_LATEST_FILE_VERSIONS policy. I
suspect many people are using hudi as a solution to manage parquet which
is consumed by downstream tools. In my use case I don’t want to make any
change in consumer logic for downstream tools, so
KEEP_LATEST_FILE_VERSIONS with CLEANER_FILE_VERSIONS_RETAINED_PROP = "1"
works. Also, I can control when to start consuming data from downstream
jobs, so I don’t face issues with files being deleted while a query is
running, etc.

On Thursday, 13 June 2019, Vinoth Chandar wrote:

> Yes. We always keep at least one version out, since deleting it could
> fail the queries.
> Thanks for the feedback. Will not remove it then.
>
> We can work towards Impala support for your use-case, as a long term
> solution. And revisit later maybe.
>
> On Tue, Jun 11, 2019 at 9:54 PM Gary Li wrote:
>
> > Thanks, Vinoth. That's very helpful.
> >
> > When I use data consumers that don't support the hoodie format, I
> > have to use KEEP_LATEST_FILE_VERSIONS and
> > CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" to keep the parquet files
> > clean, as discussed in
> > https://github.com/apache/incubator-hudi/issues/715 . When I use
> > KEEP_LATEST_COMMITS with hoodie.cleaner.commits.retained = "1", I
> > will still have two versions of parquet files.
> >
> > Compared with running batch jobs, this actually makes my situation
> > much better. So I'd recommend not retiring
> > KEEP_LATEST_FILE_VERSIONS; some people might find it useful, as I do.
> >
> > Thanks!
> > Gary
> >
> > On Tue, Jun 11, 2019 at 9:20 AM Vinoth Chandar wrote:
> >
> > > Cool. So, the cleaning policy determines how we clean up older
> > > versions of file groups (simplistically, old parquet and log
> > > files) to bound storage growth.
> > >
> > > KEEP_LATEST_COMMITS (default) : Retains (does not delete) any file
> > > (slice) that was touched in the last X commits. The idea here is
> > > that you are able to pull the incremental changes worth up to X
> > > commits.
> > > KEEP_LATEST_FILE_VERSIONS : If you are not interested in
> > > incremental pull at all, you can choose to just retain X files
> > > (slices) per file group (i.e. files that share the same prefix)
> > > instead. This could result in fewer files in some cases.
> > >
> > > In practice, we always use KEEP_LATEST_COMMITS. I keep thinking
> > > about starting a discussion to retire KEEP_LATEST_FILE_VERSIONS,
> > > actually.
> > >
> > > Hope that helps.
> > >
> > > On Tue, Jun 11, 2019 at 9:05 AM Gary Li wrote:
> > >
> > > > Hello Vinoth,
> > > >
> > > > Yes, that’s what I mean.
> > > >
> > > > Thanks
> > > > Gary
> > > >
> > > > On Tue, Jun 11, 2019 at 9:03 AM Vinoth Chandar wrote:
> > > >
> > > > > Hi Gary,
> > > > >
> > > > > Do you mean cleaning policy? KEEP_LATEST_FILE_VERSIONS vs
> > > > > KEEP_LATEST_COMMITS?
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Mon, Jun 10, 2019 at 9:47 PM Gary Li wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I was a little confused when looking at the compaction
> > > > > > policy. What is the difference between KEEP_LATEST_COMMIT
> > > > > > vs KEEP_LATEST_VERSION? What is the exact definition of
> > > > > > "COMMIT" and "VERSION"?
> > > > > >
> > > > > > Thanks,
> > > > > > Gary
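
To make the difference between the two cleaning policies discussed in this thread concrete, here is a minimal Python sketch of a toy model. It is not Hudi's cleaner code: the function names, the list-of-commit-times representation of a file group, and the exact retention rule are assumptions reconstructed from the descriptions above (KEEP_LATEST_COMMITS keeps every version touched in the last X commits plus one older version so running queries do not fail; KEEP_LATEST_FILE_VERSIONS keeps the newest X versions per file group).

```python
from typing import List


def keep_latest_file_versions(version_commits: List[str],
                              versions_retained: int) -> List[str]:
    """Toy model (hypothetical helper): keep the newest N versions of one file group.

    `version_commits` holds the commit time at which each version was written.
    """
    if versions_retained < 1:
        raise ValueError("versions_retained must be >= 1")
    return sorted(version_commits)[-versions_retained:]


def keep_latest_commits(version_commits: List[str],
                        all_commits: List[str],
                        commits_retained: int) -> List[str]:
    """Toy model (hypothetical helper): keep versions from the last N commits,
    plus the newest version written before that window (assumed from the
    "we always keep at least one version" remark in the thread).
    """
    if commits_retained < 1:
        raise ValueError("commits_retained must be >= 1")
    ordered_commits = sorted(all_commits)
    ordered_versions = sorted(version_commits)
    if len(ordered_commits) <= commits_retained:
        # Not enough history yet: nothing is eligible for cleaning.
        return ordered_versions
    earliest_retained = ordered_commits[-commits_retained]
    recent = [c for c in ordered_versions if c >= earliest_retained]
    older = [c for c in ordered_versions if c < earliest_retained]
    # Keep the single newest version that predates the retained window.
    return older[-1:] + recent


# A file group that was rewritten in each of three commits.
commits = ["001", "002", "003"]
print(keep_latest_file_versions(commits, 1))     # ['003']
print(keep_latest_commits(commits, commits, 1))  # ['002', '003']
```

Under this model, retaining one commit still leaves two versions of the file on storage, which matches the behaviour Gary describes, while retaining one file version leaves exactly one.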