hudi-dev mailing list archives

From Jaimin Shah <shahjaimin0...@gmail.com>
Subject Re: KEEP_LATEST_COMMIT vs KEEP_LATEST_VERSION
Date Fri, 14 Jun 2019 15:04:34 GMT
Hi

I am also in favour of retaining the KEEP_LATEST_FILE_VERSIONS policy.

I suspect many people are using Hudi to manage Parquet files that are
consumed by downstream tools. In my use case I don't want to change the
consumer logic of those downstream tools, so KEEP_LATEST_FILE_VERSIONS
with CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" works.
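
As a rough sketch, that setup could be expressed as a set of writer options.
CLEANER_FILE_VERSIONS_RETAINED_PROP is the constant named in this thread; the
string keys below are my assumption of what the constants map to, and
`cleaner_options` is just a hypothetical name, so please check them against
your Hudi version.

```python
# Hypothetical options dict for a Hudi writer; the key names are assumptions.
cleaner_options = {
    # Select the cleaning policy that retains a fixed number of file versions.
    "hoodie.cleaner.policy": "KEEP_LATEST_FILE_VERSIONS",
    # Assumed string key behind CLEANER_FILE_VERSIONS_RETAINED_PROP.
    "hoodie.cleaner.fileversions.retained": "1",
}

# Print the options in a properties-file style for inspection.
for key, value in sorted(cleaner_options.items()):
    print(f"{key}={value}")
```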

Also, I can control when downstream jobs start consuming data, so I don't
run into issues such as files being deleted while a query is running.
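
To illustrate how the two policies explained further down this thread differ,
here is a toy model in Python. It is only a sketch based on the descriptions
in the thread, not Hudi's actual cleaner code: file groups are modelled as
lists of commit numbers, the function names are hypothetical, and the rule
that KEEP_LATEST_COMMITS also keeps one version older than the retained
commits is my reading of "we always keep at least one version".

```python
from typing import Dict, List

# Toy model: file group id -> ascending commit numbers at which a new
# version (file slice) of that group was written.
FileGroups = Dict[str, List[int]]


def keep_latest_file_versions(groups: FileGroups, versions_retained: int) -> FileGroups:
    """Keep only the newest `versions_retained` versions of every file group."""
    if versions_retained < 1:
        raise ValueError("versions_retained must be at least 1")
    return {fg: commits[-versions_retained:] for fg, commits in groups.items()}


def keep_latest_commits(
    groups: FileGroups, all_commits: List[int], commits_retained: int
) -> FileGroups:
    """Keep versions written in the last `commits_retained` commits, plus the
    newest version written before them (assumed to protect running queries)."""
    if commits_retained < 1 or not all_commits:
        raise ValueError("need at least one commit and commits_retained >= 1")
    earliest_retained = all_commits[max(0, len(all_commits) - commits_retained)]
    kept: FileGroups = {}
    for fg, commits in groups.items():
        recent = [c for c in commits if c >= earliest_retained]
        older = [c for c in commits if c < earliest_retained]
        kept[fg] = older[-1:] + recent  # one older version survives, if any
    return kept


# Three commits; fg-a is rewritten by every commit, fg-b only by the first.
example_groups = {"fg-a": [1, 2, 3], "fg-b": [1]}

by_versions = keep_latest_file_versions(example_groups, 1)
by_commits = keep_latest_commits(example_groups, [1, 2, 3], 1)

print(by_versions)  # {'fg-a': [3], 'fg-b': [1]}
print(by_commits)   # {'fg-a': [2, 3], 'fg-b': [1]}
```

In this model, retaining one commit still leaves two versions of fg-a, which
matches what Gary reported, while retaining one file version leaves exactly
one file per group.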


On Thursday, 13 June 2019, Vinoth Chandar <vinoth@apache.org> wrote:

> Yes, we always keep at least one version, since deleting it could fail
> the queries.
> Thanks for the feedback. Will not remove it then.
>
> We can work towards Impala support for your use case as a long-term
> solution, and maybe revisit this later.
>
> On Tue, Jun 11, 2019 at 9:54 PM Gary Li <yanjia.gary.li@gmail.com> wrote:
>
> > Thanks, Vinoth. That's very helpful.
> >
> > When I was using data consumers that don't support the hoodie format, I
> > had to use KEEP_LATEST_FILE_VERSIONS and
> > CLEANER_FILE_VERSIONS_RETAINED_PROP = "1" to keep the parquet files
> > clean, as discussed in
> > https://github.com/apache/incubator-hudi/issues/715 . When I use
> > KEEP_LATEST_COMMITS with hoodie.cleaner.commits.retained = "1", I will
> > still have two versions of parquet files.
> >
> > Compared with running batch jobs, this approach actually makes my
> > situation much better. So I'd recommend not retiring
> > KEEP_LATEST_FILE_VERSIONS; other people might find it useful, as I do.
> >
> > Thanks!
> > Gary
> >
> >
> > On Tue, Jun 11, 2019 at 9:20 AM Vinoth Chandar <vinoth@apache.org> wrote:
> >
> > > Cool. So, the cleaning policy determines how we clean up older
> > > versions of file groups (simplistically, old parquet and log files)
> > > to bound storage growth.
> > >
> > > KEEP_LATEST_COMMITS (default): Retains (does not delete) any file
> > > (slice) that was touched in the last X commits. The idea here is that
> > > you are able to pull the incremental changes worth up to X commits.
> > > KEEP_LATEST_FILE_VERSIONS: If you are not interested in incremental
> > > pull at all, you can choose to just retain X files (slices) per file
> > > group (i.e. files that share the same prefix) instead. This could
> > > result in fewer files in some cases.
> > >
> > > In practice, we always use KEEP_LATEST_COMMITS. I actually keep
> > > thinking about starting a discussion to retire
> > > KEEP_LATEST_FILE_VERSIONS.
> > >
> > > Hope that helps.
> > >
> > > On Tue, Jun 11, 2019 at 9:05 AM Gary Li <yanjia.gary.li@gmail.com> wrote:
> > >
> > > > Hello Vinoth,
> > > >
> > > > Yes, that’s what I mean.
> > > >
> > > > Thanks
> > > > Gary
> > > >
> > > > On Tue, Jun 11, 2019 at 9:03 AM Vinoth Chandar <vinoth@apache.org> wrote:
> > > >
> > > > > Hi Gary,
> > > > >
> > > > > Do you mean the cleaning policy? KEEP_LATEST_FILE_VERSIONS vs
> > > > > KEEP_LATEST_COMMITS?
> > > > >
> > > > > Thanks
> > > > > Vinoth
> > > > >
> > > > > On Mon, Jun 10, 2019 at 9:47 PM Gary Li <yanjia.gary.li@gmail.com> wrote:
> > > > >
> > > > > > Hello,
> > > > > >
> > > > > > I was a little confused when looking at the compaction policy.
> > > > > > What is the difference between KEEP_LATEST_COMMIT and
> > > > > > KEEP_LATEST_VERSION? What is the exact definition of "COMMIT"
> > > > > > and "VERSION"?
> > > > > >
> > > > > > Thanks,
> > > > > > Gary
> > > > > >
> > > > >
> > > >
> > >
> >
>
