flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: S3 recovery and checkpoint directories exhibit explosive growth
Date Fri, 14 Jul 2017 16:31:27 GMT

I am looping in Stefan and Xiaogang, who worked a lot on incremental
checkpointing.

Some background on incremental checkpoints: Incremental checkpoints store
"pieces" of the state (RocksDB SSTables) that are shared between
checkpoints. Hence they naturally use more files than non-incremental
checkpoints.
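
As a rough illustration of the sharing described above, here is a toy model in Python. It is a sketch under stated assumptions, not Flink's actual code: the class name SharedFileRegistry, its methods, and the file names are all hypothetical. It assumes each checkpoint lists the files it references, and that a shared file can only be removed once no retained checkpoint references it.

```python
from collections import Counter


class SharedFileRegistry:
    """Toy model (hypothetical, not a Flink API): reference-counts the
    files shared between retained checkpoints."""

    def __init__(self, max_retained):
        self.max_retained = max_retained
        self.retained = []           # (checkpoint_id, files), oldest first
        self.ref_counts = Counter()  # file name -> number of retained checkpoints using it

    def register_checkpoint(self, checkpoint_id, files):
        """Record a completed checkpoint and return the set of files that
        are no longer referenced by any retained checkpoint."""
        files = frozenset(files)
        self.ref_counts.update(files)
        self.retained.append((checkpoint_id, files))

        deleted = set()
        # Drop the oldest checkpoints beyond the retention limit.
        while len(self.retained) > self.max_retained:
            _, old_files = self.retained.pop(0)
            for name in old_files:
                self.ref_counts[name] -= 1
                if self.ref_counts[name] == 0:
                    # Last reference gone: the file can be deleted.
                    del self.ref_counts[name]
                    deleted.add(name)
        return deleted

    def live_files(self):
        """Files still referenced by at least one retained checkpoint."""
        return set(self.ref_counts)


registry = SharedFileRegistry(max_retained=1)
registry.register_checkpoint(1, {"a.sst", "b.sst"})
freed = registry.register_checkpoint(2, {"b.sst", "c.sst"})
print(sorted(freed), sorted(registry.live_files()))
```

In this model, a file stays on storage for as long as any retained checkpoint still points to it, so the number of live files depends on both the retention count and how much state successive checkpoints share.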

You could help us understand this with a few more details:
  - Does it only occur with incremental checkpoints, or also with regular
    checkpoints?
  - How many checkpoints do you retain?
  - Do you use externalized checkpoints?
  - Do you use a highly-available setup with ZooKeeper?


On Thu, Jul 13, 2017 at 10:43 PM, prashantnayak <
prashant@intellifylearning.com> wrote:

> To add one more data point... it seems like the recovery directory is the
> bottleneck somehow... so if we delete the recovery directory and restart
> the job manager - it comes back and is responsive.
> Of course, we lose all jobs, since none can be recovered... and that is of
> course not ideal.
> So the question seems to be why the recovery directory grows exponentially
> in the first place.
> I can't imagine we're the only ones to see this... or we must be
> configuring something wrong while testing Flink 1.3.1
> Thanks for your help in advance
> Prashant
> --
> View this message in context:
> http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14271.html
> Sent from the Apache Flink User Mailing List archive at Nabble.com.
