flink-user mailing list archives

From Stephan Ewen <se...@apache.org>
Subject Re: S3 recovery and checkpoint directories exhibit explosive growth
Date Fri, 14 Jul 2017 16:31:27 GMT
Hi!

I am looping in Stefan and Xiaogang, who worked a lot on incremental
checkpointing.

Some background on incremental checkpoints: Incremental checkpoints store
"pieces" of the state (RocksDB SSTables) that are shared between
checkpoints. Hence they naturally use more files than non-incremental
checkpoints.
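
To make the file-sharing point concrete, here is a rough toy model in Python.
It is only a sketch under stated assumptions, not Flink's or RocksDB's actual
logic: the function name `simulate`, the `sst-N` file names, and the rule
"one new SSTable per checkpoint, full compaction every few checkpoints" are
all made up for illustration. It shows that a shared file has to stay on
storage for as long as any retained checkpoint still references it.

```python
def simulate(num_checkpoints, retain, compact_every):
    """Toy model: return the set of files that must remain on storage.

    Assumptions (hypothetical, for illustration only):
      - every checkpoint interval flushes exactly one new SSTable
      - every `compact_every` checkpoints, all live SSTables are merged
        into a single new SSTable
      - a checkpoint references whatever SSTables are live when it is taken
      - only the last `retain` checkpoints are kept
    """
    live_sst = []      # SSTables currently in the toy RocksDB instance
    checkpoints = []   # one frozenset of referenced file names per checkpoint
    next_id = 0

    for n in range(1, num_checkpoints + 1):
        # New writes since the previous checkpoint end up in a new SSTable.
        live_sst.append(f"sst-{next_id}")
        next_id += 1

        # Compaction replaces all live SSTables with one merged file.
        if n % compact_every == 0:
            live_sst = [f"sst-{next_id}"]
            next_id += 1

        checkpoints.append(frozenset(live_sst))

    # A file can be deleted only if no retained checkpoint references it.
    retained = checkpoints[-retain:]
    return set().union(*retained)


if __name__ == "__main__":
    for retain in (1, 4):
        files = simulate(num_checkpoints=6, retain=retain, compact_every=4)
        print(f"retain={retain}: {len(files)} files kept -> {sorted(files)}")
```

In this model, retaining more checkpoints across a compaction boundary keeps
both the pre-compaction and the post-compaction files alive, which is why the
number of retained checkpoints matters for the questions below.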

You could help us understand this with a few more details:
  - Does it only occur with incremental checkpoints, or also with regular
checkpoints?
  - How many checkpoints do you retain?
  - Do you use externalized checkpoints?
  - Do you use a highly-available setup with ZooKeeper?

Thanks,
Stephan



On Thu, Jul 13, 2017 at 10:43 PM, prashantnayak <
prashant@intellifylearning.com> wrote:

>
> To add one more data point... it seems like the recovery directory is the
> bottleneck somehow. If we delete the recovery directory and restart the
> job manager, it comes back and is responsive.
>
> Of course, we lose all jobs, since none can be recovered... and that is of
> course not ideal.
>
> So the question seems to be why the recovery directory grows exponentially
> in the first place.
>
> I can't imagine we're the only ones to see this... or we must be
> configuring something wrong while testing Flink 1.3.1.
>
> Thanks for your help in advance
>
> Prashant
>
>
>
