flink-user mailing list archives

From SHI Xiaogang <shixiaoga...@gmail.com>
Subject Re: S3 recovery and checkpoint directories exhibit explosive growth
Date Mon, 17 Jul 2017 02:10:20 GMT
Hi Prashantnayak

Thanks a lot for reporting this problem. Could you provide more details so
that we can address it?

I am guessing the master has to delete too many files when a checkpoint is
subsumed, which is very common in our cases. The number of files in the
recovery directory will increase if the master cannot delete these files in
time. It usually happens when the checkpoint interval is very small and the
degree of parallelism is very large.
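
To make the scaling concrete, here is a back-of-envelope sketch. It is not
taken from Flink itself: the model (a fixed number of files per subtask per
checkpoint) and every number in it are assumptions for illustration only.

```python
def files_per_hour(parallelism, stateful_operators, files_per_subtask,
                   checkpoint_interval_s):
    """Rough count of state files written per hour by periodic checkpoints.

    Assumes (hypothetically) that every subtask of every stateful operator
    writes the same number of files on each checkpoint.
    """
    checkpoints_per_hour = 3600 / checkpoint_interval_s
    files_per_checkpoint = parallelism * stateful_operators * files_per_subtask
    return checkpoints_per_hour * files_per_checkpoint


def backlog_after(hours, created_per_hour, deleted_per_hour):
    """Files left undeleted if the master deletes slower than they are created."""
    return max(0, (created_per_hour - deleted_per_hour) * hours)


# Hypothetical job: parallelism 200, 3 stateful operators, 2 files per
# subtask, one checkpoint every 10 seconds.
created = files_per_hour(200, 3, 2, 10)

# Hypothetical deletion throughput of the master: 300,000 files per hour.
leftover = backlog_after(24, created, 300_000)
print(created, leftover)
```

Under these assumed numbers, the files left behind keep accumulating for as
long as creation outpaces deletion. A larger checkpoint interval or a lower
parallelism reduces the creation rate.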

Regards,
Xiaogang


2017-07-15 0:31 GMT+08:00 Stephan Ewen <sewen@apache.org>:

> Hi!
>
> I am looping in Stefan and Xiaogang who worked a lot on incremental
> checkpointing.
>
> Some background on incremental checkpoints: Incremental checkpoints store
> "pieces" of the state (RocksDB ssTables) that are shared between
> checkpoints. Hence it naturally uses more files than non-incremental
> checkpoints.
>
> You could help us understand this with a few more details:
>   - Does it only occur with incremental checkpoints, or also with regular
> checkpoints?
>   - How many checkpoints do you retain?
>   - Do you use externalized checkpoints?
>   - Do you use a highly-available setup with ZooKeeper?
>
> Thanks,
> Stephan
>
>
>
> On Thu, Jul 13, 2017 at 10:43 PM, prashantnayak <
> prashant@intellifylearning.com> wrote:
>
>>
>> To add one more data point... it seems like the recovery directory is the
>> bottleneck somehow..  so if we delete the recovery directory and restart
>> the job manager - it comes back and is responsive.
>>
>> Of course, we lose all jobs, since none can be recovered... and that is of
>> course not ideal.
>>
>> So the question seems to be why the recovery directory grows exponentially
>> in the first place.
>>
>> I can't imagine we're the only ones to see this... or we must be
>> configuring something wrong while testing Flink 1.3.1
>>
>> Thanks for your help in advance
>>
>> Prashant
>>
>>
>>
>> --
>> View this message in context: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/S3-recovery-and-checkpoint-directories-exhibit-explosive-growth-tp14270p14271.html
>> Sent from the Apache Flink User Mailing List archive at Nabble.com.
>>
>
>
