flink-user mailing list archives

From Stefan Richter <s.rich...@data-artisans.com>
Subject Re: Having a backoff while experiencing checkpointing failures
Date Mon, 11 Jun 2018 08:08:48 GMT

I think the behaviour of min_pause_between_checkpoints is either buggy, or we should at least
discuss whether it would be better to also respect the pause after failed checkpoints. As far as
I know, there is no ongoing work to add a backoff, so I suggest you open a Jira issue and make
a case for it.
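
To make the request concrete for such an issue, here is a minimal sketch of the kind of backoff being asked for: the pause before the next checkpoint doubles with each consecutive failure, up to a cap. This is not an existing Flink API; the class name, method name, and parameters are all hypothetical and only illustrate the arithmetic.

```java
// Hypothetical helper, not part of Flink: computes how long to wait before
// triggering the next checkpoint, given how many checkpoints failed in a row.
final class CheckpointBackoff {
    private CheckpointBackoff() {}

    /**
     * Returns basePauseMillis doubled once per consecutive failure,
     * capped at maxPauseMillis. With zero failures the base pause is returned.
     */
    static long nextPauseMillis(long basePauseMillis, long maxPauseMillis, int consecutiveFailures) {
        if (basePauseMillis <= 0 || maxPauseMillis < basePauseMillis || consecutiveFailures < 0) {
            throw new IllegalArgumentException("invalid backoff parameters");
        }
        long pause = basePauseMillis;
        for (int i = 0; i < consecutiveFailures && pause < maxPauseMillis; i++) {
            // Clamp to the cap before doubling so the multiplication cannot overflow.
            pause = (pause > maxPauseMillis / 2) ? maxPauseMillis : pause * 2;
        }
        return pause;
    }
}
```

For example, with a base pause of 1 second and a cap of 60 seconds, three consecutive failures would give an 8 second pause, and the pause would stay at 60 seconds from the sixth failure onwards. A successful checkpoint would reset the failure count to zero.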


> On 08.06.2018 at 06:30, vipul singh <neoeahit@gmail.com> wrote:
> Hello all,
> Are there any recommendations on using a backoff when experiencing checkpointing failures?
> What we have seen is that when a checkpoint starts to expire, the next checkpoint doesn't care
> about the previous failure and starts soon after. We experimented with min_pause_between_checkpoints,
> however that seems to work only for successful checkpoints (the same is discussed in this
> thread <http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/minPauseBetweenCheckpoints-for-failed-checkpoints-td20152.html>).
> Are there any recommendations on how to have a backoff, or is there something in the works
> to add a backoff in case of checkpointing failures? This seems very valuable when checkpointing
> to an external location like S3, where one can potentially be throttled or get errors like
> TooBusyException from S3 (for example, as in this Jira issue <https://issues.apache.org/jira/browse/FLINK-9061>).
> Please let us know!
> Thanks,
> Vipul
