flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Robin Cassan <robin.cas...@contentsquare.com>
Subject Re: Making job fail on Checkpoint Expired?
Date Fri, 03 Apr 2020 12:35:40 GMT
Hi Congxian,

Thanks for confirming! The reason I want this behavior is because we are
currently investigating issues with checkpoints that keep getting timeouts
after the job has been running for a few hours. We observed that, after a
few timeouts, if the job was being restarted because of a lost TM for
example, the next checkpoints would be working for a few more hours.
However, if the job continues running and consuming more data, the next
checkpoints will be even bigger and the chances of them completing in time
are getting even thinner.
Crashing the job is not a viable solution I agree, but it would allow us to
generate data during the time we investigate the root cause of the timeouts.

I believe that having the option to make the job restart after a few
checkpoint timeouts would still help to avoid the snowball effect of
incremental checkpoints being bigger and bigger if the checkpoints keep
getting expired.

I'd love to get your opinion on this!


Le ven. 3 avr. 2020 à 11:17, Congxian Qiu <qcx978132955@gmail.com> a écrit :

> Currently, only checkpoint declined will be counted into
> `continuousFailureCounter`.
> Could you please share why do you want the job to fail when checkpoint
> expired?
> Best,
> Congxian
> Timo Walther <twalthr@apache.org> 于2020年4月2日周四 下午11:23写道:
>> Hi Robin,
>> this is a very good observation and maybe even unintended behavior.
>> Maybe Arvid in CC is more familiar with the checkpointing?
>> Regards,
>> Timo
>> On 02.04.20 15:37, Robin Cassan wrote:
>> > Hi all,
>> >
>> > I am wondering if there is a way to make a flink job fail (not cancel
>> > it) when one or several checkpoints have failed due to being expired
>> > (taking longer than the timeout) ?
>> > I am using Flink 1.9.2 and have set
>> > `*setTolerableCheckpointFailureNumber(1)*` which doesn't do the trick.
>> > Looking into the CheckpointFailureManager.java class, it looks like
>> this
>> > only works when the checkpoint failure reason is
>> > `*CHECKPOINT_DECLINED*`, but the number of failures isn't incremented
>> on
>> > Am I missing something?
>> >
>> > Thanks!

View raw message