flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10074) Allowable number of checkpoint failures
Date Tue, 14 Aug 2018 09:48:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16579555#comment-16579555
] 

Till Rohrmann commented on FLINK-10074:
---------------------------------------

Since {{setFailOnCheckpointingErrors}} is public, we cannot simply change its signature. What
we could do, though, is add another method {{setNumberTolerableCheckpointFailures(int)}}
whose value defaults to {{0}} and is only respected if {{setFailOnCheckpointingErrors}}
is set to {{true}}. So if the user only calls {{setFailOnCheckpointingErrors(true)}}, then
they will get the same old behaviour. Only after calling {{setNumberTolerableCheckpointFailures(10)}}
will it wait for 10 checkpoint failures before failing. If {{setNumberTolerableCheckpointFailures}}
is set but {{setFailOnCheckpointingErrors(false)}} is called, then checkpoint failures won't fail the
job.
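
To make the proposed interaction between the two settings concrete, here is a minimal sketch.
It is not Flink code: the class {{CheckpointFailurePolicy}} and its callbacks
{{onCheckpointFailure}} and {{onCheckpointSuccess}} are hypothetical names used only for illustration.
It assumes that failures are counted consecutively and that a successful checkpoint resets the
counter (the issue description speaks of "subsequent" failures, but the comment above does not
settle this), that a tolerance of N means the job fails on failure N + 1, and that
{{setFailOnCheckpointingErrors}} defaults to {{true}}.

```java
/**
 * Hypothetical model of the proposed semantics. Not part of Flink.
 */
final class CheckpointFailurePolicy {
    // Assumption: failing on checkpoint errors is enabled by default.
    private boolean failOnCheckpointingErrors = true;
    // Proposed default of 0 keeps the old behaviour: the first failure fails the job.
    private int tolerableFailures = 0;
    private int failureCount = 0;

    void setFailOnCheckpointingErrors(boolean failOnCheckpointingErrors) {
        this.failOnCheckpointingErrors = failOnCheckpointingErrors;
    }

    void setNumberTolerableCheckpointFailures(int tolerableFailures) {
        if (tolerableFailures < 0) {
            throw new IllegalArgumentException("tolerableFailures must be >= 0");
        }
        this.tolerableFailures = tolerableFailures;
    }

    /** Records a failed checkpoint and returns true if the job should now fail. */
    boolean onCheckpointFailure() {
        if (!failOnCheckpointingErrors) {
            // The threshold is ignored: checkpoint failures never fail the job.
            return false;
        }
        failureCount++;
        return failureCount > tolerableFailures;
    }

    /** Assumption: a completed checkpoint resets the failure counter. */
    void onCheckpointSuccess() {
        failureCount = 0;
    }
}
```

With the defaults, the first call to {{onCheckpointFailure()}} returns {{true}}. After
{{setNumberTolerableCheckpointFailures(10)}}, ten failures in a row are tolerated and the eleventh
returns {{true}}. After {{setFailOnCheckpointingErrors(false)}}, it always returns {{false}},
whatever the threshold.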

[~thw] would you not reset the counter in case of a restart? This would be hard to do in case
of a JobManager failover and would lead to different behaviours depending on the actual fault.

> Allowable number of checkpoint failures 
> ----------------------------------------
>
>                 Key: FLINK-10074
>                 URL: https://issues.apache.org/jira/browse/FLINK-10074
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>            Reporter: Thomas Weise
>            Assignee: vinoyang
>            Priority: Major
>
> For intermittent checkpoint failures it is desirable to have a mechanism to avoid restarts.
If, for example, a transient S3 error prevents checkpoint completion, the next checkpoint
may very well succeed. The user may wish to avoid the expense of a restart in such a scenario,
and this could be expressed with a failure threshold (number of subsequent checkpoint failures),
possibly combined with a list of exceptions to tolerate.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
