flink-issues mailing list archives

From "vinoyang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-10074) Allowable number of checkpoint failures
Date Thu, 09 Aug 2018 15:51:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-10074?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16575038#comment-16575038 ]

vinoyang commented on FLINK-10074:
----------------------------------

[~till.rohrmann] Yes, I agree with you. If we focus on time, it becomes more complicated
for users, because there are multiple time-related configurations, and users would need to
understand some of their details. If we focus on counts instead, it is more user friendly,
for example a maximum number of timeouts and failures.

> Allowable number of checkpoint failures 
> ----------------------------------------
>
>                 Key: FLINK-10074
>                 URL: https://issues.apache.org/jira/browse/FLINK-10074
>             Project: Flink
>          Issue Type: Improvement
>          Components: State Backends, Checkpointing
>            Reporter: Thomas Weise
>            Assignee: vinoyang
>            Priority: Major
>
> For intermittent checkpoint failures it is desirable to have a mechanism to avoid restarts.
> If, for example, a transient S3 error prevents checkpoint completion, the next checkpoint
> may very well succeed. The user may wish not to incur the expense of a restart in such a
> scenario, and this could be expressed with a failure threshold (number of subsequent
> checkpoint failures), possibly combined with a list of exceptions to tolerate.
>  
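
The threshold described in the issue could be sketched roughly as follows. This is only an
illustration, not Flink code: the class CheckpointFailureTolerance and its methods are
hypothetical names, and it assumes that "subsequent" failures means consecutive failures,
that a completed checkpoint resets the count, and that an exception outside the tolerated
list should trigger a restart immediately.

```java
import java.util.ArrayList;
import java.util.List;

/**
 * Hypothetical helper (not part of Flink): tracks consecutive checkpoint
 * failures and decides when the job should be restarted.
 */
final class CheckpointFailureTolerance {
    private final int maxConsecutiveFailures;
    private final List<Class<? extends Throwable>> toleratedExceptions;
    private int consecutiveFailures = 0;

    CheckpointFailureTolerance(int maxConsecutiveFailures,
                               List<Class<? extends Throwable>> toleratedExceptions) {
        if (maxConsecutiveFailures < 0) {
            throw new IllegalArgumentException("maxConsecutiveFailures must be >= 0");
        }
        this.maxConsecutiveFailures = maxConsecutiveFailures;
        // Defensive copy so later changes to the caller's list have no effect.
        this.toleratedExceptions = new ArrayList<>(toleratedExceptions);
    }

    /** Assumption: a completed checkpoint resets the failure count. */
    void onSuccess() {
        consecutiveFailures = 0;
    }

    /**
     * Records a failed checkpoint.
     *
     * @return true if the job should be restarted, false if the failure is tolerated
     */
    boolean onFailure(Throwable cause) {
        boolean tolerated = false;
        for (Class<? extends Throwable> type : toleratedExceptions) {
            if (type.isInstance(cause)) {
                tolerated = true;
                break;
            }
        }
        if (!tolerated) {
            // Assumption: exceptions outside the list are never tolerated.
            return true;
        }
        consecutiveFailures++;
        return consecutiveFailures > maxConsecutiveFailures;
    }
}
```

With a threshold of 2 and java.io.IOException in the tolerated list, two IOException
failures in a row are tolerated and a third one requests a restart. A successful checkpoint
in between starts the count over.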



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
