flink-issues mailing list archives

From "Sihua Zhou (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-9661) TTL state should support to do time shift after restoring from checkpoint( savepoint).
Date Tue, 26 Jun 2018 03:40:00 GMT
Sihua Zhou created FLINK-9661:

             Summary: TTL state should support to do time shift after restoring from checkpoint( savepoint).
                 Key: FLINK-9661
                 URL: https://issues.apache.org/jira/browse/FLINK-9661
             Project: Flink
          Issue Type: Improvement
          Components: State Backends, Checkpointing
    Affects Versions: 1.6.0
            Reporter: Sihua Zhou

The initial version of TTL state stores an expiration timestamp alongside each state record
and, when the state is accessed, checks the condition {{expired_timestamp <= current_time}}.
If the condition is true, the record is expired; otherwise it is still alive. This works well
in most cases, but in some cases we need to apply a time shift, otherwise the check can produce
unexpected results when processing time is used. I roughly describe two such cases below.

- When restoring the job from a savepoint

For example, suppose the user sets the TTL of the state to 2h. If they trigger a savepoint and
restore the job from it 2h or more later (perhaps something prevented a quick restore), then all
of the restored job's previous state data is already expired.

- When the job takes a long time to recover from a failure

For example, many jobs run on a YARN session cluster that is configured to store checkpoint
data on a DFS. Unfortunately, the DFS hits a strange problem that puts the jobs on the cluster
into a recovery-fail-recovery-fail... loop. The devs spend some time fixing the DFS issue and
the jobs start working properly again, but if {{system down time >= TTL}}, then the jobs'
previous state data will be expired in this case.
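
Both cases can be illustrated with a minimal sketch of the expiration check under processing
time. The class and method names below ({{TtlExpiryCheck}}, {{expirationTimestamp}},
{{isExpired}}) are hypothetical and are not Flink's actual TTL implementation, and the timeline
(record written at t = 0, savepoint or failure at 1h, job running again at 4h) is an assumption
chosen only for illustration.

```java
// Hypothetical sketch of the check described in this issue; not Flink's real TTL code.
public final class TtlExpiryCheck {
    private TtlExpiryCheck() {}

    /** Expiration timestamp that is stored alongside the state record. */
    public static long expirationTimestamp(long lastAccessMillis, long ttlMillis) {
        return lastAccessMillis + ttlMillis;
    }

    /** The condition from the issue: expired_timestamp <= current_time. */
    public static boolean isExpired(long expirationTimestampMillis, long currentTimeMillis) {
        return expirationTimestampMillis <= currentTimeMillis;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        long ttl = 2 * hour;          // TTL of 2h, as in the savepoint example
        long writtenAt = 0L;          // assumed: record written at t = 0
        long savepointAt = 1 * hour;  // assumed: savepoint taken (or failure starts) at t = 1h
        long restoredAt = 4 * hour;   // assumed: job is running again at t = 4h

        long expiresAt = expirationTimestamp(writtenAt, ttl);

        // The record has only lived for 1h of job run time when the job stops.
        System.out.println("expired at savepoint: " + isExpired(expiresAt, savepointAt));  // false
        // After 3h of downtime the wall clock has passed the stored timestamp.
        System.out.println("expired after restore: " + isExpired(expiresAt, restoredAt));  // true
    }
}
```

In this timeline the record was alive for only 1h while the job was running, yet it is
treated as expired on the first access after the restore because the downtime counts
against the TTL.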

To avoid the problems above, we need to apply a time shift after the job recovers from a
checkpoint or savepoint. A possible approach is outlined in [6186|https://github.com/apache/flink/pull/6186].
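
One way to picture such a time shift is sketched below. This is only an assumption about what
a shift could look like and does not claim to reproduce the approach in the linked pull request:
the class {{TtlTimeShift}} and its methods are hypothetical, and the sketch assumes that the
timestamp at which the checkpoint or savepoint was taken is available at restore time.

```java
// Hypothetical sketch of a time shift on restore; not taken from Flink or from PR 6186.
public final class TtlTimeShift {
    private TtlTimeShift() {}

    /** Time during which the job was not running, clamped at zero. */
    public static long downtime(long checkpointTimestampMillis, long restoreTimestampMillis) {
        return Math.max(0L, restoreTimestampMillis - checkpointTimestampMillis);
    }

    /** Moves a stored expiration timestamp forward by the downtime. */
    public static long shiftedExpiration(long expirationTimestampMillis, long shiftMillis) {
        return expirationTimestampMillis + shiftMillis;
    }

    /** Same check as in the issue: expired_timestamp <= current_time. */
    public static boolean isExpired(long expirationTimestampMillis, long currentTimeMillis) {
        return expirationTimestampMillis <= currentTimeMillis;
    }

    public static void main(String[] args) {
        long hour = 60L * 60L * 1000L;
        long expiresAt = 2 * hour;    // assumed: record written at t = 0 with a TTL of 2h
        long checkpointAt = 1 * hour; // assumed: checkpoint/savepoint taken at t = 1h
        long restoredAt = 4 * hour;   // assumed: job restored at t = 4h

        long shift = downtime(checkpointAt, restoredAt);       // 3h of downtime
        long shifted = shiftedExpiration(expiresAt, shift);    // new expiration at t = 5h

        System.out.println("expired without shift: " + isExpired(expiresAt, restoredAt)); // true
        System.out.println("expired with shift: " + isExpired(shifted, restoredAt));      // false
    }
}
```

With the shift applied, the record keeps the 1h of lifetime it still had when the checkpoint
was taken, so it expires 1h after the restore rather than immediately.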

