flink-issues mailing list archives

From "vinoyang (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-9352) In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
Date Mon, 14 May 2018 07:58:00 GMT
vinoyang created FLINK-9352:
-------------------------------

             Summary: In Standalone checkpoint recover mode many jobs with same checkpoint interval cause IO pressure
                 Key: FLINK-9352
                 URL: https://issues.apache.org/jira/browse/FLINK-9352
             Project: Flink
          Issue Type: Bug
          Components: State Backends, Checkpointing
            Reporter: vinoyang
            Assignee: vinoyang


Currently, the periodic checkpoint coordinator's startCheckpointScheduler uses *baseInterval*
as the initialDelay parameter. *baseInterval* is also the checkpoint interval.

In standalone checkpoint recovery mode, many jobs are configured with the same checkpoint interval.
When all jobs are recovered at once (after a cluster restart or a JobManager leadership switch),
their checkpoint periods tend to align: every job's CheckpointCoordinator starts and then triggers
its checkpoints at approximately the same point in time.

In our production scenario, this caused high IO cost concentrated in the same time period.

I suggest exposing the initial delay parameter of scheduleAtFixedRate as an API config option,
so that users can scatter checkpoints across jobs in this scenario.
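
A minimal sketch of the idea, using plain java.util.concurrent rather than Flink's actual
CheckpointCoordinator code. The class ScatteredCheckpointScheduler and the methods
randomInitialDelay and start are hypothetical names introduced only for illustration; the
sketch assumes the initial delay is chosen independently of the checkpoint interval, for
example at random within a user-configured range.

```java
import java.util.concurrent.ScheduledExecutorService;
import java.util.concurrent.ScheduledFuture;
import java.util.concurrent.ThreadLocalRandom;
import java.util.concurrent.TimeUnit;

// Hypothetical helper, not part of Flink: decouples the initial delay of the
// periodic checkpoint trigger from the checkpoint interval itself.
class ScatteredCheckpointScheduler {

    // Picks an initial delay uniformly from [minDelayMs, maxDelayMs] so that
    // jobs recovered at the same moment do not all fire at the same time.
    static long randomInitialDelay(long minDelayMs, long maxDelayMs) {
        if (minDelayMs < 0 || maxDelayMs < minDelayMs || maxDelayMs == Long.MAX_VALUE) {
            throw new IllegalArgumentException("invalid delay range");
        }
        // The upper bound of nextLong is exclusive, hence the + 1.
        return ThreadLocalRandom.current().nextLong(minDelayMs, maxDelayMs + 1);
    }

    // Schedules the trigger with an explicit initial delay instead of reusing
    // baseIntervalMs for both arguments, which is the behaviour described above.
    static ScheduledFuture<?> start(
            ScheduledExecutorService timer,
            Runnable trigger,
            long baseIntervalMs,
            long initialDelayMs) {
        return timer.scheduleAtFixedRate(
                trigger, initialDelayMs, baseIntervalMs, TimeUnit.MILLISECONDS);
    }
}
```

With such a helper, each job could call start(timer, trigger, baseInterval,
randomInitialDelay(0, baseInterval)) so that the first checkpoints of co-recovered jobs are
spread over one interval; passing baseInterval as the initial delay reproduces the current
behaviour.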

 

cc [~StephanEwen] [~Zentol]



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
