flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-7844) Fine Grained Recovery triggers checkpoint timeout failure
Date Mon, 16 Oct 2017 14:54:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-7844?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16206017#comment-16206017 ]

Till Rohrmann commented on FLINK-7844:
--------------------------------------

Hi [~zhenzhongxu],

thanks for the feedback. The checkpoint timeout is currently hardcoded to 10 minutes. We should
definitely expose this to the user via a configuration option.
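
As an illustration, once the timeout is exposed, setting it per job could look like the following minimal sketch, assuming a {{setCheckpointTimeout}} setter on the streaming {{CheckpointConfig}} (the interval and timeout values are arbitrary placeholders):

{code:java}
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class CheckpointTimeoutExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env = StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 30 seconds.
        env.enableCheckpointing(30_000L);

        // Fail any checkpoint that has not completed within 2 minutes,
        // instead of waiting for the 10 minute default.
        env.getCheckpointConfig().setCheckpointTimeout(120_000L);

        // ... build and execute the job as usual ...
    }
}
{code}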

Moreover, it looks to me as if we don't fail the currently ongoing checkpoint in case of a
fine-grained recovery. We also don't resend the checkpoint barriers for the restarted component.
Therefore, the checkpoint will never complete and the {{CheckpointCoordinator}} will wait
until the checkpoint times out. The easiest solution would be to fail all currently ongoing
checkpoints upon fine-grained recovery. That's also what happens in the case of a global recovery.
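
Conceptually, that fix would look something like the sketch below. The types here are simplified stand-ins for illustration only, not the real coordinator internals:

{code:java}
import java.util.ArrayList;
import java.util.List;

// Simplified stand-in for a checkpoint that is still waiting for acknowledgements.
class PendingCheckpoint {
    final long checkpointId;

    PendingCheckpoint(long checkpointId) {
        this.checkpointId = checkpointId;
    }

    void abort(String reason) {
        System.out.println("Aborting checkpoint " + checkpointId + ": " + reason);
    }
}

// Simplified stand-in for the coordinator logic; not the actual Flink internals.
class SimplifiedCheckpointCoordinator {
    private final List<PendingCheckpoint> pendingCheckpoints = new ArrayList<>();

    void startCheckpoint(long checkpointId) {
        pendingCheckpoints.add(new PendingCheckpoint(checkpointId));
    }

    // On fine-grained (individual) recovery, fail every checkpoint that is still pending:
    // the restarted task never receives the barriers of those checkpoints, so without
    // this step they can only complete by timing out.
    void onIndividualTaskRestart() {
        for (PendingCheckpoint pending : pendingCheckpoints) {
            pending.abort("task restarted by fine-grained recovery");
        }
        pendingCheckpoints.clear();
    }
}
{code}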

I will split this issue into two: a new one for adding the configuration option for the timeout,
and this one for aborting all currently ongoing checkpoints.

> Fine Grained Recovery triggers checkpoint timeout failure
> ---------------------------------------------------------
>
>                 Key: FLINK-7844
>                 URL: https://issues.apache.org/jira/browse/FLINK-7844
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.3.2
>            Reporter: Zhenzhong Xu
>         Attachments: screenshot-1.png
>
>
> Context: 
> We are using the "individual" (fine-grained) failover recovery strategy for our embarrassingly
parallel router use case. The topic has over 2000 partitions, and parallelism is set to ~180,
dispatched across over 20 task managers with around 180 slots.
> Observations:
> We've noticed that after one task manager termination, even though the individual recovery
happens correctly and the workload is re-dispatched to a newly available task manager instance,
the checkpoint still takes 10 minutes to eventually time out, leaving all other task managers
unable to commit checkpoints. In the worst case, if the job gets restarted for other reasons
(e.g. job manager termination), this causes more messages to be re-processed/duplicated
compared to a job without fine-grained recovery enabled.
> I suspect that the uber checkpoint was waiting for a previous checkpoint that was initiated
by the old task manager and was therefore taking a long time to time out.
> Two questions:
> 1. Is there a configuration that controls this checkpoint timeout?
> 2. Is there any reason why, when the Job Manager realizes that the Task Manager is gone and the
workload is re-dispatched, it still needs to wait for the checkpoint initiated by the old task manager?
> Checkpoint screenshot in attachments.
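
For context, the "individual" failover strategy described above is selected in the job manager configuration; a minimal {{flink-conf.yaml}} sketch, assuming the {{jobmanager.execution.failover-strategy}} option available in this version:

{code}
# Restart only the failed task instead of the full job (fine-grained recovery).
jobmanager.execution.failover-strategy: individual
{code}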



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
