flink-issues mailing list archives

From "Piotr Nowojski (Jira)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-18641) "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
Date Thu, 23 Jul 2020 11:05:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-18641?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=17163425#comment-17163425 ]

Piotr Nowojski commented on FLINK-18641:
----------------------------------------

[~becket_qin] yes, you are right that the problem was introduced by FLINK-13905 in 1.11.0.

{quote}
By design, the checkpoint should always actually take the snapshot of the master hooks and
OperatorCoordinator first before taking the checkpoint on the tasks.
{quote}
Regarding the ordering guarantees: I don't understand where this strict ordering comes
from. The Javadoc of {{MasterTriggerRestoreHook}} doesn't seem to support this statement:
{code} 
	 * <p>If the action should be executed asynchronously and only needs to complete before the
	 * checkpoint is considered completed, then the method may use the given executor to execute the
	 * actual action and would signal its completion by completing the future.
{code}
So synchronously waiting for the hook's future to complete would solve the problem, but what
I proposed above:
{quote}
And the solution would be to include waiting for async hook to complete, before completing/finalising
the checkpoint?
{quote}
Should also be correct, right?
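
To make the second option concrete, here is a minimal sketch of "finalize only once both the task acknowledgements and the async hook have completed". It uses only {{java.util.concurrent.CompletableFuture}}; the class {{PendingCheckpointSketch}} and its fields are hypothetical names for illustration and are not the actual {{PendingCheckpoint}} or {{CheckpointCoordinator}} code.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.concurrent.CompletableFuture;

public class AsyncHookSketch {

    /** Hypothetical stand-in for a pending checkpoint; NOT Flink's PendingCheckpoint. */
    static final class PendingCheckpointSketch {
        // Completed when every task has acknowledged the checkpoint.
        final CompletableFuture<Void> allTasksAcknowledged = new CompletableFuture<>();
        // Completed by the (asynchronous) master hook with its serialized state.
        final CompletableFuture<String> masterHookState = new CompletableFuture<>();
        final List<String> events = new ArrayList<>();

        /** Finalization runs only after BOTH futures are complete, in either order. */
        CompletableFuture<String> finalizeWhenReady() {
            return allTasksAcknowledged.thenCombine(masterHookState, (ignored, hookState) -> {
                events.add("finalized with " + hookState);
                return hookState;
            });
        }
    }

    /** Simulates the problematic ordering: tasks acknowledge before the hook finishes. */
    static List<String> run() {
        PendingCheckpointSketch cp = new PendingCheckpointSketch();
        CompletableFuture<String> done = cp.finalizeWhenReady();

        cp.events.add("tasks acknowledged");
        cp.allTasksAcknowledged.complete(null);
        cp.events.add("finalized yet? " + done.isDone());

        cp.events.add("hook completed");
        cp.masterHookState.complete("hook-state");
        cp.events.add("finalized yet? " + done.isDone());
        return cp.events;
    }

    public static void main(String[] args) {
        run().forEach(System.out::println);
    }
}
```

In this sketch the last task acknowledgement arriving before the hook's future completes simply defers finalization instead of failing a precondition. Whether this fits the real coordinator depends on its threading model, which the sketch does not attempt to reproduce.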

As a side note, I'm +1 for the easiest solution that fixes this bug without causing regressions
compared to 1.10/1.9 and without undermining the {{CheckpointCoordinator}} threading model refactor
(which we still need to complete). As you both mentioned, {{MasterTriggerRestoreHook}} is
on its way out, to be replaced by FLIP-27.

> "Failure to finalize checkpoint" error in MasterTriggerRestoreHook
> ------------------------------------------------------------------
>
>                 Key: FLINK-18641
>                 URL: https://issues.apache.org/jira/browse/FLINK-18641
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>    Affects Versions: 1.11.0
>            Reporter: Brian Zhou
>            Priority: Major
>
> https://github.com/pravega/flink-connectors is a Pravega connector for Flink. The ReaderCheckpointHook[1]
> class uses the Flink `MasterTriggerRestoreHook` interface to trigger a Pravega checkpoint
> during Flink checkpoints to ensure data can be recovered. The checkpoint recovery tests
> run fine on Flink 1.10, but on Flink 1.11 they hit the issue below and time out.
> We suspect it is related to the checkpoint coordinator threading model changes in Flink 1.11.
> Error stacktrace:
> {code}
> 2020-07-09 15:39:39,999 30945 [jobmanager-future-thread-5] WARN  o.a.f.runtime.jobmaster.JobMaster - Error while processing checkpoint acknowledgement message
> org.apache.flink.runtime.checkpoint.CheckpointException: Could not finalize the pending checkpoint 3. Failure reason: Failure to finalize checkpoint.
>          at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1033)
>          at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.receiveAcknowledgeMessage(CheckpointCoordinator.java:948)
>          at org.apache.flink.runtime.scheduler.SchedulerBase.lambda$acknowledgeCheckpoint$4(SchedulerBase.java:802)
>          at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>          at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>          at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>          at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>          at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>          at java.lang.Thread.run(Thread.java:748)
> Caused by: org.apache.flink.util.SerializedThrowable: Pending checkpoint has not been fully acknowledged yet
>          at org.apache.flink.util.Preconditions.checkState(Preconditions.java:195)
>          at org.apache.flink.runtime.checkpoint.PendingCheckpoint.finalizeCheckpoint(PendingCheckpoint.java:298)
>          at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.completePendingCheckpoint(CheckpointCoordinator.java:1021)
>          ... 9 common frames omitted
> {code}
> More detail in this mailing thread: http://apache-flink-user-mailing-list-archive.2336050.n4.nabble.com/Pravega-connector-cannot-recover-from-the-checkpoint-due-to-quot-Failure-to-finalize-checkpoint-quot-td36652.html
> Also in https://github.com/pravega/flink-connectors/issues/387



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
