flink-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Roman Khachatryan (Jira)" <j...@apache.org>
Subject [jira] [Resolved] (FLINK-17869) Fix the race condition of aborting unaligned checkpoint
Date Wed, 10 Jun 2020 20:35:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-17869?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel
]

Roman Khachatryan resolved FLINK-17869.
---------------------------------------
    Fix Version/s: 1.12.0
       Resolution: Fixed

> Fix the race condition of aborting unaligned checkpoint
> -------------------------------------------------------
>
>                 Key: FLINK-17869
>                 URL: https://issues.apache.org/jira/browse/FLINK-17869
>             Project: Flink
>          Issue Type: Bug
>          Components: Runtime / Checkpointing
>            Reporter: Zhijiang
>            Assignee: Roman Khachatryan
>            Priority: Blocker
>              Labels: pull-request-available
>             Fix For: 1.11.0, 1.12.0
>
>
> On ChannelStateWriter side, the lifecycle of checkpoint should be as follows:
> start -> in progress/abort -> stop.
> The ChannelStateWriteResult is created during #start, and removed by #abort or #stop
processes. There are some potential race conditions here:
>  * #start is called while receiving the first barrier by netty thread and schedule to
execute the checkpoint
>  * The task thread might process cancel checkpoint and call #abort before performing
the above respective checkpoint
>  * The checkpoint can still be executed by task thread afterwards even thought the above
abort happened before, because we can not remove the checkpoint action from mailbox during
aborting.
>  * While checkpoint executing, it will call `ChannelStateWriter#getWriteResult` then
it would cause `IllegalStateException` because the respective result was already removed in
advance during handling #abort method before.
>  * Therefore it will cause unnecessary task failure during performing checkpoint
> I guess we do not want to fail the task when one checkpoint is aborted by design. And
the illegal state check during ChannelStateWriter#getWriteResult was mainly proposed for
normal process validation I guess.
> If we do not remove the ChannelStateWriteResult while handling #abort and rely on #stop
to remove it, then it might probably exist another scenario that the checkpoint will never
be performed after #start (we have another mechanism to exit the triggering checkpoint in
advance if the abort is sent by CheckpointCoordinator), then the legacy ChannelStateWriteResult
will be retained inside ChannelStateWriter long time.
> Maybe the potential option to fix this issue is to let SubtaskCheckpointCoordinatorImpl
handle the exception from ChannelStateWriter#getWriteResult properly to not fail the task
in the aborted case.



--
This message was sent by Atlassian Jira
(v8.3.4#803005)

Mime
View raw message