flink-issues mailing list archives

From "Zhu Zhu (Jira)" <j...@apache.org>
Subject [jira] [Created] (FLINK-21707) Job is possible to hang when restarting a FINISHED task with POINTWISE BLOCKING consumers
Date Wed, 10 Mar 2021 07:55:00 GMT
Zhu Zhu created FLINK-21707:
-------------------------------

             Summary: Job is possible to hang when restarting a FINISHED task with POINTWISE
BLOCKING consumers
                 Key: FLINK-21707
                 URL: https://issues.apache.org/jira/browse/FLINK-21707
             Project: Flink
          Issue Type: Bug
          Components: Runtime / Coordination
    Affects Versions: 1.12.2, 1.11.3, 1.13.0
            Reporter: Zhu Zhu


A job may hang when restarting a FINISHED task with POINTWISE BLOCKING consumers.
This is because {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}} tries to
schedule all consumer tasks/regions of the finished *ExecutionJobVertex*, including
regions that are not actual consumers of the finished *ExecutionVertex*. In this case, some
of those regions can be in a state other than CREATED, because they are not connected to, and
therefore not affected by, the restarted tasks. However, {{PipelinedRegionSchedulingStrategy#maybeScheduleRegion()}}
does not allow scheduling a non-CREATED region: it throws an exception, which breaks the
scheduling of all the other regions. An example that reproduces the problem can be found at
[PipelinedRegionSchedulingITCase#testRecoverFromPartitionException |https://github.com/zhuzhurk/flink/commit/1eb036b6566c5cb4958d9957ba84dc78ce62a08c].

To fix the problem, we can add a filter in {{PipelinedRegionSchedulingStrategy#onExecutionStateChange()}}
so that it only triggers the scheduling of regions in the CREATED state.
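
The idea can be illustrated with a minimal, self-contained sketch. This is not the Flink code: {{Region}}, {{RegionState}}, {{sampleRegions()}}, {{scheduleAllConsumers()}} and {{scheduleCreatedConsumers()}} are hypothetical stand-ins, and {{maybeScheduleRegion()}} here only mimics the behaviour described above (rejecting a non-CREATED region). It is meant to show why one non-CREATED region can block the regions after it, and how a filter might avoid that.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

class CreatedRegionFilterSketch {

    /** Hypothetical stand-in for the state of a pipelined region. */
    enum RegionState { CREATED, SCHEDULED, RUNNING }

    /** Hypothetical stand-in for a pipelined region. */
    static final class Region {
        final String name;
        RegionState state;

        Region(String name, RegionState state) {
            this.name = name;
            this.state = state;
        }
    }

    /** Mimics the described behaviour: a non-CREATED region is rejected with an exception. */
    static void maybeScheduleRegion(Region region, List<String> scheduled) {
        if (region.state != RegionState.CREATED) {
            throw new IllegalStateException("Region " + region.name + " is in state " + region.state);
        }
        region.state = RegionState.SCHEDULED;
        scheduled.add(region.name);
    }

    /** Unfiltered variant: the first non-CREATED region aborts the whole loop. */
    static List<String> scheduleAllConsumers(List<Region> consumers) {
        List<String> scheduled = new ArrayList<>();
        for (Region region : consumers) {
            maybeScheduleRegion(region, scheduled);
        }
        return scheduled;
    }

    /** Filtered variant (the proposed idea): only CREATED regions are passed on. */
    static List<String> scheduleCreatedConsumers(List<Region> consumers) {
        List<String> scheduled = new ArrayList<>();
        consumers.stream()
                .filter(region -> region.state == RegionState.CREATED)
                .forEach(region -> maybeScheduleRegion(region, scheduled));
        return scheduled;
    }

    /** Three consumer regions; r2 is unaffected by the restart and still RUNNING. */
    static List<Region> sampleRegions() {
        return Arrays.asList(
                new Region("r1", RegionState.CREATED),
                new Region("r2", RegionState.RUNNING),
                new Region("r3", RegionState.CREATED));
    }

    public static void main(String[] args) {
        try {
            scheduleAllConsumers(sampleRegions());
        } catch (IllegalStateException e) {
            // r3 was never reached, so in a real job it would wait forever.
            System.out.println("Unfiltered scheduling failed: " + e.getMessage());
        }
        System.out.println("Filtered scheduling: " + scheduleCreatedConsumers(sampleRegions()));
    }
}
```

In the unfiltered variant, r3 is never scheduled because the exception for r2 ends the loop; with the filter, r2 is skipped and both r1 and r3 are scheduled.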



--
This message was sent by Atlassian Jira
(v8.3.4#803005)
