flink-issues mailing list archives

From "Till Rohrmann (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9788) ExecutionGraph Inconsistency prevents Job from recovering
Date Tue, 09 Oct 2018 07:50:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9788?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16642919#comment-16642919 ]

Till Rohrmann commented on FLINK-9788:
--------------------------------------

Arg, this sounds quite bad. Thanks a lot for diagnosing the problem [~SleePy]. We should definitely
fix this problem for 1.7. I'll mark it as a blocker.

I can think of two high-level solutions here:
1. Ignore failures while the job is in state RESTARTING, because they must originate from the previous
run. Here we need to check whether {{failGlobal}} is really only called by a running {{ExecutionGraph}}.
2. Cancel the subsumed restarting operation such that eventually the latest restarting operation
will succeed.
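
A minimal sketch of how the two options could look, assuming a heavily simplified state machine. The names {{MiniGraph}}, {{globalVersion}} and the {{ignoreWhileRestarting}} flag are hypothetical and only serve the illustration; this is not Flink's actual {{ExecutionGraph}} code.

```java
// Hypothetical, simplified model of a job state machine; not Flink's ExecutionGraph.
class RestartGuardSketch {

    enum JobStatus { RUNNING, RESTARTING }

    static class MiniGraph {
        private JobStatus status = JobStatus.RUNNING;
        // Incremented on every accepted global failure; identifies the latest restart.
        private long globalVersion = 0;
        private int restartsPerformed = 0;

        /**
         * Registers a global failure and returns the version that the scheduled
         * restart must present later. With ignoreWhileRestarting = true (option 1),
         * a failure arriving while the job is already RESTARTING is dropped and
         * -1 is returned, so no second restart gets scheduled.
         */
        synchronized long failGlobal(boolean ignoreWhileRestarting) {
            if (ignoreWhileRestarting && status == JobStatus.RESTARTING) {
                return -1L;
            }
            status = JobStatus.RESTARTING;
            return ++globalVersion;
        }

        /**
         * Option 2: a restart only runs if it belongs to the latest failure.
         * A restart scheduled for an older version is treated as subsumed and
         * becomes a no-op instead of resetting vertices a second time.
         */
        synchronized boolean restart(long expectedVersion) {
            if (status != JobStatus.RESTARTING || expectedVersion != globalVersion) {
                return false;
            }
            restartsPerformed++;
            status = JobStatus.RUNNING;
            return true;
        }

        synchronized int getRestartsPerformed() {
            return restartsPerformed;
        }

        synchronized JobStatus getStatus() {
            return status;
        }
    }

    public static void main(String[] args) {
        // Option 2: two failures in a row, only the latest restart takes effect.
        MiniGraph g = new MiniGraph();
        long first = g.failGlobal(false);
        long second = g.failGlobal(false);
        System.out.println("stale restart ran:  " + g.restart(first));
        System.out.println("latest restart ran: " + g.restart(second));

        // Option 1: the second failure is ignored while RESTARTING.
        MiniGraph h = new MiniGraph();
        long accepted = h.failGlobal(true);
        long ignored = h.failGlobal(true);
        System.out.println("second failure ignored: " + (ignored == -1L));
        System.out.println("restart ran:            " + h.restart(accepted));
    }
}
```

In this sketch both options end with exactly one restart being executed. Option 1 relies on every failure seen during RESTARTING being stale, which is the assumption that would need to be checked for {{failGlobal}}; option 2 does not need that assumption because each restart is tied to the failure that scheduled it.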

> ExecutionGraph Inconsistency prevents Job from recovering
> ---------------------------------------------------------
>
>                 Key: FLINK-9788
>                 URL: https://issues.apache.org/jira/browse/FLINK-9788
>             Project: Flink
>          Issue Type: Bug
>          Components: Core
>    Affects Versions: 1.6.0
>         Environment: Rev: 4a06160
> Hadoop 2.8.3
>            Reporter: Gary Yao
>            Priority: Critical
>             Fix For: 1.7.0, 1.6.2
>
>         Attachments: jobmanager_5000.log
>
>
> Deployment mode: YARN job mode with HA
> After killing many TaskManagers in succession, the ExecutionGraph ran into an inconsistent state, which prevented job recovery. The following stack trace was logged in the JobManager log several hundred times per second:
> {noformat}
> -08 16:47:18,855 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Job General purpose test job (37a794195840700b98feb23e99f7ea24) switched from state RESTARTING to RESTARTING.
> 2018-07-08 16:47:18,856 INFO  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Restarting the job General purpose test job (37a794195840700b98feb23e99f7ea24).
> 2018-07-08 16:47:18,857 DEBUG org.apache.flink.runtime.executiongraph.ExecutionGraph        - Resetting execution vertex Source: Custom Source -> Timestamps/Watermarks (1/10) for new execution.
> 2018-07-08 16:47:18,857 WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to restart the job.
> java.lang.IllegalStateException: Cannot reset a vertex that is in non-terminal state CREATED
>         at org.apache.flink.runtime.executiongraph.ExecutionVertex.resetForNewExecution(ExecutionVertex.java:610)
>         at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:573)
>         at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1251)
>         at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestartCallback.triggerFullRecovery(ExecutionGraphRestartCallback.java:59)
>         at org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy$1.run(FixedDelayRestartStrategy.java:68)
>         at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>         at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>         at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>         at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
>         at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
>         at java.lang.Thread.run(Thread.java:748)
> {noformat}
> The resulting JobManager log file was 4.7 GB in size. The first 5000 lines of the log file are attached.



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
