flink-dev mailing list archives

From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Created] (FLINK-7216) ExecutionGraph can perform concurrent global restarts to scheduling
Date Mon, 17 Jul 2017 19:34:00 GMT
Stephan Ewen created FLINK-7216:
-----------------------------------

             Summary: ExecutionGraph can perform concurrent global restarts to scheduling
                 Key: FLINK-7216
                 URL: https://issues.apache.org/jira/browse/FLINK-7216
             Project: Flink
          Issue Type: Bug
          Components: Distributed Coordination
    Affects Versions: 1.3.1, 1.2.1
            Reporter: Stephan Ewen
            Assignee: Stephan Ewen
            Priority: Blocker
             Fix For: 1.4.0, 1.3.2


Because ExecutionGraph restarts happen asynchronously and may be delayed, two restarts can,
in rare corner cases, be attempted concurrently. When that happens, some structures on the
ExecutionGraph are accessed concurrently.

Sample stack trace:
{code}
WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to restart the job.
java.lang.IllegalStateException: SlotSharingGroup cannot clear task assignment, group still has allocated resources.
    at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup.clearTaskAssignment(SlotSharingGroup.java:78)
    at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:535)
    at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1151)
    at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestarter$1.call(ExecutionGraphRestarter.java:40)
    at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:95)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
    at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
    at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
    at java.util.concurrent.FutureTask.run(FutureTask.java:266)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
    at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
    at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
    at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
    at java.lang.Thread.run(Thread.java:748)
{code}

The solution is to strictly guard against "subsumed" restarts via the {{globalModVersion}},
in the same way that we fence local restarts against global restarts.
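
The fencing idea can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not Flink's actual code: the class {{FencedRestartSketch}} and its methods {{failGlobal}}, {{restart}} and {{restartsPerformed}} are hypothetical stand-ins, and only the name {{globalModVersion}} comes from this issue. Each global failure bumps a version counter, each delayed restart remembers the version it was scheduled for, and a restart that finds a newer version treats itself as subsumed and does nothing.

```java
import java.util.concurrent.atomic.AtomicLong;

/** Hypothetical, simplified stand-in for the ExecutionGraph; not Flink's real class. */
final class FencedRestartSketch {

    // Incremented on every global failure; plays the role of globalModVersion.
    private final AtomicLong globalModVersion = new AtomicLong(0);

    // Counts restarts that actually ran (illustration only).
    private int restartsPerformed = 0;

    /** Records a global failure and returns the version its restart must present later. */
    long failGlobal() {
        return globalModVersion.incrementAndGet();
    }

    /**
     * Runs the restart only if no newer global failure happened in the meantime.
     * The method is synchronized so that the version check and the reset
     * cannot interleave with another restart attempt.
     */
    synchronized boolean restart(long expectedGlobalModVersion) {
        if (globalModVersion.get() != expectedGlobalModVersion) {
            // A later global failure subsumed this restart; its own restart will follow.
            return false;
        }
        // Resetting vertices and rescheduling would happen here.
        restartsPerformed++;
        return true;
    }

    synchronized int restartsPerformed() {
        return restartsPerformed;
    }

    public static void main(String[] args) {
        FencedRestartSketch graph = new FencedRestartSketch();
        long first = graph.failGlobal();   // first failure schedules a delayed restart
        long second = graph.failGlobal();  // second failure arrives before that restart runs
        System.out.println("stale restart ran: " + graph.restart(first));
        System.out.println("current restart ran: " + graph.restart(second));
    }
}
```

In this sketch the restart scheduled for the first failure is skipped and only the one for the second failure runs, so the reset logic is never entered by two restarts at once.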



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
