flink-issues mailing list archives

From "Stephan Ewen (JIRA)" <j...@apache.org>
Subject [jira] [Resolved] (FLINK-7216) ExecutionGraph can perform concurrent global restarts to scheduling
Date Sun, 23 Jul 2017 14:27:00 GMT

     [ https://issues.apache.org/jira/browse/FLINK-7216?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Stephan Ewen resolved FLINK-7216.
---------------------------------
      Resolution: Fixed
    Release Note: 
Fixed in 
  - 1.4.0 via 74a6cbab4e736cdb353d100cdd29f51809325796
  - 1.3.2 via e6348fbde1fc0ee8ea682063a4d6503ba3b68864

> ExecutionGraph can perform concurrent global restarts to scheduling
> -------------------------------------------------------------------
>
>                 Key: FLINK-7216
>                 URL: https://issues.apache.org/jira/browse/FLINK-7216
>             Project: Flink
>          Issue Type: Bug
>          Components: Distributed Coordination
>    Affects Versions: 1.2.1, 1.3.1
>            Reporter: Stephan Ewen
>            Assignee: Stephan Ewen
>            Priority: Blocker
>             Fix For: 1.4.0, 1.3.2
>
>
> Because ExecutionGraph restarts happen asynchronously and possibly with a delay, two restarts can, in rare corner cases, be attempted concurrently. In that case, some structures of the ExecutionGraph are accessed concurrently.
> Sample stack trace:
> {code}
> WARN  org.apache.flink.runtime.executiongraph.ExecutionGraph        - Failed to restart the job.
> java.lang.IllegalStateException: SlotSharingGroup cannot clear task assignment, group still has allocated resources.
>     at org.apache.flink.runtime.jobmanager.scheduler.SlotSharingGroup.clearTaskAssignment(SlotSharingGroup.java:78)
>     at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:535)
>     at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:1151)
>     at org.apache.flink.runtime.executiongraph.restart.ExecutionGraphRestarter$1.call(ExecutionGraphRestarter.java:40)
>     at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:95)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
>     at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
>     at java.util.concurrent.Executors$RunnableAdapter.call(Executors.java:511)
>     at java.util.concurrent.FutureTask.run(FutureTask.java:266)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.access$201(ScheduledThreadPoolExecutor.java:180)
>     at java.util.concurrent.ScheduledThreadPoolExecutor$ScheduledFutureTask.run(ScheduledThreadPoolExecutor.java:293)
>     at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1142)
>     at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:617)
>     at java.lang.Thread.run(Thread.java:748)
> {code}
> The solution is to strictly guard against "subsumed" restarts via the {{globalModVersion}}, similar to how local restarts are already fenced against global restarts.
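
The fencing idea can be illustrated with a minimal sketch. This is not Flink's actual code: the class FencedRestartSketch and its methods are hypothetical stand-ins, and only the name globalModVersion comes from the issue text. The assumption is that every global failure bumps a version counter, that each delayed restart carries the version it was scheduled for, and that a restart whose version is no longer current has been subsumed and is skipped.

```java
/**
 * Hypothetical stand-in for the ExecutionGraph, used only to illustrate
 * version-based fencing. It is not Flink's implementation.
 */
final class FencedRestartSketch {

    // Plays the role of globalModVersion: bumped on every global failure.
    private long globalModVersion = 0;

    // Counts restarts that actually ran (for illustration only).
    private int restartsPerformed = 0;

    /** Records a global failure and returns the version the scheduled restart must present. */
    synchronized long failGlobal() {
        globalModVersion++;
        return globalModVersion;
    }

    /**
     * Runs the restart only if no later global failure has occurred since it
     * was scheduled. A stale restart is considered subsumed and does nothing.
     */
    synchronized boolean restart(long expectedGlobalModVersion) {
        if (expectedGlobalModVersion != globalModVersion) {
            return false; // subsumed by a more recent global failure and its restart
        }
        restartsPerformed++; // resetting vertices and rescheduling would happen here
        return true;
    }

    synchronized int getRestartsPerformed() {
        return restartsPerformed;
    }
}
```

In this sketch, two failures in quick succession schedule two delayed restarts, but only the one carrying the latest version performs any work, so the shared structures are reset by at most one restart per version. Holding the same lock for the version check and the reset is a simplification of the sketch, not a statement about how Flink synchronizes this path.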



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
