flink-issues mailing list archives

From "Ufuk Celebi (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-2133) Possible deadlock in ExecutionGraph
Date Tue, 02 Jun 2015 13:03:17 GMT

    [ https://issues.apache.org/jira/browse/FLINK-2133?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=14569062#comment-14569062 ]

Ufuk Celebi commented on FLINK-2133:
------------------------------------

I've looked at the ExecutionGraph and this seems to be a simple deadlock due to the ordering
of lock acquisitions.

Two tasks of the same JobVertex acquire the locks in the following order:
- T1 (ForkJoinPool-1-worker-3): ExecutionGraph#restart() acquires ExecutionGraph#progressLock
=> ExecutionJobVertex#resetForNewExecution() acquires ExecutionJobVertex#stateMonitor
- T2 (flink-akka.actor.default-dispatcher-4): ExecutionJobVertex#subtaskInFinalState() acquires
ExecutionJobVertex#stateMonitor to cancel task => ExecutionGraph#jobVertexInFinalState()
acquires ExecutionGraph#progressLock
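
To illustrate the two acquisition orders above, here is a minimal, self-contained sketch. It is
not the Flink code and not necessarily how this issue should be fixed. The class LockOrderSketch,
its counters and runBothPaths() are hypothetical; only the lock and method names are borrowed from
the report. The restart path keeps the graph-then-vertex order of T1, while the final-state path
releases the vertex lock before calling into the graph, so no thread ever waits for progressLock
while holding stateMonitor.

```java
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.TimeUnit;

// Hypothetical stand-in for ExecutionGraph + ExecutionJobVertex; not the Flink classes.
class LockOrderSketch {
    private final Object progressLock = new Object(); // plays ExecutionGraph#progressLock
    private final Object stateMonitor = new Object(); // plays ExecutionJobVertex#stateMonitor
    private int restarts = 0;                // written under both locks
    private int finalStateNotifications = 0; // written under progressLock

    // Same order as T1 in the report: progressLock first, then stateMonitor.
    void restart() {
        synchronized (progressLock) {
            synchronized (stateMonitor) {
                restarts++;
            }
        }
    }

    // T2's path, rearranged: decide under stateMonitor, release it,
    // and only then take progressLock. The two locks are never nested
    // in the stateMonitor -> progressLock direction.
    void subtaskInFinalState() {
        boolean notifyGraph;
        synchronized (stateMonitor) {
            notifyGraph = true; // placeholder for "all subtasks are in a final state"
        }
        if (notifyGraph) {
            synchronized (progressLock) {
                finalStateNotifications++;
            }
        }
    }

    // Runs both paths concurrently; returns true if both finished in time
    // and every call was counted.
    static boolean runBothPaths(int iterations) {
        LockOrderSketch graph = new LockOrderSketch();
        ExecutorService pool = Executors.newFixedThreadPool(2);
        pool.execute(() -> { for (int i = 0; i < iterations; i++) graph.restart(); });
        pool.execute(() -> { for (int i = 0; i < iterations; i++) graph.subtaskInFinalState(); });
        pool.shutdown();
        try {
            if (!pool.awaitTermination(10, TimeUnit.SECONDS)) {
                return false; // the workers did not finish, e.g. because they are stuck
            }
        } catch (InterruptedException e) {
            Thread.currentThread().interrupt();
            return false;
        }
        return graph.restarts == iterations && graph.finalStateNotifications == iterations;
    }

    public static void main(String[] args) {
        System.out.println("completed without deadlock: " + runBothPaths(100_000));
    }
}
```

If subtaskInFinalState() instead took progressLock while still inside the synchronized
(stateMonitor) block, the two threads could block each other exactly as in the thread dump quoted
below. Whether the state decision can safely be separated from the graph notification in the real
ExecutionJobVertex is an assumption of this sketch.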

I think that both messages have to be triggered by the same task, because both actions should
only happen for the final vertex. I think these are the cancel message (transition to canceling)
and the canceling-complete message (transition to canceled).

> Possible deadlock in ExecutionGraph
> -----------------------------------
>
>                 Key: FLINK-2133
>                 URL: https://issues.apache.org/jira/browse/FLINK-2133
>             Project: Flink
>          Issue Type: Bug
>            Reporter: Aljoscha Krettek
>
> I had the following output on Travis:
> {code}
> Found one Java-level deadlock:
> =============================
> "ForkJoinPool-1-worker-3":
>   waiting to lock monitor 0x00007f1c54af7eb8 (object 0x00000000d77fa8c0, a org.apache.flink.runtime.util.SerializableObject),
>   which is held by "flink-akka.actor.default-dispatcher-4"
> "flink-akka.actor.default-dispatcher-4":
>   waiting to lock monitor 0x00007f1c5486aca0 (object 0x00000000d77fa218, a org.apache.flink.runtime.util.SerializableObject),
>   which is held by "ForkJoinPool-1-worker-3"
> Java stack information for the threads listed above:
> ===================================================
> "ForkJoinPool-1-worker-3":
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.resetForNewExecution(ExecutionJobVertex.java:338)
> 	- waiting to lock <0x00000000d77fa8c0> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.restart(ExecutionGraph.java:595)
> 	- locked <0x00000000d77fa218> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph$3.call(ExecutionGraph.java:733)
> 	at akka.dispatch.Futures$$anonfun$future$1.apply(Future.scala:94)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 	at scala.concurrent.impl.ExecutionContextImpl$$anon$3.exec(ExecutionContextImpl.scala:107)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1339)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> "flink-akka.actor.default-dispatcher-4":
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.jobVertexInFinalState(ExecutionGraph.java:683)
> 	- waiting to lock <0x00000000d77fa218> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.subtaskInFinalState(ExecutionJobVertex.java:454)
> 	- locked <0x00000000d77fa8c0> (a org.apache.flink.runtime.util.SerializableObject)
> 	at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.vertexCancelled(ExecutionJobVertex.java:426)
> 	at org.apache.flink.runtime.executiongraph.ExecutionVertex.executionCanceled(ExecutionVertex.java:565)
> 	at org.apache.flink.runtime.executiongraph.Execution.cancelingComplete(Execution.java:653)
> 	at org.apache.flink.runtime.executiongraph.ExecutionGraph.updateState(ExecutionGraph.java:784)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply$mcV$sp(JobManager.scala:220)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
> 	at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$receiveWithLogMessages$1$$anonfun$applyOrElse$2.apply(JobManager.scala:219)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.liftedTree1$1(Future.scala:24)
> 	at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(Future.scala:24)
> 	at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
> 	at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.exec(AbstractDispatcher.scala:401)
> 	at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.java:260)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExecAll(ForkJoinPool.java:1253)
> 	at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(ForkJoinPool.java:1346)
> 	at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPool.java:1979)
> 	at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinWorkerThread.java:107)
> Found 1 deadlock.
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
