flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Yassine MARZOUGUI <y.marzou...@mindlytix.com>
Subject Re: Triggering a saveppoint failed the Job
Date Thu, 05 Jan 2017 21:40:23 GMT
Hi Stephan,

Thank you for creating the JIRA issue, I attached a job reproducing the bug
in the issue page and commented it.

Best,
Yassine

2017-01-04 12:55 GMT+01:00 Stephan Ewen <sewen@apache.org>:

> Hi!
>
> Thanks for reporting this.
>
> I created a JIRA issue for it: https://issues.apache.org/
> jira/browse/FLINK-5407
>
> We'll look into it as part of the 1.2 release testing. If you have any
> more details that may help diagnose/fix that, would be great if you could
> share them with us.
>
> Thanks,
> Stephan
>
>
> On Wed, Jan 4, 2017 at 10:52 AM, Yassine MARZOUGUI <
> y.marzougui@mindlytix.com> wrote:
>
>> Hi all,
>>
>> I tried to trigger a savepoint for a streaming job, both the savepoint
>> and the job failed.
>>
>> The job failed with the following exception:
>>
>> java.lang.RuntimeException: Error while triggering checkpoint for IterationSource-7
(1/1)
>> 	at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1026)
>> 	at java.util.concurrent.Executors$RunnableAdapter.call(Unknown Source)
>> 	at java.util.concurrent.FutureTask.run(Unknown Source)
>> 	at java.util.concurrent.ThreadPoolExecutor.runWorker(Unknown Source)
>> 	at java.util.concurrent.ThreadPoolExecutor$Worker.run(Unknown Source)
>> 	at java.lang.Thread.run(Unknown Source)
>> Caused by: java.lang.NullPointerException
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.createOperatorIdentifier(StreamTask.java:767)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.access$500(StreamTask.java:115)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.createStreamFactory(StreamTask.java:986)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask$CheckpointingOperation.executeCheckpointing(StreamTask.java:956)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.checkpointState(StreamTask.java:583)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.performCheckpoint(StreamTask.java:551)
>> 	at org.apache.flink.streaming.runtime.tasks.StreamTask.triggerCheckpoint(StreamTask.java:511)
>> 	at org.apache.flink.runtime.taskmanager.Task$3.run(Task.java:1019)
>> 	... 5 more
>>
>>
>> And the savepoint failed with the following exception:
>>
>> Using address /127.0.0.1:6123 to connect to JobManager.
>> Triggering savepoint for job 153310c4a836a92ce69151757c6b73f1.
>> Waiting for response...
>>
>> ------------------------------------------------------------
>>  The program finished with the following exception:
>>
>> java.lang.Exception: Failed to complete savepoint
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$
>> handleMessage$1$$anon$7.apply(JobManager.scala:793)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$
>> handleMessage$1$$anon$7.apply(JobManager.scala:782)
>>         at org.apache.flink.runtime.concurrent.impl.FlinkFuture$6.recov
>> er(FlinkFuture.java:263)
>>         at akka.dispatch.Recover.internal(Future.scala:267)
>>         at akka.dispatch.japi$RecoverBridge.apply(Future.scala:183)
>>         at akka.dispatch.japi$RecoverBridge.apply(Future.scala:181)
>>         at scala.util.Failure$$anonfun$recover$1.apply(Try.scala:185)
>>         at scala.util.Try$.apply(Try.scala:161)
>>         at scala.util.Failure.recover(Try.scala:185)
>>         at scala.concurrent.Future$$anonfun$recover$1.apply(Future.
>> scala:324)
>>         at scala.concurrent.Future$$anonfun$recover$1.apply(Future.
>> scala:324)
>>         at scala.concurrent.impl.CallbackRunnable.run(Promise.scala:32)
>>         at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.
>> processBatch$1(BatchingExecutor.scala:67)
>>         at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply$
>> mcV$sp(BatchingExecutor.scala:82)
>>         at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(Ba
>> tchingExecutor.scala:59)
>>         at akka.dispatch.BatchingExecutor$Batch$$anonfun$run$1.apply(Ba
>> tchingExecutor.scala:59)
>>         at scala.concurrent.BlockContext$.withBlockContext(BlockContext
>> .scala:72)
>>         at akka.dispatch.BatchingExecutor$Batch.run(BatchingExecutor.
>> scala:58)
>>         at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.
>> exec(AbstractDispatcher.scala:401)
>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.
>> java:260)
>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(
>> ForkJoinPool.java:1339)
>>         at scala.concurrent.forkjoin.ForkJoinPool.runWorker(ForkJoinPoo
>> l.java:1979)
>>         at scala.concurrent.forkjoin.ForkJoinWorkerThread.run(ForkJoinW
>> orkerThread.java:107)
>> Caused by: java.lang.Exception: Checkpoint failed: Checkpoint Coordinator
>> is shutting down
>>         at org.apache.flink.runtime.checkpoint.PendingCheckpoint.abortE
>> rror(PendingCheckpoint.java:338)
>>         at org.apache.flink.runtime.checkpoint.CheckpointCoordinator.sh
>> utdown(CheckpointCoordinator.java:245)
>>         at org.apache.flink.runtime.executiongraph.ExecutionGraph.postR
>> unCleanup(ExecutionGraph.java:1065)
>>         at org.apache.flink.runtime.executiongraph.ExecutionGraph.jobVe
>> rtexInFinalState(ExecutionGraph.java:1034)
>>         at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.s
>> ubtaskInFinalState(ExecutionJobVertex.java:435)
>>         at org.apache.flink.runtime.executiongraph.ExecutionJobVertex.v
>> ertexCancelled(ExecutionJobVertex.java:407)
>>         at org.apache.flink.runtime.executiongraph.ExecutionVertex.exec
>> utionCanceled(ExecutionVertex.java:593)
>>         at org.apache.flink.runtime.executiongraph.Execution.cancelingC
>> omplete(Execution.java:729)
>>         at org.apache.flink.runtime.executiongraph.ExecutionGraph.updat
>> eState(ExecutionGraph.java:1105)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$
>> handleMessage$1$$anonfun$applyOrElse$5.apply$mcV$sp(JobManager.scala:687)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$
>> handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:686)
>>         at org.apache.flink.runtime.jobmanager.JobManager$$anonfun$
>> handleMessage$1$$anonfun$applyOrElse$5.apply(JobManager.scala:686)
>>         at scala.concurrent.impl.Future$PromiseCompletingRunnable.lifte
>> dTree1$1(Future.scala:24)
>>         at scala.concurrent.impl.Future$PromiseCompletingRunnable.run(F
>> uture.scala:24)
>>         at akka.dispatch.TaskInvocation.run(AbstractDispatcher.scala:41)
>>         at akka.dispatch.ForkJoinExecutorConfigurator$AkkaForkJoinTask.
>> exec(AbstractDispatcher.scala:401)
>>         at scala.concurrent.forkjoin.ForkJoinTask.doExec(ForkJoinTask.
>> java:260)
>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.pollAndExec
>> All(ForkJoinPool.java:1253)
>>         at scala.concurrent.forkjoin.ForkJoinPool$WorkQueue.runTask(
>> ForkJoinPool.java:1346)
>>         ... 2 more
>> Caused by: java.lang.Exception: Checkpoint Coordinator is shutting down
>>         ... 20 more
>>
>> Maybe worth mentionning : the iteration body contains MapFunction and its
>> thread was in a sleep state (put manually) during the savepoint.
>> I'm using Flink 1.2-SNAPSHOT (Commit: 5b54009) on Windows.
>>
>> Any idea why this happened? Thank you.
>>
>> Best,
>> Yassine
>>
>>
>

Mime
View raw message