flink-user mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: JobManager trying to re-submit jobs after failover
Date Wed, 27 Jul 2016 12:21:11 GMT
Which version of Flink are you running? I think this might have
been fixed for the 1.1 release
(http://people.apache.org/~uce/flink-1.1.0-rc1/).

It looks like the ExecutionGraph is still trying to restart although
the JobManager is not the leader anymore. If you could provide the
complete logs of both JobManagers, that would help confirm what is
happening.
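
As a quick check before collecting the full logs, one could count the job status updates that the old JobManager discards because it holds no leader session ID. This is only a minimal sketch, assuming the log format shown in the quoted message below; the function name and the sample text are illustrative and not part of Flink.

```python
import re
from collections import Counter

# Matches the WARN entries from the quoted log: a job status update that the
# JobManager discards because its expected leader session ID is "None".
DISCARDED = re.compile(
    r"Discard message LeaderSessionMessage\(.*?"
    r"Job execution switched to status (\w+)\.\) because the expected "
    r"leader session ID None did not equal"
)

def discarded_status_updates(log_text):
    """Count discarded job status updates, grouped by job status."""
    # Collapse whitespace so entries wrapped over several lines still match.
    flat = " ".join(log_text.split())
    return Counter(DISCARDED.findall(flat))

# Sample text copied from the quoted log (illustrative input only).
sample = """
2016-07-27 20:56:09,218 WARN org.apache.flink.runtime.jobmanager.JobManager -
Discard message LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
20:56:09     Job execution switched to status RESTARTING.) because the
expected leader session ID None did not equal the received leader
session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
2016-07-27 20:56:19,219 INFO
org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
- Found 1 checkpoints in ZooKeeper.
"""
counts = discarded_status_updates(sample)
print(counts)
```

A count that keeps growing on a JobManager that is no longer the leader would be consistent with the behavior described above.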

You can recover from this by restarting the respective JobManager
process (run "jobmanager.sh stop" on that machine, then start it
again with "jobmanager.sh start cluster").
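
If the restart needs to be scripted, the two commands could be wrapped roughly as follows. This is a sketch under assumptions: the install path /opt/flink/bin and the helper name are hypothetical, and only the two jobmanager.sh invocations come from this message. By default it prints the commands instead of running them.

```python
import subprocess
from pathlib import Path

def restart_jobmanager(flink_bin, dry_run=True):
    """Stop and then start a standalone JobManager via jobmanager.sh."""
    script = str(Path(flink_bin) / "jobmanager.sh")
    commands = [
        [script, "stop"],
        [script, "start", "cluster"],
    ]
    for cmd in commands:
        if dry_run:
            # Only show what would be executed.
            print("would run:", " ".join(cmd))
        else:
            # check=True raises CalledProcessError if the script fails.
            subprocess.run(cmd, check=True)
    return commands

# Hypothetical install location; adjust to the actual Flink bin directory.
planned = restart_jobmanager("/opt/flink/bin")
```

Run it on the machine that hosts the stale JobManager, and pass dry_run=False only once the printed commands look right.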

– Ufuk

On Wed, Jul 27, 2016 at 2:00 PM, Hironori Ogibayashi
<ogibayashi@gmail.com> wrote:
> Hello,
>
> I have a standalone Flink cluster with JobManager HA.
> Last night, the JobManager failed over because of a connection timeout to
> ZooKeeper.
> The job is running successfully under the new leader JobManager, but when
> I look at the old leader JobManager's log, it is still trying to re-submit
> the job and getting errors (for almost 24 hours now).
>
> Here is the log.
>
> -----
> 2016-07-27 20:56:09,218 WARN
> org.apache.flink.runtime.jobmanager.JobManager                -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:09     Job execution switched to status RESTARTING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,218 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> - Recovering checkpoints from ZooKeeper.
> 2016-07-27 20:56:19,218 WARN
> org.apache.flink.runtime.jobmanager.JobManager                -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19     Job execution switched to status CREATED.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,219 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> - Found 1 checkpoints in ZooKeeper.
> 2016-07-27 20:56:19,221 INFO
> org.apache.flink.runtime.checkpoint.ZooKeeperCompletedCheckpointStore
> - Initialized with Checkpoint 40229 @ 1469620528216 for
> 978ef000cca5a3aa6f3461274102f82c. Removing all older checkpoints.
> 2016-07-27 20:56:19,222 WARN
> org.apache.flink.runtime.jobmanager.JobManager                -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19     Job execution switched to status RUNNING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,222 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
> switched from CREATED to SCHEDULED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> Source: Custom Source (1/3) (bbdf55db0c19cc881c188bc6925929d0)
> switched from SCHEDULED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> Source: Custom Source (2/3) (4c795c671ec7b548b5faac5b141c331c)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 WARN
> org.apache.flink.runtime.jobmanager.JobManager                -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19     Job execution switched to status FAILING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> Source: Custom Source (3/3) (fce3b243e5b25041aafabbd93a266dbc)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> Source: Custom Source (1/3) (e1e5154f506901539e12b0fe8c140503)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> Source: Custom Source (2/3) (f95eb0ad8fcc50e6bb9046e8700e8778)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> Source: Custom Source (3/3) (0e30de47933282533cf6dda3a22e7ddc)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
> Map (1/3) (ea260b7740d4ac8262c6500429b0ee6b) switched from CREATED to
> CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
> Map (2/3) (cc5ab4fc296238d32078d2b4a8eb5062) switched from CREATED to
> CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        - Flat
> Map (3/3) (9694ae32fc12ec416197308f6a8cb3c1) switched from CREATED to
> CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> TriggerWindow(GlobalWindows(),
> FoldingStateDescriptor{name=window-contents,
> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
> Filter -> Sink: Unnamed (1/3) (9c6b27873b6ddec58ce3f82f62632152)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> TriggerWindow(GlobalWindows(),
> FoldingStateDescriptor{name=window-contents,
> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
> Filter -> Sink: Unnamed (2/3) (47442827157e04f7e1936ec1d5c876e9)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.ExecutionGraph        -
> TriggerWindow(GlobalWindows(),
> FoldingStateDescriptor{name=window-contents,
> defaultValue=ViewerCountHll(0,0,,com.clearspring.analytics.stream.cardinality.HyperLogLogPlus@1),
> serializer=null}, LiveContinuousProcessingTimeTriggerGlobal(10000),
> WindowedStream.fold(WindowedStream.java:207)) -> Filter -> Map ->
> Filter -> Sink: Unnamed (3/3) (a1436ef922932ffbb38f5c23684a43ec)
> switched from CREATED to CANCELED
> 2016-07-27 20:56:19,223 INFO
> org.apache.flink.runtime.executiongraph.restart.FixedDelayRestartStrategy
>  - Delaying retry of job execution for 10000 ms ...
> 2016-07-27 20:56:19,223 WARN
> org.apache.flink.runtime.jobmanager.JobManager                -
> Discard message
> LeaderSessionMessage(54757d58-64d0-4118-a4d3-5f089287f1e4,07/27/2016
> 20:56:19     Job execution switched to status RESTARTING.) because the
> expected leader session ID None did not equal the received leader
> session ID Some(54757d58-64d0-4118-a4d3-5f089287f1e4).
> ----
>
> Could anyone advise me why this happens and how I can recover from
> this situation? (restart JobManager?)
>
> Regards,
> Hironori Ogibayashi
