flink-user mailing list archives

From Shai Kaplan <Shai.Kap...@microsoft.com>
Subject Flink checkpointing gets stuck
Date Tue, 21 Feb 2017 13:47:14 GMT
Hi.
I'm running a Flink 1.2 job with a 10-second checkpoint interval. After running for some time
(minutes to hours), Flink fails to save checkpoints and stops processing records (I'm not sure
whether the checkpointing failure is the cause of the problem or just a symptom).
After several checkpoints that each take a few seconds, they start failing due to the 30-minute
timeout.
When I restart one of the Task Manager services (just to get the job restarted), the job
recovers from the last successful checkpoint (the state size continues to grow, so it's probably
not the reason for the failure), advances somewhat, saves some more checkpoints, and then
enters the failing state again.
On one occasion, the first failed checkpoint failed with "Checkpoint Coordinator
is suspending.", which might point to the cause of the problem, but looking at
Flink's code I can't see how a running job could get into this state.
I am using RocksDB as the state backend, and the state is saved to Azure Blob Storage using the NativeAzureFileSystem
HDFS connector over the wasbs protocol.
Any ideas? Possibly a bug in Flink or RocksDB?
