flink-user mailing list archives

From Nico Kruber <n...@data-artisans.com>
Subject Re: Checkpoint expired before completing
Date Fri, 01 Dec 2017 14:17:05 GMT
Hi Steven,
By default, checkpoints time out after 10 minutes unless you change the
timeout with CheckpointConfig#setCheckpointTimeout().
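
For illustration, here is a minimal sketch of raising that timeout. The
class name, the 1-minute checkpoint interval, and the 20-minute timeout are
arbitrary assumptions for the example, not values taken from this thread
and not recommendations:

```java
import org.apache.flink.streaming.api.environment.CheckpointConfig;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

// Hypothetical job skeleton; only the checkpoint settings matter here.
public class CheckpointTimeoutExample {
    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Trigger a checkpoint every 60 seconds (assumed interval, in ms).
        env.enableCheckpointing(60_000L);

        // Raise the checkpoint timeout from the 10-minute default
        // to 20 minutes (assumed value, in ms).
        CheckpointConfig config = env.getCheckpointConfig();
        config.setCheckpointTimeout(20 * 60 * 1000L);

        // Define sources, operators, and sinks here, then call env.execute(...).
    }
}
```

A longer timeout only makes sense if the job can tolerate a checkpoint
staying in flight for that long.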

Depending on your checkpoint interval and the number of concurrent
checkpoints you allow, other checkpoints may already be in progress
while you are waiting for the first one to finish. In that case,
subsequent checkpoints may also fail with a timeout. However, they
should get back to normal once your sink has caught up with all
buffered events.
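
To make the reasoning above concrete, here is a toy model in plain Java. It
does not use Flink and does not reproduce the CheckpointCoordinator. It
assumes checkpoints are triggered at a fixed interval with no limit on
concurrency, that a checkpoint overlapping the sink stall can only finish
once the stall ends, and that every name and number in it is hypothetical:

```java
// Toy model (not Flink code): counts checkpoints that would exceed the
// timeout when a blocking sink stalls the pipeline for a while.
public class CheckpointExpiryModel {

    // All arguments are in seconds. Checkpoints are triggered at
    // 0, interval, 2 * interval, ... strictly before the horizon.
    static int countExpired(long intervalSec, long timeoutSec,
                            long normalDurationSec, long stallStartSec,
                            long stallEndSec, long horizonSec) {
        int expired = 0;
        for (long t = 0; t < horizonSec; t += intervalSec) {
            boolean finishesBeforeStall = t + normalDurationSec <= stallStartSec;
            boolean startsAfterStall = t >= stallEndSec;
            // Assumption: a checkpoint that overlaps the stall has to wait
            // until the stall ends and then needs its normal duration.
            long completion = (finishesBeforeStall || startsAfterStall)
                    ? t + normalDurationSec
                    : stallEndSec + normalDurationSec;
            if (completion - t > timeoutSec) {
                expired++;
            }
        }
        return expired;
    }

    public static void main(String[] args) {
        // 5-minute interval, 10-minute timeout, 10-second normal duration,
        // 15-minute stall starting at time 0, 30-minute horizon.
        System.out.println(countExpired(300, 600, 10, 0, 900, 1800)); // prints 2
    }
}
```

Under these assumptions, the checkpoints triggered at 0 and 5 minutes
expire, while the one triggered at 10 minutes and all later ones complete.
That is the "back to normal" behaviour expected once the stall is over; the
model does not explain failures that persist long after it.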

I have included Stefan, who may be able to shed some more light on this,
but maybe you can help us identify the problem by providing logs at
DEBUG level (did Akka report any connection loss and gated actors? or
maybe some other error in there?) or even a minimal program to reproduce it.


Nico

On 01/12/17 07:36, Steven Wu wrote:
> 
> org.apache.flink.runtime.checkpoint.CheckpointCoordinator - Checkpoint
> 9353 expired before completing
> 
> I might know why this happened in the first place. Our sink operator
> does a synchronous HTTP POST, which had a 15-minute latency spike when
> this all started. This could block Flink threads and prevent checkpoints
> from completing in time. But I don't understand why checkpoints continued
> to fail after HTTP POST latency returned to normal. There seems to be
> some lingering/cascading effect of previous failed checkpoints on future
> checkpoints. Only after I redeployed/restarted the job an hour later
> did checkpoints start to work again.
> 
> Would appreciate any suggestions/insights!
> 
> Thanks,
> Steven

