flink-user mailing list archives

From domi...@dbruhn.de
Subject Tooling for resuming from checkpoints
Date Wed, 22 Nov 2017 09:41:23 GMT
Hey,
we are running Flink 1.3.2 with streaming jobs and run into issues 
whenever we restart a complete job (which can happen for various 
reasons: upgrading the job, restarting the cluster, failures). The 
problem is that there is no automated way to find out which checkpoint 
metadata file (i.e. which externalized checkpoint) we should resume 
from. We can always end up with several of those files, and we then 
want to use the most recent one that was written successfully.
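
To make the selection concrete, here is a minimal sketch of the kind of 
helper we have in mind. It is not an existing Flink tool. It assumes the 
metadata files sit in one locally readable directory and share the name 
prefix `checkpoint_metadata-` (both are assumptions, and the function 
name is made up for illustration). "Non-empty" is only a crude stand-in 
for "successfully written"; it does not prove that Flink can actually 
restore from the file.

```python
from pathlib import Path
from typing import Optional


def latest_checkpoint_metadata(
    directory: str, prefix: str = "checkpoint_metadata-"
) -> Optional[Path]:
    """Return the newest non-empty file whose name starts with `prefix`.

    Hypothetical helper: the prefix and the "non-empty" check are
    assumptions, not something Flink guarantees.
    """
    candidates = [
        p
        for p in Path(directory).iterdir()
        if p.is_file() and p.name.startswith(prefix) and p.stat().st_size > 0
    ]
    if not candidates:
        return None
    # Pick the candidate with the most recent modification time.
    return max(candidates, key=lambda p: p.stat().st_mtime)
```

The returned path could then be handed to `flink run -s <path>` when 
resubmitting the job, but the real gap remains a proper validity check.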

Is there any tooling available already that picks the latest good 
checkpoint? Or at least a command-line tool we can use to check that a 
checkpoint is valid, so we can pick the latest valid one ourselves?

How are others handling this? Manually?

I would be happy to get some input on this,
Dominik
