flink-user mailing list archives

From Stefan Richter <s.rich...@data-artisans.com>
Subject Re: difference between checkpoints & savepoints
Date Thu, 10 Aug 2017 16:38:41 GMT
I think most of the things you are asking for are already there: you can configure the checkpoint
interval and externalized checkpoints, the state backend, and the location for savepoints and
externalized checkpoints. You can restart from savepoints and externalized checkpoints from the CLI.
One point I am not entirely sure about is automatic checkpoints or savepoints when a job is shut
down. IIRC, this is either already available, or in the making.
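
For illustration, the backend and the storage locations could be set in flink-conf.yaml roughly
as follows. This is only a sketch: the paths are hypothetical, and key names and defaults differ
between Flink versions, so please check the configuration reference for your release. The
checkpoint interval itself is typically set in the job code via
StreamExecutionEnvironment#enableCheckpointing.

```yaml
# Sketch only: paths are hypothetical, key names may differ by Flink version.
state.backend: rocksdb
state.checkpoints.dir: hdfs:///flink/ext-checkpoints
state.savepoints.dir: hdfs:///flink/savepoints
# Keep only the most recent completed checkpoint.
state.checkpoints.num-retained: 1
```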

Resolving the last externalized checkpoint is as easy as listing the configured directory, especially
if you only retain the last one. Otherwise, the timestamp gives the required information. It
is true that there could also be a CLI option that automatically does the work to pick the
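
To make the "list the directory and look at the timestamp" idea concrete, here is a minimal
sketch in plain Java. It is not part of Flink: the class and method names are made up for this
example, it assumes the checkpoint directory is on a local or mounted file system (for HDFS or S3
you would need the corresponding file system client), and it assumes the most recently modified
entry is the latest checkpoint.

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.attribute.FileTime;
import java.util.Comparator;
import java.util.Optional;
import java.util.stream.Stream;

// Hypothetical helper, not a Flink class.
class LatestCheckpoint {

    /** Returns the most recently modified entry directly under dir, if there is one. */
    static Optional<Path> latestEntry(Path dir) throws IOException {
        try (Stream<Path> entries = Files.list(dir)) {
            return entries.max(Comparator.comparing(LatestCheckpoint::modifiedTime));
        }
    }

    // Wraps the checked exception so the method can be used in a comparator.
    private static FileTime modifiedTime(Path p) {
        try {
            return Files.getLastModifiedTime(p);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) throws IOException {
        // The directory is passed as the first argument; defaults to the working directory.
        Path dir = Paths.get(args.length > 0 ? args[0] : ".");
        System.out.println(latestEntry(dir).map(Path::toString).orElse("no entries found"));
    }
}
```

If you only retain one checkpoint, the directory holds a single candidate and the timestamp
comparison becomes a formality.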

And there is a command-line switch to supply savepoints and externalized checkpoints
for restarts. I think that makes more sense than a general configuration of automatic restart
behaviour, because the user might also intend to start a new, clean run of the job.
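
As a small sketch of why an explicit switch is convenient, the following plain Java helper
assembles a `flink run` invocation that passes `-s <path>` only when the caller wants to restore,
and otherwise starts a clean run. The class and method names are hypothetical, the jar and path
values are placeholders, and it assumes the `flink` script is on the PATH.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

// Hypothetical helper, not a Flink class.
class RestartCommand {

    /**
     * Builds the argument list for "flink run". When restorePath is present, the job is
     * started from that savepoint or externalized checkpoint via -s; otherwise it starts clean.
     */
    static List<String> buildRunCommand(Optional<String> restorePath, String jarPath) {
        List<String> cmd = new ArrayList<>();
        cmd.add("flink");
        cmd.add("run");
        restorePath.ifPresent(path -> {
            cmd.add("-s");
            cmd.add(path);
        });
        cmd.add(jarPath);
        return cmd;
    }

    public static void main(String[] args) {
        // Placeholder values for illustration only.
        System.out.println(String.join(" ",
                buildRunCommand(Optional.of("/path/to/checkpoint"), "my-job.jar")));
        System.out.println(String.join(" ",
                buildRunCommand(Optional.empty(), "my-job.jar")));
    }
}
```

The resulting list could be handed to a `ProcessBuilder` or simply printed for a deployment script.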

> On 10.08.2017 at 15:45, Henri Heiskanen <henri.heiskanen@gmail.com> wrote:
> Hi,
> But I still need to resolve the latest checkpoint and pass that as an argument. My question
is still why all of this cannot be handled by Flink core. Why not just have parameters to
enable savepoints and to set the location of savepoints and the state backend, so that the system would then
automatically do checkpoints / savepoints on exit and also start from the first available checkpoint?
> Br,
> Henkka
> On Thu, Aug 10, 2017 at 3:15 PM, Stefan Richter <s.richter@data-artisans.com> wrote:
> Hi,
> but I think this is exactly the case for externalized checkpoints. Periodic savepoints
are problematic because their lifecycle is meant to be under the control of the user, and
Flink cannot make any assumptions about when they can be dropped. So in the conservative scenario,
savepoints would quickly pile up. With externalized checkpoints, you can control the number
of retained checkpoints. If you set this number to one, that should be exactly what you want.
> As for rescalability, this limitation is more of a future than a current problem. Right
now, you should be able to rescale from all externalized checkpoints. But this might not hold
in the future, because checkpoints can be optimized in some cases if this feature is dropped.
> Right now, externalized checkpoints should offer all that you want.
> Best,
> Stefan
>> On 10.08.2017 at 11:46, Henri Heiskanen <henri.heiskanen@gmail.com> wrote:
>> Hi,
>> It would be super helpful if Flink provided out-of-the-box functionality for
writing automatic savepoints and then starting from the latest savepoint. If externalized checkpoints
supported rescaling, the first requirement would be met, but one would still need to, e.g., find
the latest checkpoint in some folder and pass that as an argument. We are currently writing
our own functionality for this. Why not just tell Flink that this job uses persistent state,
so that the default behavior is to start from the latest snapshot?
>> Br,
>> Henri H
>> On Thu, Aug 10, 2017 at 11:20 AM, Stefan Richter <s.richter@data-artisans.com> wrote:
>> Hi,
>> I would explain the main conceptual difference as follows:
>> - Checkpoints are periodically triggered by the system for fault tolerance. They
are used to automatically recover from failures. Because of their automatic and periodic
nature, they should be lightweight to produce and will restore the same job without any changes
to the job graph, parallelism, etc. Checkpoints are usually dropped after the job is terminated
by the user.
>> - Savepoints are triggered by the user to store the state of the job for a manual
resume and backup. Savepoints are usually not periodic but are typically taken before some user
action on the job or the system. For example, this could be an update of your Flink version,
changing your job graph, changing parallelism, forking a second job like for a red/blue deployment,
and so on. Of course, savepoints must survive job termination. Conceptually, savepoints can
be a bit more expensive to produce, because they should have a format that makes all those
"changes to the job" features possible.
>> Besides this conceptual difference, the current implementations basically use
the same code and produce the same "format". However, there is currently one exception to
this, and I would expect more differences in the future. This exception is incremental checkpoints
with the RocksDB state backend. They use a RocksDB-internal format instead of Flink's
"savepoint format". This makes them the first instance of a more lightweight checkpointing
mechanism, compared to savepoints, at the cost of dropping support for certain features such
as changing the parallelism.
>> Furthermore, there also exist "externalized checkpoints", which are somewhere
in between checkpoints and savepoints. They are triggered by Flink, but can survive job termination
and can then be used by the user to restart the job, similar to savepoints. They use the checkpointing
code path, so there are, for example, externalized incremental checkpoints. However, exactly
like normal checkpoints, they might also lack certain features like rescalability.
>> Best,
>> Stefan
>>> On 10.08.2017 at 05:32, Raja.Aravapalli <Raja.Aravapalli@target.com> wrote:
>>> Hi,
>>> Can someone please help me understand the difference between Flink's checkpoints
and savepoints?
>>> I read the documentation but couldn't understand the difference! :s
>>> Thanks a lot. 
>>> Regards,
>>> Raja.
