flink-user mailing list archives

From Stefan Richter <s.rich...@ververica.com>
Subject Re: Fast restart of a job with a large state
Date Thu, 18 Apr 2019 10:22:27 GMT

If rescaling is the problem, let me clarify that you can currently rescale from savepoints
and from all types of checkpoints (including incremental ones). If that was the only problem,
then there is nothing to worry about - the documentation is only a bit conservative about this
because we will not commit to an API guarantee that all future types of checkpoints will be
rescalable. But currently they all are, and this is also very unlikely to change anytime soon.
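To make this concrete, a restart with rescaling from a retained (externalized) checkpoint can be sketched with the Flink CLI roughly as follows; the checkpoint path, job id, and jar name are placeholders, and retaining the checkpoint on cancellation must have been enabled in the job's checkpoint config:

```shell
# Cancel the running job; with RETAIN_ON_CANCELLATION configured,
# the latest checkpoint stays in the distributed store (e.g. HDFS).
flink cancel <job-id>

# Resubmit from that retained checkpoint with -s, choosing a NEW
# parallelism with -p -- rescaling on restore.
flink run -s hdfs:///flink/checkpoints/<job-id>/chk-42 -p 16 my-job.jar
```

Since restoring from a retained incremental checkpoint avoids the full savepoint write on shutdown, this is typically much faster than the savepoint-based restart cycle for large state.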

Paul, just to comment on your suggestion as well: local recovery would only help with failover.
1) It does not help for restarts initiated by the user, and 2) it also does not work for
rescaling (2) is a consequence of 1), because failover never rescales, it only restarts).
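For reference, task-local recovery is switched on via the cluster configuration; a minimal flink-conf.yaml sketch (keys as documented for Flink 1.5+, the local directory path is a placeholder):

```yaml
# flink-conf.yaml -- keep a local copy of task state on each
# TaskManager so failover can restore from disk instead of HDFS.
state.backend.local-recovery: true

# Optional: where each TaskManager stores its local state copies.
taskmanager.state.local.root-dirs: /data/flink/local-recovery
```

As noted above, this speeds up failover only; user-initiated restarts and rescaling still go through the distributed store.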


> On 18. Apr 2019, at 12:07, Paul Lam <paullin3280@gmail.com> wrote:
> The URL in my previous mail is wrong, and it should be: 
> https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/large_state_tuning.html#task-local-recovery
> Best,
> Paul Lam
>> On 18 Apr 2019, at 18:04, Paul Lam <paullin3280@gmail.com <mailto:paullin3280@gmail.com>> wrote:
>> Hi,
>> Have you tried task local recovery [1]?
>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
>> Best,
>> Paul Lam
>>> On 17 Apr 2019, at 17:46, Sergey Zhemzhitsky <szh.subs@gmail.com <mailto:szh.subs@gmail.com>> wrote:
>>> Hi Flinkers,
>>> Operating various Flink jobs, I've discovered that restarts of jobs with
>>> a pretty large state (in my case up to 100GB+) take quite a
>>> lot of time. For example, to restart a job (e.g. to update it) a
>>> savepoint is created, and with savepoints all the state seems to
>>> be pushed into the distributed store (HDFS in my case) when stopping the
>>> job and pulled back when starting the new version of the
>>> job.
>>> What I've found so far while trying to speed up job restarts is:
>>> - using external retained checkpoints [1]; the drawback is that the
>>> job cannot be rescaled during restart
>>> - using external state and storage with stateless jobs; the
>>> drawback is the additional network hops to this storage.
>>> So I'm wondering whether there are any best practices the community
>>> knows and uses to cope with cases like this?
>>> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/checkpoints.html#retained-checkpoints
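Pulling the thread's answer together: retained incremental checkpoints address exactly this case. A minimal sketch of the relevant configuration (keys from the Flink documentation; the HDFS path is a placeholder, and retention on cancellation also needs to be enabled in the job's CheckpointConfig):

```yaml
# flink-conf.yaml -- incremental RocksDB checkpoints keep restart
# cost proportional to the changed state, not the full 100GB+.
state.backend: rocksdb
state.backend.incremental: true

# Distributed store where (retained) checkpoints are written.
state.checkpoints.dir: hdfs:///flink/checkpoints
```

With this in place, a user-initiated restart can restore (and, per Stefan's clarification above, even rescale) from the last retained checkpoint instead of drawing a fresh full savepoint.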
