flink-user mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Stephan Ewen <se...@apache.org>
Subject Re: Redeployements and state
Date Mon, 25 Jan 2016 15:21:46 GMT
Hi Niels!

There is a slight mismatch between your thoughts and the current design,
but not much.

What you describe (at the start of the job, the latest checkpoint is
automatically loaded) is basically what the high-availability setup does if
the master dies. The new master loads all jobs and continues them from the
latest checkpoint.
If you run an HA setup, and you stop/restart your jobs not by stopping the
jobs, but by killing the cluster, you should get that behavior.

Once a job is properly stopped, and you start a new job, there is no way
for Flink to tell that this is in fact the same job and it should resume
from where the recently stopped. Also, "same" should be a fuzzy "same", to
allow for slight changes in the job (bug fixes). Safepoints let you put the
persistent part of the job somewhere, to tell a new job where to pick up
from.
  - Makes it work in non-HA setups
  - Allows you to keep multiple savepoint (like "versions", say one per day
or so) to roll back to
  - Can have multiple versions of the same jobs resuming from one savepoint
(what-if or A/B tests, or seamless version upgrades)


There is something on the roadmap that would make your use case very easy:
"StopWithSavepoint"

There is an open pull request to cleanly stop() a streaming program. The
next enhancement is to stop it and let it draw a savepoint as part of that.
Then you can simply script a stop/start like that:

# stop with savepoint
bin/flink stop -s <random-directory> jobid

# resume
bin/flink run -s <random-directory> job


Hope that helps,
Stephan


On Fri, Jan 22, 2016 at 3:06 PM, Niels Basjes <Niels@basjes.nl> wrote:

> Hi,
>
> @Max: Thanks for the new URL. I noticed that a lot (in fact almost all) of
> links in the new manuals lead to 404 errors. Maybe you should run an
> automated test to find them all.
>
> I did a bit of reading about the savepoints and that in fact they are
> written as "Allow to trigger checkpoints manually".
>
> Let me sketch what I think I need:
> 1) I need recovery of the topology in case of partial failure (i.e. a
> single node dies).
> 2) I need recovery of the topology in case of full topology failure (i.e.
> Hadoop security tokens cause the entire thing to die, or I need to deploy a
> fixed version of my software).
>
> Now what I understand is that the checkpoints are managed by Flink and as
> such allow me to run the topology without any manual actions. These are
> cleaned automatically when no longer needed.
> These savepoints however appear to need external 'intervention'; they are
> intended as 'manual'. So in addition to my topology I need something extra
> that periodically (i.e. every minute) fires a command to persist a
> checkpoint into a savepoint and to cleanup the 'old' ones.
>
> What I want is something that works roughly as follows:
> 1) I configure everything (i.e. assign Ids configure the checkpoint
> directory, etc.)
> 2) The framework saves and cleans the checkpoints automatically when the
> topology is running.
> 3) I simply start the topology without any special options.
>
> My idea is essentially that at the startup of a topology the system looks
> at the configured checkpoint persistance and recovers the most recent one.
>
> Apparently there is a mismatch between what I think is useful and what has
> been implemented so far.
> Am I missing something or should I submit this as a Jira ticket for a
> later version?
>
> Niels Basjes
>
>
>
>
>
>
> On Mon, Jan 18, 2016 at 12:13 PM, Maximilian Michels <mxm@apache.org>
> wrote:
>
>> The documentation layout changed in the master. Then new URL:
>>
>> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
>>
>> On Thu, Jan 14, 2016 at 2:21 PM, Niels Basjes <Niels@basjes.nl> wrote:
>> > Yes, that is exactly the type of solution I was looking for.
>> >
>> > I'll dive into this.
>> > Thanks guys!
>> >
>> > Niels
>> >
>> > On Thu, Jan 14, 2016 at 11:55 AM, Ufuk Celebi <uce@apache.org> wrote:
>> >>
>> >> Hey Niels,
>> >>
>> >> as Gabor wrote, this feature has been merged to the master branch
>> >> recently.
>> >>
>> >> The docs are online here:
>> >>
>> https://ci.apache.org/projects/flink/flink-docs-master/apis/savepoints.html
>> >>
>> >> Feel free to report back your experience with it if you give it a try.
>> >>
>> >> – Ufuk
>> >>
>> >> > On 14 Jan 2016, at 11:09, Gábor Gévay <ggab90@gmail.com> wrote:
>> >> >
>> >> > Hello,
>> >> >
>> >> > You are probably looking for this feature:
>> >> > https://issues.apache.org/jira/browse/FLINK-2976
>> >> >
>> >> > Best,
>> >> > Gábor
>> >> >
>> >> >
>> >> >
>> >> >
>> >> > 2016-01-14 11:05 GMT+01:00 Niels Basjes <Niels@basjes.nl>:
>> >> >> Hi,
>> >> >>
>> >> >> I'm working on a streaming application using Flink.
>> >> >> Several steps in the processing are state-full (I use custom Windows
>> >> >> and
>> >> >> state-full operators ).
>> >> >>
>> >> >> Now if during a normal run an worker fails the checkpointing system
>> >> >> will be
>> >> >> used to recover.
>> >> >>
>> >> >> But what if the entire application is stopped (deliberately) or
>> >> >> stops/fails
>> >> >> because of a problem?
>> >> >>
>> >> >> At this moment I have three main reasons/causes for doing this:
>> >> >> 1) The application just dies because of a bug on my side or a
>> problem
>> >> >> like
>> >> >> for example this (which I'm actually confronted with):  Failed
to
>> >> >> Update
>> >> >> HDFS Delegation Token for long running application in HA mode
>> >> >> https://issues.apache.org/jira/browse/HDFS-9276
>> >> >> 2) I need to rebalance my application (i.e. stop, change
>> parallelism,
>> >> >> start)
>> >> >> 3) I need a new version of my software to be deployed. (i.e. I
>> fixed a
>> >> >> bug,
>> >> >> changed the topology and need to continue)
>> >> >>
>> >> >> I assume the solution will be in some part be specific for my
>> >> >> application.
>> >> >> The question is what features exist in Flink to support such a
clean
>> >> >> "continue where I left of" scenario?
>> >> >>
>> >> >> --
>> >> >> Best regards / Met vriendelijke groeten,
>> >> >>
>> >> >> Niels Basjes
>> >>
>> >
>> >
>> >
>> > --
>> > Best regards / Met vriendelijke groeten,
>> >
>> > Niels Basjes
>>
>
>
>
> --
> Best regards / Met vriendelijke groeten,
>
> Niels Basjes
>

Mime
View raw message