flink-user mailing list archives

From Ufuk Celebi <...@apache.org>
Subject Re: Redeployements and state
Date Tue, 26 Jan 2016 10:06:57 GMT
Hey Niels!

Stephan gave a very good summary of the current state of things. What do you think of the
stop-with-savepoint approach he outlined?
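For reference, the manual workflow that the current savepoint support already allows can be sketched roughly like this (job IDs and paths are placeholders, and the exact flags may differ depending on your Flink version, so please check the CLI docs):

```shell
# Trigger a savepoint for a running job; the CLI prints the savepoint path.
bin/flink savepoint <jobid>

# Cancel the running job once the savepoint has been written.
bin/flink cancel <jobid>

# Resume a (possibly fixed) version of the job from the savepoint.
bin/flink run -s <savepoint-path> your-job.jar
```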

Regarding the broken links: I fixed a number of them in the master docs yesterday.
If you run into more, feel free to post them to the ML or open a JIRA.

– Ufuk

> On 25 Jan 2016, at 16:21, Stephan Ewen <sewen@apache.org> wrote:
> 
> Hi Niels!
> 
> There is a slight mismatch between your thoughts and the current design, but not much.
> 
> What you describe (at the start of the job, the latest checkpoint is automatically loaded)
> is basically what the high-availability setup does if the master dies. The new master loads
> all jobs and continues them from the latest checkpoint.
> If you run an HA setup and you stop/restart your jobs not by cancelling them, but by
> killing the cluster, you should get that behavior.
> 
> Once a job is properly stopped and you start a new job, there is no way for Flink to
> tell that this is in fact the same job and that it should resume from where the stopped one
> left off. Also, "same" should be a fuzzy "same", to allow for slight changes in the job (bug
> fixes). Savepoints let you put the persistent part of the job somewhere, to tell a new job
> where to pick up from.
>   - Makes it work in non-HA setups
>   - Allows you to keep multiple savepoints (like "versions", say one per day or so) to
>     roll back to
>   - Can have multiple versions of the same job resuming from one savepoint (what-if
>     or A/B tests, or seamless version upgrades)
> 
> 
> There is something on the roadmap that would make your use case very easy: "StopWithSavepoint"
> 
> There is an open pull request to cleanly stop() a streaming program. The next enhancement
> is to stop it and let it take a savepoint as part of that. Then you can simply script a
> stop/start like this:
> 
> # stop with savepoint
> bin/flink stop -s <random-directory> jobid
> 
> # resume
> bin/flink run -s <random-directory> job
> 
> 
> Hope that helps,
> Stephan
> 
> 
> On Fri, Jan 22, 2016 at 3:06 PM, Niels Basjes <Niels@basjes.nl> wrote:
> Hi,
> 
> @Max: Thanks for the new URL. I noticed that a lot (in fact almost all) of the links in
> the new manuals lead to 404 errors. Maybe you should run an automated test to find them all.
> 
> I did a bit of reading about the savepoints and saw that they are in fact described as
> "Allow to trigger checkpoints manually".
> 
> Let me sketch what I think I need:
> 1) I need recovery of the topology in case of partial failure (i.e. a single node dies).
> 2) I need recovery of the topology in case of full topology failure (i.e. Hadoop security
tokens cause the entire thing to die, or I need to deploy a fixed version of my software).
> 
> Now what I understand is that the checkpoints are managed by Flink and as such allow
> me to run the topology without any manual actions. These are cleaned up automatically when
> no longer needed.
> These savepoints however appear to need external 'intervention'; they are intended as
> 'manual'. So in addition to my topology I need something extra that periodically (e.g. every
> minute) fires a command to persist a checkpoint into a savepoint and to clean up the 'old'
> ones.
> 
> What I want is something that works roughly as follows:
> 1) I configure everything (e.g. assign IDs, configure the checkpoint directory, etc.)
> 2) The framework saves and cleans the checkpoints automatically when the topology is
running.
> 3) I simply start the topology without any special options.
> 
> My idea is essentially that at the startup of a topology the system looks at the configured
> checkpoint persistence and recovers the most recent one.
> 
> Apparently there is a mismatch between what I think is useful and what has been implemented
so far. 
> Am I missing something or should I submit this as a Jira ticket for a later version?
> 
> Niels Basjes
> 
> On Mon, Jan 18, 2016 at 12:13 PM, Maximilian Michels <mxm@apache.org> wrote:
> The documentation layout changed in the master. The new URL:
> https://ci.apache.org/projects/flink/flink-docs-master/apis/streaming/savepoints.html
> 
> On Thu, Jan 14, 2016 at 2:21 PM, Niels Basjes <Niels@basjes.nl> wrote:
> > Yes, that is exactly the type of solution I was looking for.
> >
> > I'll dive into this.
> > Thanks guys!
> >
> > Niels
> >
> > On Thu, Jan 14, 2016 at 11:55 AM, Ufuk Celebi <uce@apache.org> wrote:
> >>
> >> Hey Niels,
> >>
> >> as Gabor wrote, this feature has been merged to the master branch
> >> recently.
> >>
> >> The docs are online here:
> >> https://ci.apache.org/projects/flink/flink-docs-master/apis/savepoints.html
> >>
> >> Feel free to report back your experience with it if you give it a try.
> >>
> >> – Ufuk
> >>
> >> > On 14 Jan 2016, at 11:09, Gábor Gévay <ggab90@gmail.com> wrote:
> >> >
> >> > Hello,
> >> >
> >> > You are probably looking for this feature:
> >> > https://issues.apache.org/jira/browse/FLINK-2976
> >> >
> >> > Best,
> >> > Gábor
> >> >
> >> >
> >> >
> >> >
> >> > 2016-01-14 11:05 GMT+01:00 Niels Basjes <Niels@basjes.nl>:
> >> >> Hi,
> >> >>
> >> >> I'm working on a streaming application using Flink.
> >> >> Several steps in the processing are stateful (I use custom windows and
> >> >> stateful operators).
> >> >>
> >> >> Now if during a normal run a worker fails, the checkpointing system
> >> >> will be used to recover.
> >> >>
> >> >> But what if the entire application is stopped (deliberately) or
> >> >> stops/fails
> >> >> because of a problem?
> >> >>
> >> >> At this moment I have three main reasons/causes for doing this:
> >> >> 1) The application just dies because of a bug on my side or a problem like,
> >> >> for example, this one (which I'm actually confronted with): "Failed to Update
> >> >> HDFS Delegation Token for long running application in HA mode"
> >> >> https://issues.apache.org/jira/browse/HDFS-9276
> >> >> 2) I need to rebalance my application (i.e. stop, change parallelism, start)
> >> >> 3) I need a new version of my software to be deployed (i.e. I fixed a bug,
> >> >> changed the topology and need to continue)
> >> >>
> >> >> I assume the solution will in some part be specific to my application.
> >> >> The question is what features exist in Flink to support such a clean
> >> >> "continue where I left off" scenario?
> >> >>
> >> >> --
> >> >> Best regards / Met vriendelijke groeten,
> >> >>
> >> >> Niels Basjes
> >>
> >
> >
> >
> > --
> > Best regards / Met vriendelijke groeten,
> >
> > Niels Basjes
> 
> 
> 
> -- 
> Best regards / Met vriendelijke groeten,
> 
> Niels Basjes
> 

