flink-user mailing list archives

From Till Rohrmann <trohrm...@apache.org>
Subject Re: long lived standalone job session cluster in kubernetes
Date Mon, 11 Feb 2019 08:26:26 GMT
Hi Heath,

I just learned that people from Alibaba already made some good progress
with FLINK-9953. I'm currently talking to them in order to see how we can
merge this contribution into Flink as fast as possible. Since I'm quite
busy due to the upcoming release I hope that other community members will
help out with the reviewing once the PRs are opened.

Cheers,
Till

On Fri, Feb 8, 2019 at 8:50 PM Heath Albritton <halbritt@harm.org> wrote:

> Has any progress been made on this?  There are a number of folks in
> the community looking to help out.
>
>
> -H
>
> On Wed, Dec 5, 2018 at 10:00 AM Till Rohrmann <trohrmann@apache.org>
> wrote:
> >
> > Hi Derek,
> >
> > there is this issue [1] which tracks the active Kubernetes integration.
> > Jin Sun already started implementing some parts of it. There should also be
> > some PRs open for it. Please check them out.
> >
> > [1] https://issues.apache.org/jira/browse/FLINK-9953
> >
> > Cheers,
> > Till
> >
> > On Wed, Dec 5, 2018 at 6:39 PM Derek VerLee <derekverlee@gmail.com> wrote:
> >>
> >> Sounds good.
> >>
> >> Is someone working on this automation today?
> >>
> >> If not, although my time is tight, I may be able to work on a PR to get
> >> us started down the path to a Kubernetes-native cluster mode.
> >>
> >>
> >> On 12/4/18 5:35 AM, Till Rohrmann wrote:
> >>
> >> Hi Derek,
> >>
> >> what I would recommend is to trigger the cancel-with-savepoint command [1].
> >> This will create a savepoint and terminate the job execution. Next, you
> >> simply need to respawn the job cluster, providing it with the savepoint
> >> to resume from.
> >>
> >> [1] https://ci.apache.org/projects/flink/flink-docs-stable/ops/state/savepoints.html#cancel-job-with-savepoint
> >>
> >> Cheers,
> >> Till
> >>
> >> On Tue, Dec 4, 2018 at 10:30 AM Andrey Zagrebin <andrey@data-artisans.com> wrote:
> >>>
> >>> Hi Derek,
> >>>
> >>> I think your automation steps look good.
> >>> Recreating deployments should not take long
> >>> and as you mention, this way you can avoid unpredictable old/new
> >>> version collisions.
> >>>
> >>> Best,
> >>> Andrey
> >>>
> >>> > On 4 Dec 2018, at 10:22, Dawid Wysakowicz <dwysakowicz@apache.org> wrote:
> >>> >
> >>> > Hi Derek,
> >>> >
> >>> > I am not an expert in Kubernetes, so I will cc Till, who should be able
> >>> > to help you more.
> >>> >
> >>> > As for automating a similar process, I would recommend having a
> >>> > look at dA Platform [1], which is built on top of Kubernetes.
> >>> >
> >>> > Best,
> >>> >
> >>> > Dawid
> >>> >
> >>> > [1] https://data-artisans.com/platform-overview
> >>> >
> >>> > On 30/11/2018 02:10, Derek VerLee wrote:
> >>> >>
> >>> >> I'm looking at the job cluster mode. It looks great, and I am
> >>> >> considering migrating our jobs off our "legacy" session cluster and
> >>> >> into Kubernetes.
> >>> >>
> >>> >> I do need to ask some questions because I haven't found a lot of
> >>> >> details in the documentation about how it works yet, and I gave up
> >>> >> following the DI around in the code after a while.
> >>> >>
> >>> >> Let's say I have a deployment for the job "leader" in HA with ZK, and
> >>> >> another deployment for the taskmanagers.
> >>> >>
> >>> >> I want to upgrade the code or configuration and start from a
> >>> >> savepoint, in an automated way.
> >>> >>
> >>> >> As best I can figure, I cannot just update the deployment resources in
> >>> >> Kubernetes and allow the containers to restart in an arbitrary order.
> >>> >>
> >>> >> Instead, I expect sequencing is important, something along the lines
> >>> >> of this:
> >>> >>
> >>> >> 1. issue savepoint command on leader
> >>> >> 2. wait for savepoint
> >>> >> 3. destroy all leader and taskmanager containers
> >>> >> 4. deploy new leader, with savepoint url
> >>> >> 5. deploy new taskmanagers
> >>> >>
> >>> >>
> >>> >> For example, I imagine old taskmanagers (with an old version of my
> >>> >> job) attaching to the new leader and causing a problem.
> >>> >>
> >>> >> Does that sound right, or am I overthinking it?
> >>> >>
> >>> >> If not, has anyone tried implementing any automation for this yet?
> >>> >>
> >>> >
> >>>
>
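
The five-step upgrade sequence quoted above (take a savepoint, wait for it, tear down the old leader and taskmanagers, redeploy the leader with the savepoint path, redeploy the taskmanagers) could be scripted roughly as follows. This is a minimal Python sketch under stated assumptions, not a tested tool and not something posted in this thread. The deployment names, the pod label selector, the manifest templates, and the ${SAVEPOINT_PATH} placeholder are all hypothetical. It also assumes that the `flink` CLI is configured to reach the job's leader, that `flink cancel -s` blocks until the savepoint is complete, and that it prints a line containing "Savepoint stored in <path>"; please verify these against the Flink version you run.

```python
import re
import subprocess
from typing import List

# Assumption: the CLI reports the savepoint location on a line like
# "... Savepoint stored in <path>." Verify this for your Flink version.
SAVEPOINT_RE = re.compile(r"Savepoint stored in (\S+)")


def cancel_with_savepoint_cmd(job_id: str, target_dir: str) -> List[str]:
    """Steps 1-2: cancel the job and write a savepoint to target_dir."""
    return ["flink", "cancel", "-s", target_dir, job_id]


def parse_savepoint_path(cli_output: str) -> str:
    """Extract the savepoint path from the CLI output, dropping a trailing dot."""
    match = SAVEPOINT_RE.search(cli_output)
    if match is None:
        raise RuntimeError("no savepoint path found in CLI output")
    return match.group(1).rstrip(".")


def teardown_cmds(deployments: List[str], pod_selector: str) -> List[List[str]]:
    """Step 3: delete the old deployments, then wait until their pods are gone."""
    cmds = [["kubectl", "delete", "deployment", name] for name in deployments]
    cmds.append(["kubectl", "wait", "--for=delete", "pod",
                 "-l", pod_selector, "--timeout=120s"])
    return cmds


def render_manifest(template: str, savepoint_path: str) -> str:
    """Step 4 helper: fill the hypothetical ${SAVEPOINT_PATH} placeholder."""
    return template.replace("${SAVEPOINT_PATH}", savepoint_path)


def upgrade(job_id: str, target_dir: str, leader_template: str,
            taskmanager_manifest: str, deployments: List[str],
            pod_selector: str) -> str:
    """Run the whole sequence in order and return the savepoint path used."""
    result = subprocess.run(cancel_with_savepoint_cmd(job_id, target_dir),
                            check=True, capture_output=True, text=True)
    savepoint = parse_savepoint_path(result.stdout)
    for cmd in teardown_cmds(deployments, pod_selector):
        subprocess.run(cmd, check=True)
    # Steps 4-5: the leader goes first, so it already knows the savepoint
    # before any new taskmanager registers with it.
    leader_manifest = render_manifest(leader_template, savepoint)
    for manifest in (leader_manifest, taskmanager_manifest):
        subprocess.run(["kubectl", "apply", "-f", "-"],
                       input=manifest, text=True, check=True)
    return savepoint
```

Waiting for the old pods to disappear before applying the new manifests is what keeps old taskmanagers from attaching to the new leader. Note that `kubectl wait --for=delete` may behave differently across kubectl versions when no pods match the selector, so that step may need adjusting.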
