From: Stephan Ewen
Date: Mon, 31 Oct 2016 16:29:27 +0100
Subject: Re: Flink on YARN - Fault Tolerance | use case supported or not
To: user@flink.apache.org
Cc: Maximilian Michels

Hi Anchit!

In high-availability setups, a Flink cluster recovers the jobs that it considers to belong to the cluster. That is determined by the ZooKeeper cluster namespace, "recovery.zookeeper.path.namespace":
https://github.com/apache/flink/blob/release-1.1.3/flink-core/src/main/java/org/apache/flink/configuration/ConfigConstants.java#L646

If you submit the job in "per-job YARN" mode (via 'bin/flink run -m yarn-cluster ...'), it gets a unique, auto-generated namespace. The assumption is that the job recovers itself as long as the YARN job keeps running. If you force YARN to terminate the job, it is gone.

If you start a "YARN session", it picks up the namespace from the config. If you kill that YARN session while jobs are running and then start a new session with the same namespace, it will start recovering the previously running jobs.

Does that make sense?

Greetings,
Stephan
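A rough sketch of the two submission modes described above, against the Flink 1.1 CLI. The namespace value "/my-session", the container counts, and the jar path are placeholders; a working HA setup also assumes "recovery.mode: zookeeper" and a ZooKeeper quorum are configured.

    # Per-job mode: brings up a YARN application for this one job; Flink
    # auto-generates a unique ZooKeeper namespace for its HA metadata.
    bin/flink run -m yarn-cluster -yn 2 ./my-application.jar

    # Session mode: the namespace is read from conf/flink-conf.yaml, e.g.
    #   recovery.zookeeper.path.namespace: /my-session
    # A restarted session with the same value recovers the old session's jobs.
    bin/yarn-session.sh -n 2
    bin/flink run ./my-application.jar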
On Mon, Oct 31, 2016 at 4:17 PM, Kostas Kloudas <k.kloudas@data-artisans.com> wrote:

> Hi Jatana,
>
> As you pointed out, the correct way to do the above is to use savepoints.
> If you kill your application, then this is not a crash but rather a
> voluntary action.
>
> I am also looping in Max, as he may have something more to say on this.
>
> Cheers,
> Kostas
>
> On Sat, Oct 29, 2016 at 12:13 AM, Anchit Jatana <
> development.anchit@gmail.com> wrote:
>
>> Hi All,
>>
>> I tried testing fault tolerance in my running Flink application in a
>> different way (not sure if it is an appropriate way). I ran the Flink
>> application on YARN and, after a few checkpoints had completed, killed
>> the YARN application using:
>>
>> yarn application -kill application_1476277440022_xxxx
>>
>> I then tried restarting the application, providing the same
>> checkpointing directory path. The application started afresh and did not
>> resume from the last checkpointed state. I just wanted to make sure
>> whether fault tolerance in this use case is valid or not, and if it is,
>> what am I doing wrong?
>>
>> I'm aware of the savepoint process (create a savepoint, stop the
>> application, and resume a new application from that savepoint), but I
>> wished to test the above use case: if the YARN application gets killed,
>> perhaps accidentally or for some other reason, is this kind of fault
>> tolerance supported or not?
>>
>> Regards,
>> Anchit
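For reference, the savepoint workflow mentioned above looks roughly like this with the Flink 1.1 CLI; <jobID>, <savepointPath>, and the jar path are placeholders:

    # Trigger a savepoint for the running job (the CLI prints the path).
    bin/flink savepoint <jobID>

    # Cancel the job once the savepoint has completed.
    bin/flink cancel <jobID>

    # Start a new job from the savepoint state.
    bin/flink run -s <savepointPath> ./my-application.jar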