From: Robert Metzger
Date: Thu, 19 Nov 2015 10:55:59 +0100
Subject: Re: YARN High Availability
To: user@flink.apache.org

I agree with Aljoscha. Many companies install Flink (and its config) in a
central directory and users share that installation.

On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:
> I think we should find a way to randomize the paths where the HA stuff
> stores data. If users don't realize that they store data in the same
> paths, this could lead to problems.
>
> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrmann@apache.org> wrote:
> >
> > Hi Gwenhaël,
> >
> > Good to hear that you could resolve the problem.
> >
> > When you run multiple HA Flink jobs in the same cluster, you don't have
> > to adjust the Flink configuration. It should work out of the box.
> >
> > However, if you run multiple HA Flink clusters, then you have to set a
> > distinct ZooKeeper root path for each cluster via the option
> > recovery.zookeeper.path.root in the Flink configuration. This is
> > necessary because otherwise all JobManagers (the ones of the different
> > clusters) will compete for a single leadership. Furthermore, all
> > TaskManagers will only see the one and only leader and connect to it.
> > The reason is that the TaskManagers look up their leader at a ZNode
> > below the ZooKeeper root path.
> >
> > If you have other questions, don't hesitate to ask.
> >
> > Cheers,
> > Till
> >
> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers
> > <gwenhael.pasquiers@ericsson.com> wrote:
> > Nevermind,
> >
> > Looking at the logs I saw that it was having issues trying to connect
> > to ZK.
> >
> > To make it short, it had the wrong port.
> >
> > It is now starting.
> >
> > Tomorrow I'll try to kill some JobManagers *evil*.
> >
> > Another question: if I have multiple HA Flink jobs, are there some
> > points to check in order to be sure that they won't collide on HDFS
> > or ZK?
> >
> > B.R.
> >
> > Gwenhaël PASQUIERS
> >
> > From: Till Rohrmann [mailto:till.rohrmann@gmail.com]
> > Sent: Wednesday, 18 November 2015 18:01
> > To: user@flink.apache.org
> > Subject: Re: YARN High Availability
> >
> > Hi Gwenhaël,
> >
> > Do you have access to the YARN logs?
> >
> > Cheers,
> >
> > Till
> >
> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers
> > <gwenhael.pasquiers@ericsson.com> wrote:
> >
> > Hello,
> >
> > We're trying to set up high availability using an existing ZooKeeper
> > quorum already running in our Cloudera cluster.
> >
> > So, as per the doc, we've changed the max attempts in YARN's config as
> > well as the flink.yaml:
> >
> > recovery.mode: zookeeper
> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
> > state.backend: filesystem
> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
> > yarn.application-attempts: 1000
> >
> > Everything is OK as long as recovery.mode is commented out.
> > As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
> >
> > "Deploying cluster, current state ACCEPTED."
> > "Deployment took more than 60 seconds…"
> >
> > Every second.
> >
> > And I have more than enough resources available on my YARN cluster.
> >
> > Do you have any idea of what could cause this, and/or what logs I should
> > look at in order to understand?
> >
> > B.R.
> >
> > Gwenhaël PASQUIERS
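
To illustrate Till's point about distinct ZooKeeper root paths, below is a
minimal Flink configuration sketch for two separate HA clusters sharing the
same ZooKeeper quorum. The cluster names, ports, and HDFS paths are
hypothetical placeholders, not values taken from this thread:

    # Cluster A (hypothetical values)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
    recovery.zookeeper.path.root: /flink/cluster-a
    recovery.zookeeper.storageDir: hdfs:///flink/cluster-a/recovery/
    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs:///flink/cluster-a/checkpoints

    # Cluster B (hypothetical values; same quorum, different paths)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
    recovery.zookeeper.path.root: /flink/cluster-b
    recovery.zookeeper.storageDir: hdfs:///flink/cluster-b/recovery/
    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs:///flink/cluster-b/checkpoints

With distinct recovery.zookeeper.path.root values, the JobManagers of each
cluster elect a leader under their own ZNode and the TaskManagers look up
their leader below their own root path. Keeping the recovery storage and
checkpoint directories distinct per cluster, as sketched above, is one way to
address the question about avoiding collisions on HDFS and ZooKeeper.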
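
On the deployment that hangs at "current state ACCEPTED", a hedged sketch of
how the YARN-side logs could be pulled with the standard YARN CLI (the
application id below is a placeholder, and log aggregation must be enabled on
the cluster for "yarn logs" to return anything):

    # List submitted applications to find the id of the Flink session
    yarn application -list
    # Fetch the aggregated container logs for that application (placeholder id)
    yarn logs -applicationId application_1447000000000_0001

An application that stays in ACCEPTED has been admitted by the ResourceManager
but its ApplicationMaster has not yet reported as running. In this thread the
root cause turned out to be visible in those logs (a wrong ZooKeeper port), so
the container logs are usually the first place to look.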