From: Robert Metzger
Date: Thu, 19 Nov 2015 10:55:59 +0100
Subject: Re: YARN High Availability
To: user@flink.apache.org

I agree with Aljoscha. Many companies install Flink (and its config) in a
central directory and users share that installation.

On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:
> I think we should find a way to randomize the paths where the HA stuff
> stores data. If users don't realize that they store data in the same
> paths, this could lead to problems.
>
> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrmann@apache.org> wrote:
> >
> > Hi Gwenhaël,
> >
> > Good to hear that you could resolve the problem.
> >
> > When you run multiple HA Flink jobs in the same cluster, you don't have
> > to adjust the Flink configuration. It should work out of the box.
> >
> > However, if you run multiple HA Flink clusters, then you have to set a
> > distinct ZooKeeper root path for each cluster via the option
> > recovery.zookeeper.path.root in the Flink configuration. This is
> > necessary because otherwise all JobManagers (the ones of the different
> > clusters) will compete for a single leadership. Furthermore, all
> > TaskManagers will only see the one and only leader and connect to it.
> > The reason is that the TaskManagers look up their leader at a ZNode
> > below the ZooKeeper root path.
> >
> > If you have other questions, don't hesitate to ask.
> >
> > Cheers,
> > Till
> >
> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers
> > <gwenhael.pasquiers@ericsson.com> wrote:
> > Nevermind,
> >
> > Looking at the logs I saw that it was having issues trying to connect
> > to ZK.
> >
> > To make it short, it had the wrong port.
> >
> > It is now starting.
> >
> > Tomorrow I'll try to kill some JobManagers *evil*.
> >
> > Another question: if I have multiple HA Flink jobs, are there some
> > points to check in order to be sure that they won't collide on HDFS
> > or ZK?
> >
> > B.R.
> >
> > Gwenhaël PASQUIERS
> >
> > From: Till Rohrmann [mailto:till.rohrmann@gmail.com]
> > Sent: Wednesday, 18 November 2015 18:01
> > To: user@flink.apache.org
> > Subject: Re: YARN High Availability
> >
> > Hi Gwenhaël,
> >
> > Do you have access to the YARN logs?
> >
> > Cheers,
> >
> > Till
> >
> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers
> > <gwenhael.pasquiers@ericsson.com> wrote:
> >
> > Hello,
> >
> > We're trying to set up high availability using an existing ZooKeeper
> > quorum already running in our Cloudera cluster.
> >
> > So, as per the doc, we've changed the max attempts in YARN's config as
> > well as the flink.yaml:
> >
> > recovery.mode: zookeeper
> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
> > state.backend: filesystem
> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
> > yarn.application-attempts: 1000
> >
> > Everything is OK as long as recovery.mode is commented out.
> > As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
> >
> > "Deploying cluster, current state ACCEPTED."
> > "Deployment took more than 60 seconds…"
> >
> > Every second.
> >
> > And I have more than enough resources available on my YARN cluster.
> >
> > Do you have any idea of what could cause this, and/or what logs I should
> > look at in order to understand?
> >
> > B.R.
> >
> > Gwenhaël PASQUIERS
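
To illustrate Till's point about distinct ZooKeeper root paths, below is a
minimal Flink configuration sketch for two separate HA clusters sharing the
same ZooKeeper quorum. The cluster names, ports, and HDFS paths are
hypothetical placeholders, not values taken from this thread:

    # Cluster A (hypothetical values)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
    recovery.zookeeper.path.root: /flink/cluster-a
    recovery.zookeeper.storageDir: hdfs:///flink/cluster-a/recovery/
    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs:///flink/cluster-a/checkpoints

    # Cluster B (hypothetical values; same quorum, different paths)
    recovery.mode: zookeeper
    recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
    recovery.zookeeper.path.root: /flink/cluster-b
    recovery.zookeeper.storageDir: hdfs:///flink/cluster-b/recovery/
    state.backend: filesystem
    state.backend.fs.checkpointdir: hdfs:///flink/cluster-b/checkpoints

With distinct recovery.zookeeper.path.root values, the JobManagers of each
cluster elect a leader under their own ZNode and the TaskManagers look up
their leader below their own root path. Keeping the recovery storage and
checkpoint directories distinct per cluster, as sketched above, is one way to
address the question about avoiding collisions on HDFS and ZooKeeper.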
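
On the deployment that hangs at "current state ACCEPTED", a hedged sketch of
how the YARN-side logs could be pulled with the standard YARN CLI (the
application id below is a placeholder, and log aggregation must be enabled on
the cluster for "yarn logs" to return anything):

    # List submitted applications to find the id of the Flink session
    yarn application -list
    # Fetch the aggregated container logs for that application (placeholder id)
    yarn logs -applicationId application_1447000000000_0001

An application that stays in ACCEPTED has been admitted by the ResourceManager
but its ApplicationMaster has not yet reported as running. In this thread the
root cause turned out to be visible in those logs (a wrong ZooKeeper port), so
the container logs are usually the first place to look.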