flink-user mailing list archives

From Robert Metzger <rmetz...@apache.org>
Subject Re: YARN High Availability
Date Thu, 19 Nov 2015 09:55:59 GMT
I agree with Aljoscha. Many companies install Flink (and its config) in a
central directory and users share that installation.

On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:

> I think we should find a way to randomize the paths where the HA stuff
> stores data. If users don’t realize that they store data in the same paths,
> this could lead to problems.
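>
> For example (purely illustrative, not an existing option), the root path
> could embed the YARN application id so that two clusters can never share a
> namespace:
>
> # hypothetical: derive the root from the YARN application id
> recovery.zookeeper.path.root: /flink/application_1447920000000_0001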
>
> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrmann@apache.org> wrote:
> >
> > Hi Gwenhaël,
> >
> > good to hear that you could resolve the problem.
> >
> > When you run multiple HA Flink jobs in the same cluster, you don’t have to
> > adjust the configuration of Flink. It should work out of the box.
> >
> > However, if you run multiple HA Flink clusters, then you have to set a
> > distinct ZooKeeper root path for each cluster via the option
> > recovery.zookeeper.path.root in the Flink configuration. This is necessary
> > because otherwise all JobManagers (the ones of the different clusters) will
> > compete for leadership. Furthermore, all TaskManagers will only see the one
> > and only leader and connect to it. The reason is that the TaskManagers look
> > up their leader at a ZNode below the ZooKeeper root path.
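> >
> > For example (the cluster names here are made up), each cluster would get
> > its own root in its flink-conf.yaml:
> >
> > # cluster A
> > recovery.zookeeper.path.root: /flink/cluster-a
> >
> > # cluster B
> > recovery.zookeeper.path.root: /flink/cluster-b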
> >
> > If you have other questions then don’t hesitate asking me.
> >
> > Cheers,
> > Till
> >
> >
> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com> wrote:
> > Nevermind,
> >
> >
> >
> > Looking at the logs I saw that it was having issues trying to connect to
> > ZK.
> >
> > To make it short, it had the wrong port.
> >
> >
> >
> > It is now starting.
> >
> >
> >
> > Tomorrow I’ll try to kill some JobManagers *evil*.
> >
> >
> >
> > Another question: if I have multiple HA Flink jobs, are there some points
> > to check in order to be sure that they won’t collide on HDFS or ZK?
> >
> >
> >
> > B.R.
> >
> >
> >
> > Gwenhaël PASQUIERS
> >
> >
> >
> > From: Till Rohrmann [mailto:till.rohrmann@gmail.com]
> > Sent: Wednesday, 18 November 2015 18:01
> > To: user@flink.apache.org
> > Subject: Re: YARN High Availability
> >
> >
> >
> > Hi Gwenhaël,
> >
> >
> >
> > do you have access to the yarn logs?
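> >
> > If log aggregation is enabled, something like the following should fetch
> > them (<applicationId> being the id YARN printed at submission):
> >
> > yarn logs -applicationId <applicationId>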
> >
> >
> >
> > Cheers,
> >
> > Till
> >
> >
> >
> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com> wrote:
> >
> > Hello,
> >
> >
> >
> > We’re trying to set up high availability using an existing ZooKeeper
> > quorum already running in our Cloudera cluster.
> >
> >
> >
> > So, as per the doc, we’ve changed the max attempts in YARN’s config as
> > well as in the flink.yaml (the YARN-side property is shown after the Flink
> > settings below).
> >
> >
> >
> > recovery.mode: zookeeper
> >
> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
> >
> > state.backend: filesystem
> >
> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
> >
> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
> >
> > yarn.application-attempts: 1000
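> >
> > For reference, the YARN-side counterpart (assuming the standard property
> > name in yarn-site.xml) would look like this:
> >
> > <property>
> >   <name>yarn.resourcemanager.am.max-attempts</name>
> >   <value>1000</value>
> > </property>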
> >
> >
> >
> > Everything is ok as long as recovery.mode is commented.
> >
> > As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
> >
> >
> >
> > “Deploying cluster, current state ACCEPTED”.
> >
> > “Deployment took more than 60 seconds….”
> >
> > The latter message repeats every second.
> >
> >
> >
> > And I have more than enough resources available on my YARN cluster.
> >
> >
> >
> > Do you have any idea of what could cause this, and/or what logs I should
> > look for in order to understand?
> >
> >
> >
> > B.R.
> >
> >
> >
> > Gwenhaël PASQUIERS
> >
> >
> >
> >
>
>
