flink-user mailing list archives

From Aljoscha Krettek <aljos...@apache.org>
Subject Re: YARN High Availability
Date Thu, 19 Nov 2015 11:22:26 GMT
Yes, that’s what I meant.

> On 19 Nov 2015, at 12:08, Till Rohrmann <trohrmann@apache.org> wrote:
> 
> You mean an additional start-up parameter for the `start-cluster.sh` script for the HA case? That could work.
> 
> On Thu, Nov 19, 2015 at 11:54 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:
> Maybe we could add a user parameter to specify a cluster name that is used to make the paths unique.
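> 
> As an illustration only (the flag below is hypothetical and does not exist yet), such an invocation could look like:
> 
>   ./bin/start-cluster.sh --cluster-name analytics-prod
> 
> which could then be mapped to a dedicated ZooKeeper root such as /flink/analytics-prod so that two clusters never share the same ZNodes.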
> 
> 
> On Thu, Nov 19, 2015, 11:24 Till Rohrmann <trohrmann@apache.org> wrote:
> I agree that this would make the configuration easier. However, it also means that the user has to retrieve the randomized path from the logs if they want to restart jobs after the cluster has crashed or was intentionally restarted. Furthermore, the system won't be able to clean up old checkpoint and job handles when the cluster stop was intentional.
> 
> Thus, the question is how we define the behaviour for retrieving handles and for cleaning up old ones so that ZooKeeper won't be cluttered with stale handles.
> 
> There are basically two modes:
> 
> 1. Keep the state handles when shutting down the cluster. Provide a means to define a fixed path when starting the cluster and also a means to purge old state handles (see the clean-up sketch below). Furthermore, add a shutdown mode where the handles under the current path are removed directly. This mode guarantees that the state handles are always available unless explicitly told otherwise. However, the downside is that ZooKeeper will almost certainly become cluttered.
> 
> 2. Remove the state handles when shutting down the cluster. Provide a shutdown mode where we keep the state handles. This keeps ZooKeeper clean but still gives you the possibility to keep a checkpoint around if necessary. However, the user is more likely to lose their state when shutting down the cluster.
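> 
> As a rough sketch of what manual inspection and clean-up of old handles could look like in either mode (the ZooKeeper host, port and paths below are made-up examples; ls and rmr are standard zkCli.sh commands):
> 
>   bin/zkCli.sh -server host1:3181
>   ls /flink
>   rmr /flink/old-cluster
> 
> where ls shows which cluster roots still hold handles and rmr purges the subtree of a cluster that is gone for good.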
> 
> On Thu, Nov 19, 2015 at 10:55 AM, Robert Metzger <rmetzger@apache.org> wrote:
> I agree with Aljoscha. Many companies install Flink (and its config) in a central directory and users share that installation.
> 
> On Thu, Nov 19, 2015 at 10:45 AM, Aljoscha Krettek <aljoscha@apache.org> wrote:
> I think we should find a way to randomize the paths where the HA stuff stores data. If users don’t realize that they store data in the same paths, this could lead to problems.
> 
> > On 19 Nov 2015, at 08:50, Till Rohrmann <trohrmann@apache.org> wrote:
> >
> > Hi Gwenhaël,
> >
> > good to hear that you could resolve the problem.
> >
> > When you run multiple HA Flink jobs in the same cluster, then you don’t have to adjust the configuration of Flink. It should work out of the box.
> >
> > However, if you run multiple HA Flink clusters, then you have to set a distinct ZooKeeper root path for each cluster via the option recovery.zookeeper.path.root in the Flink configuration. This is necessary because otherwise all JobManagers (the ones of the different clusters) will compete for a single leadership. Furthermore, all TaskManagers will only see the one and only leader and connect to it. The reason is that the TaskManagers look up their leader at a ZNode below the ZooKeeper root path.
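> >
> > A minimal sketch of what that looks like in the two flink-conf.yaml files (the path values are just examples):
> >
> >   # cluster A
> >   recovery.zookeeper.path.root: /flink/cluster-a
> >
> >   # cluster B
> >   recovery.zookeeper.path.root: /flink/cluster-b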
> >
> > If you have other questions, then don’t hesitate to ask me.
> >
> > Cheers,
> > Till
> >
> >
> > On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com> wrote:
> > Nevermind,
> >
> >
> >
> > Looking at the logs I saw that it was having issues trying to connect to ZK.
> >
> > To make it short, it had the wrong port.
> >
> >
> >
> > It is now starting.
> >
> >
> >
> > Tomorrow I’ll try to kill some JobManagers *evil*.
> >
> >
> >
> > Another question: if I have multiple HA Flink jobs, are there some points to check in order to be sure that they won’t collide on HDFS or ZK?
> >
> >
> >
> > B.R.
> >
> >
> >
> > Gwenhaël PASQUIERS
> >
> >
> >
> > From: Till Rohrmann [mailto:till.rohrmann@gmail.com]
> > Sent: mercredi 18 novembre 2015 18:01
> > To: user@flink.apache.org
> > Subject: Re: YARN High Availability
> >
> >
> >
> > Hi Gwenhaël,
> >
> >
> >
> > do you have access to the yarn logs?
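> >
> > (If log aggregation is enabled on the cluster, they can typically be fetched once the application has stopped with:
> >
> >   yarn logs -applicationId <application id>
> >
> > using the application id shown during deployment.)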
> >
> >
> >
> > Cheers,
> >
> > Till
> >
> >
> >
> > On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com> wrote:
> >
> > Hello,
> >
> >
> >
> > We’re trying to set up high availability using an existing ZooKeeper quorum already running in our Cloudera cluster.
> >
> >
> >
> > So, as per the doc we’ve changed the max attempts in YARN’s config as well as in the flink-conf.yaml.
> >
> >
> >
> > recovery.mode: zookeeper
> >
> > recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
> >
> > state.backend: filesystem
> >
> > state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
> >
> > recovery.zookeeper.storageDir: hdfs:///flink/recovery/
> >
> > yarn.application-attempts: 1000
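> >
> > For context, a session with this configuration is brought up via the yarn-session.sh script, for example (container count and memory sizes below are arbitrary):
> >
> >   ./bin/yarn-session.sh -n 4 -jm 1024 -tm 2048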
> >
> >
> >
> > Everything is OK as long as recovery.mode is commented out.
> >
> > As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
> >
> >
> >
> > “Deploying cluster, current state ACCEPTED”.
> >
> > “Deployment took more than 60 seconds….”
> >
> > Every second.
> >
> >
> >
> > And I have more than enough resources available on my yarn cluster.
> >
> >
> >
> > Do you have any idea of what could cause this, and/or what logs I should look for in order to understand?
> >
> >
> >
> > B.R.
> >
> >
> >
> > Gwenhaël PASQUIERS
> >
> >
> >
> >
> 
> 
> 
> 

