flink-user mailing list archives

From Gwenhael Pasquiers <gwenhael.pasqui...@ericsson.com>
Subject RE: YARN High Availability
Date Wed, 18 Nov 2015 17:37:51 GMT

Looking at the logs I saw that it was having issues trying to connect to ZK.
To make it short, it had the wrong port.

It is now starting.
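
A wrong port like this can be caught before deploying by checking that every entry in recovery.zookeeper.quorum accepts a TCP connection. The following is a minimal sketch, not part of Flink: the function names parse_quorum, is_reachable and check_quorum are hypothetical, and host1..host3 are the placeholder hosts from the config quoted below. An open port only shows that something is listening there, not that it is a healthy ZooKeeper server.

```python
import socket


def parse_quorum(quorum):
    """Split 'host1:3181,host2:3181' into [('host1', 3181), ('host2', 3181)]."""
    servers = []
    for entry in quorum.split(","):
        entry = entry.strip()
        if not entry:
            continue
        host, sep, port = entry.rpartition(":")
        if not sep or not host:
            raise ValueError(f"expected host:port, got {entry!r}")
        servers.append((host, int(port)))
    return servers


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and DNS failures.
        return False


def check_quorum(quorum, timeout=2.0):
    """Map each 'host:port' entry of the quorum string to its reachability."""
    return {
        f"{host}:{port}": is_reachable(host, port, timeout)
        for host, port in parse_quorum(quorum)
    }
```

Calling check_quorum("host1:3181,host2:3181,host3:3181") from the machine that submits the YARN session returns a dict with one True/False entry per server; any False entry points at a host or port to double-check.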

Tomorrow I’ll try to kill some JobManagers *evil*.

Another question: if I have multiple HA Flink jobs, is there anything to check in order
to be sure that they won't collide on HDFS or ZK?
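
One way to at least spot shared HDFS locations is to compare the path settings of two job configurations. This is a rough sketch under assumptions: it only looks at the two path keys that appear in the config quoted below, the helper names parse_conf and shared_paths are hypothetical, and whether an identical path actually leads to a collision depends on how Flink lays out files beneath it. It does not inspect ZooKeeper at all.

```python
# Keys taken from the config quoted in this thread.
PATH_KEYS = (
    "state.backend.fs.checkpointdir",
    "recovery.zookeeper.storageDir",
)


def parse_conf(text):
    """Parse simple 'key: value' lines, skipping blanks and '#' comments."""
    conf = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # Split on the first ':' only, so values such as hdfs:///... stay intact.
        key, sep, value = line.partition(":")
        if sep:
            conf[key.strip()] = value.strip()
    return conf


def shared_paths(conf_a, conf_b, keys=PATH_KEYS):
    """Return {key: path} for every key whose path is the same in both configs."""
    shared = {}
    for key in keys:
        a, b = conf_a.get(key), conf_b.get(key)
        if a is None or b is None:
            continue
        # Ignore a trailing slash when comparing.
        if a.rstrip("/") == b.rstrip("/"):
            shared[key] = a.rstrip("/")
    return shared
```

Running shared_paths(parse_conf(conf_of_job_a), parse_conf(conf_of_job_b)) returns an empty dict when the two jobs use different directories for both keys.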



From: Till Rohrmann [mailto:till.rohrmann@gmail.com]
Sent: Wednesday, 18 November 2015 18:01
To: user@flink.apache.org
Subject: Re: YARN High Availability

Hi Gwenhaël,

do you have access to the yarn logs?


On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com> wrote:

We’re trying to set up high availability using an existing zookeeper quorum already running
in our Cloudera cluster.

So, as per the docs, we've changed the max attempts in YARN's config as well as in flink.yaml.

recovery.mode: zookeeper
recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
state.backend: filesystem
state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
recovery.zookeeper.storageDir: hdfs:///flink/recovery/
yarn.application-attempts: 1000

Everything is OK as long as recovery.mode is commented out.
As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:

“Deploying cluster, current state ACCEPTED”.
“Deployment took more than 60 seconds….”
Every second.

And I have more than enough resources available on my yarn cluster.

Do you have any idea what could cause this, and/or which logs I should look at in order
to understand it?


