flink-user mailing list archives

From Aljoscha Krettek <aljos...@apache.org>
Subject Re: YARN High Availability
Date Thu, 19 Nov 2015 09:45:54 GMT
I think we should find a way to randomize the paths where the HA stuff stores data. If users don’t realize that multiple clusters store data in the same paths, this could lead to problems.

> On 19 Nov 2015, at 08:50, Till Rohrmann <trohrmann@apache.org> wrote:
> 
> Hi Gwenhaël,
> 
> good to hear that you could resolve the problem.
> 
> When you run multiple HA Flink jobs in the same cluster, you don’t have to adjust the Flink configuration. It should work out of the box.
> 
> However, if you run multiple HA Flink clusters, then you have to set a distinct ZooKeeper root path for each cluster via the option recovery.zookeeper.path.root in the Flink configuration. This is necessary because otherwise the JobManagers of the different clusters will all compete for a single leadership. Furthermore, all TaskManagers will only see the one and only leader and connect to it. The reason is that the TaskManagers look up their leader at a ZNode below the ZooKeeper root path.
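> 
> For illustration, a minimal sketch with two separate clusters, where each cluster’s Flink configuration points to its own root path (the path names are just examples):
> 
> # Flink configuration of cluster A
> recovery.zookeeper.path.root: /flink/cluster-a
> 
> # Flink configuration of cluster B
> recovery.zookeeper.path.root: /flink/cluster-b
> 
> With distinct root paths, the JobManagers of each cluster elect their own leader, and the TaskManagers only look up the leader ZNode under their own cluster’s root.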
> 
> If you have other questions then don’t hesitate asking me.
> 
> Cheers,
> Till
> 
> 
> On Wed, Nov 18, 2015 at 6:37 PM, Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com> wrote:
> Nevermind,
> 
>  
> 
> Looking at the logs I saw that it was having issues trying to connect to ZK.
> 
> To make it short: it had the wrong port.
> 
>  
> 
> It is now starting.
> 
>  
> 
> Tomorrow I’ll try to kill some JobManagers *evil*.
> 
>  
> 
> Another question: if I have multiple HA Flink jobs, are there some points to check in order to be sure that they won’t collide on HDFS or ZK?
> 
>  
> 
> B.R.
> 
>  
> 
> Gwenhaël PASQUIERS
> 
>  
> 
> From: Till Rohrmann [mailto:till.rohrmann@gmail.com] 
> Sent: Wednesday, 18 November 2015 18:01
> To: user@flink.apache.org
> Subject: Re: YARN High Availability
> 
>  
> 
> Hi Gwenhaël,
> 
>  
> 
> do you have access to the YARN logs?
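> 
> (If log aggregation is enabled, one way to fetch them is the YARN CLI, e.g. yarn logs -applicationId <application id>, where <application id> is whatever id YARN assigned to the Flink session.)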
> 
>  
> 
> Cheers,
> 
> Till
> 
>  
> 
> On Wed, Nov 18, 2015 at 5:55 PM, Gwenhael Pasquiers <gwenhael.pasquiers@ericsson.com> wrote:
> 
> Hello,
> 
>  
> 
> We’re trying to set up high availability using an existing ZooKeeper quorum already running in our Cloudera cluster.
> 
>  
> 
> So, as per the doc, we’ve changed the max attempts in YARN’s config as well as in the flink.yaml.
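> 
> (On the YARN side, the “max attempts” setting is yarn.resourcemanager.am.max-attempts in yarn-site.xml; the snippet below is only an illustration with an example value:)
> 
> <property>
>   <!-- example value; acts as an upper bound for Flink's yarn.application-attempts -->
>   <name>yarn.resourcemanager.am.max-attempts</name>
>   <value>1000</value>
> </property>
> 
> And in the flink.yaml: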
> 
>  
> 
> recovery.mode: zookeeper
> 
> recovery.zookeeper.quorum: host1:3181,host2:3181,host3:3181
> 
> state.backend: filesystem
> 
> state.backend.fs.checkpointdir: hdfs:///flink/checkpoints
> 
> recovery.zookeeper.storageDir: hdfs:///flink/recovery/
> 
> yarn.application-attempts: 1000
> 
>  
> 
> Everything is OK as long as recovery.mode is commented out.
> 
> As soon as I uncomment recovery.mode, the deployment on YARN is stuck on:
> 
>  
> 
> “Deploying cluster, current state ACCEPTED”.
> 
> “Deployment took more than 60 seconds….”
> 
> Every second.
> 
>  
> 
> And I have more than enough resources available on my YARN cluster.
> 
>  
> 
> Do you have any idea of what could cause this, and/or what logs I should look for in order to understand?
> 
>  
> 
> B.R.
> 
>  
> 
> Gwenhaël PASQUIERS
> 
>  
> 
> 

