flink-dev mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From Till Rohrmann <trohrm...@apache.org>
Subject Re: Question about flink client
Date Fri, 24 Mar 2017 15:09:29 GMT
Hi Yelei,

thanks for investigating the problem and pointing out the problematic
parts. In fact, I recently stumbled across the very same problem in the
JobClientActor and wrote a fix for it. It is already merged into the
master. I hope that this fix solved the problem you've described.


On Wed, Mar 22, 2017 at 3:42 PM, Yelei Feng <fengyel@outlook.com> wrote:

> Hi,
> I have two questions about flink client in interactive mode.
> One is for yarn-session.sh,  once the session CLI can't get cluster stauts
> (jobmanager is down), it will try to shutdown the cluster and cleanup
> related files even if new jobmanager will be created soon. As result,  yarn
> will fail to start a new jobmanager due to missing files on HDFS. As a
> workround, I can config `akka.lookup.timeout` to wait a bit longer,  say 60
> seconds. But I'm wondering if it will affect other components.
> Second is about flink cli. If cluster is down after submiting job using
> 'flink run xx.jar',  cli hangs there only showing "New JobManager elected.
> Connecting to null " instead of cleanup and close itself.
> After some digging, I found the main logic is in JobClientActor. It
> receives jobmanager status changes from two sources: zookeeper and akka
> deathwatch. It would terminate itself once receiving message
> `ConnectionTimeout`.
> Client sets current leaderSessionId and unwatch previous jobmanager from
> ZK; it receives `Teminated` of previous jobmanager from akka deathwatch and
> send `ConnectionTimeout` to itself after 60s. In a great chance, they would
> interfere with each other.
> Situation1:
>   1.  client get notified from zk, set leaderSessionId to null
>   2.  client unwatch previous jobmanager
>   3.  msg `Teminated` of previous jobmanager never got received
> Situation 2:
>   1.  msg `Teminated` of current jobmanager is received
>   2.  schedule msg ConnectionTimeout after 60s
>   3.  client get notified from zk, set `leaderSessionId` to null in less
> than 60s
>   4.  `ConnectionTimeout` will be filtered out due to different
> `leaderSessionId`
> Both of the two problems only happen in interactive mode,  not in detached
> mode.  I wonder if it's issues for interactive mode, or we should only use
> detached mode in production environment?
> Any insight is appreciated.
> Thanks,
> Yelei

  • Unnamed multipart/alternative (inline, None, 0 bytes)
View raw message