flink-issues mailing list archives

From "Yelei Feng (JIRA)" <j...@apache.org>
Subject [jira] [Updated] (FLINK-6147) flink client can't detect cluster is down
Date Wed, 22 Mar 2017 03:11:41 GMT

     [ https://issues.apache.org/jira/browse/FLINK-6147?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Yelei Feng updated FLINK-6147:
------------------------------
    Description: 
I tested in YARN mode. Steps to reproduce:
1. flink run xx.jar
2. kill the YARN application
The CLI hangs, showing only "New JobManager elected. Connecting to null", instead of cleaning
up and closing itself.

After some digging, I found that the main logic is in {{JobClientActor}}. It terminates itself
once it receives the message {{ConnectionTimeout}}. It receives JobManager status changes from
two sources: ZooKeeper and Akka death watch. On a notification from ZooKeeper, the client sets
the current {{leaderSessionId}} and unwatches the previous JobManager. On receiving
{{Terminated}} for the previous JobManager from Akka death watch, it sends
{{ConnectionTimeout}} to itself after 60s. There is a good chance that the two sources
interfere with each other.
 
Situation 1:
1. The client is notified by ZooKeeper and sets {{leaderSessionId}} to null.
2. The client unwatches the previous JobManager.
3. The {{Terminated}} message for the previous JobManager is never received.

Situation 2:
1. The {{Terminated}} message for the current JobManager is received.
2. A {{ConnectionTimeout}} message is scheduled for 60s later.
3. The client is notified by ZooKeeper and sets {{leaderSessionId}} to null in less than 60s.
4. {{ConnectionTimeout}} is filtered out because its {{leaderSessionId}} no longer matches.
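
The two situations above can be replayed with a small toy model. This is only a sketch under
stated assumptions, not the actual {{JobClientActor}} code: the class {{ClientRaceSketch}}, its
fields, and its method names are all hypothetical, and it assumes that the scheduled
{{ConnectionTimeout}} carries the {{leaderSessionId}} that was current when it was scheduled.
The 60s delay is modelled simply as the order in which the handlers are called.

```java
import java.util.Objects;
import java.util.UUID;

/** Toy model of the client state described above (hypothetical, not Flink code). */
final class ClientRaceSketch {
    private UUID leaderSessionId;       // null means "no leader known"
    private String watchedJobManager;   // address under death watch, or null
    private boolean timeoutScheduled;   // a ConnectionTimeout is pending
    private UUID timeoutSessionId;      // session id stamped on the pending timeout
    private boolean terminated;         // the client shut itself down

    ClientRaceSketch(String jobManager, UUID sessionId) {
        this.watchedJobManager = jobManager;
        this.leaderSessionId = sessionId;
    }

    /** ZooKeeper notification: adopt the new session id, unwatch the previous JobManager. */
    void onLeaderChange(String newJobManager, UUID newSessionId) {
        leaderSessionId = newSessionId;
        watchedJobManager = newJobManager;
    }

    /** Death watch: Terminated only arrives for a JobManager that is still watched. */
    void onTerminated(String jobManager) {
        if (jobManager.equals(watchedJobManager)) {
            timeoutScheduled = true;
            timeoutSessionId = leaderSessionId;
        }
    }

    /** Delivery of the scheduled timeout; dropped if the session id has changed. */
    void onConnectionTimeout() {
        if (timeoutScheduled && Objects.equals(timeoutSessionId, leaderSessionId)) {
            terminated = true;
        }
    }

    /** Situation 1: ZooKeeper notifies first, so Terminated is never received. */
    static boolean situation1() {
        ClientRaceSketch client = new ClientRaceSketch("jm-1", UUID.randomUUID());
        client.onLeaderChange(null, null);
        client.onTerminated("jm-1");      // ignored: jm-1 is no longer watched
        client.onConnectionTimeout();     // nothing was scheduled
        return client.terminated;
    }

    /** Situation 2: the timeout is scheduled, then filtered out after a leader change. */
    static boolean situation2() {
        ClientRaceSketch client = new ClientRaceSketch("jm-1", UUID.randomUUID());
        client.onTerminated("jm-1");      // schedules the timeout with the old session id
        client.onLeaderChange(null, null);
        client.onConnectionTimeout();     // session id mismatch, message dropped
        return client.terminated;
    }

    /** Control case: no ZooKeeper notification arrives before the timeout fires. */
    static boolean noInterference() {
        ClientRaceSketch client = new ClientRaceSketch("jm-1", UUID.randomUUID());
        client.onTerminated("jm-1");
        client.onConnectionTimeout();
        return client.terminated;
    }

    public static void main(String[] args) {
        System.out.println("situation 1 terminates: " + situation1());
        System.out.println("situation 2 terminates: " + situation2());
        System.out.println("no interference terminates: " + noInterference());
    }
}
```

In this model the client shuts down only in the control case. In both situations it never
terminates, which would match the hanging CLI reported above.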

  was:
I tested in yarn mode, reproduce step:
1. flink run xx.jar
2. kill yarn application
CLI hangs there only showing "New JobManager elected. Connecting to null " instead of cleanup
and close itself.

After some digging, I found the main logic is in {{JobClientActor}}. It would terminate itself
once receiving message {{ConnectionTimeout}}. It receive jobmanager status changes from two
sources: zookeeper and akka deathwatch. Client sets current  {{leaderSessionId} and unwatch
previous jobmanager from zk, receives {{Teminated}} of previous jobmanager from akka deathwatch
and send {{ConnectionTimeout}} to itself after 60s. In a great chance, they would interfere
with each other.
 
Situation1:
1. client get notified from zk, set {{leaderSessionId}} to null
2. client unwatch previous jobmanager
3. msg {{Teminated}} of previous jobmanager never got received

Situation 2:
1. msg {{Teminated}} of current jobmanager is received
2. schedule msg {{ConnectionTimeout}} after 60s
3. client get notified from zk, set {{leaderSessionId}} to null in less than 60s
4. {{ConnectionTimeout}} will be filtered out due to different  {{leaderSessionId}}


> flink client can't detect cluster is down
> -----------------------------------------
>
>                 Key: FLINK-6147
>                 URL: https://issues.apache.org/jira/browse/FLINK-6147
>             Project: Flink
>          Issue Type: Bug
>          Components: Client
>    Affects Versions: 1.2.0, 1.3.0
>            Reporter: Yelei Feng
>              Labels: client
>
> I tested in YARN mode. Steps to reproduce:
> 1. flink run xx.jar
> 2. kill the YARN application
> The CLI hangs, showing only "New JobManager elected. Connecting to null", instead of
> cleaning up and closing itself.
> After some digging, I found that the main logic is in {{JobClientActor}}. It terminates
> itself once it receives the message {{ConnectionTimeout}}. It receives JobManager status
> changes from two sources: ZooKeeper and Akka death watch. On a notification from ZooKeeper,
> the client sets the current {{leaderSessionId}} and unwatches the previous JobManager. On
> receiving {{Terminated}} for the previous JobManager from Akka death watch, it sends
> {{ConnectionTimeout}} to itself after 60s. There is a good chance that the two sources
> interfere with each other.
>  
> Situation 1:
> 1. The client is notified by ZooKeeper and sets {{leaderSessionId}} to null.
> 2. The client unwatches the previous JobManager.
> 3. The {{Terminated}} message for the previous JobManager is never received.
> Situation 2:
> 1. The {{Terminated}} message for the current JobManager is received.
> 2. A {{ConnectionTimeout}} message is scheduled for 60s later.
> 3. The client is notified by ZooKeeper and sets {{leaderSessionId}} to null in less than 60s.
> 4. {{ConnectionTimeout}} is filtered out because its {{leaderSessionId}} no longer matches.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
