Mailing-List: contact dev-help@flink.apache.org; run by ezmlm
Precedence: bulk
Reply-To: dev@flink.apache.org
Date: Wed, 4 Nov 2015 15:44:27 +0000 (UTC)
From: "Ufuk Celebi (JIRA)" <jira@apache.org>
To: dev@flink.apache.org
Message-ID: <JIRA.12910374.1446651847000.151714.1446651867802@Atlassian.JIRA>
In-Reply-To: <JIRA.12910374.1446651847000@Atlassian.JIRA>
References: <JIRA.12910374.1446651847000@Atlassian.JIRA>
 <JIRA.12910374.1446651847558@arcas>
Subject: [jira] [Created] (FLINK-2969) FlinkYarnSessionCli with recovery
 enabled fails when killing TaskManager
MIME-Version: 1.0
Content-Type: text/plain; charset=utf-8
Content-Transfer-Encoding: 7bit

Ufuk Celebi created FLINK-2969:
----------------------------------

             Summary: FlinkYarnSessionCli with recovery enabled fails when killing TaskManager
                 Key: FLINK-2969
                 URL: https://issues.apache.org/jira/browse/FLINK-2969
             Project: Flink
          Issue Type: Bug
          Components: Distributed Runtime, YARN Client
    Affects Versions: 0.10
            Reporter: Ufuk Celebi


I'm running a YARN session with 2 physical nodes and 5 containers (ApplicationMaster and 4 TaskManagers). There is no Flink program submitted to the cluster.

While running a sequence of failure operations (killing the ApplicationMaster and TaskManager containers), I sometimes get the following exception in the CLI after killing a TaskManager:

{code}
15:31:20,721 WARN  org.apache.flink.client.FlinkYarnSessionCli                   - Exception while running the interactive command line interface
java.lang.RuntimeException: Unable to get Cluster status from Application Client
	at org.apache.flink.yarn.FlinkYarnCluster.getClusterStatus(FlinkYarnCluster.java:307)
	at org.apache.flink.client.FlinkYarnSessionCli.runInteractiveCli(FlinkYarnSessionCli.java:296)
	at org.apache.flink.client.FlinkYarnSessionCli.run(FlinkYarnSessionCli.java:455)
	at org.apache.flink.client.FlinkYarnSessionCli.main(FlinkYarnSessionCli.java:351)
Caused by: akka.pattern.AskTimeoutException: Recipient[Actor[akka://flink/user/applicationClient#-607831833]] had already been terminated.
	at akka.pattern.AskableActorRef$.ask$extension(AskSupport.scala:132)
	at akka.pattern.AskableActorRef$.$qmark$extension(AskSupport.scala:144)
	at akka.pattern.AskSupport$class.ask(AskSupport.scala:75)
	at akka.pattern.package$.ask(package.scala:43)
	at akka.pattern.Patterns$.ask(Patterns.scala:47)
	at akka.pattern.Patterns.ask(Patterns.scala)
	at org.apache.flink.yarn.FlinkYarnCluster.getClusterStatus(FlinkYarnCluster.java:302)
	... 3 more
{code}
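
To make the failure mode in the trace concrete: the interactive CLI asks for the cluster status, the recipient has already been terminated, and the resulting exception ends the CLI loop. The following is only a rough sketch of how a status poll could tolerate a few failed queries instead of aborting. It is not Flink code and not a proposed fix; `StatusPollSketch`, `Status`, and `pollWithRetry` are hypothetical names, and the query is a plain `Callable` standing in for the ask against the application client.

```java
import java.util.concurrent.Callable;

public class StatusPollSketch {

    /** Hypothetical stand-in for the cluster status reported to the CLI. */
    static final class Status {
        final int taskManagers;

        Status(int taskManagers) {
            this.taskManagers = taskManagers;
        }
    }

    /**
     * Runs the status query up to maxAttempts times, sleeping backoffMillis
     * between failed attempts. Returns null if every attempt failed, so the
     * caller can decide whether to keep the session alive.
     */
    static Status pollWithRetry(Callable<Status> query, int maxAttempts, long backoffMillis)
            throws InterruptedException {
        for (int attempt = 1; attempt <= maxAttempts; attempt++) {
            try {
                return query.call();
            } catch (InterruptedException e) {
                // Do not swallow interruption; let the caller shut down.
                throw e;
            } catch (Exception e) {
                System.err.println("Status query failed (attempt " + attempt + "): " + e.getMessage());
                if (attempt < maxAttempts) {
                    Thread.sleep(backoffMillis);
                }
            }
        }
        return null;
    }

    public static void main(String[] args) throws Exception {
        final int[] calls = {0};

        // Assumption for illustration: the first two queries fail the way the
        // trace above does, then the status becomes available again.
        Callable<Status> flakyQuery = () -> {
            if (++calls[0] < 3) {
                throw new RuntimeException("Unable to get Cluster status from Application Client");
            }
            return new Status(3);
        };

        Status status = pollWithRetry(flakyQuery, 5, 10L);
        System.out.println(status == null
                ? "Cluster status unavailable"
                : "TaskManagers: " + status.taskManagers);
    }
}
```

Whether retrying is the right reaction depends on why the application client actor terminated in the first place, which is what still needs to be investigated.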

I would like to investigate this for the 0.10.1/1.0 release and not block the current RC.

