flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-9891) Flink cluster is not shutdown in YARN mode when Flink client is stopped
Date Wed, 26 Sep 2018 12:07:00 GMT

    [ https://issues.apache.org/jira/browse/FLINK-9891?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16628662#comment-16628662 ]

ASF GitHub Bot commented on FLINK-9891:
---------------------------------------

asfgit closed pull request #6540: [FLINK-9891] Added hook to shutdown cluster if a session was created in per-job mode.
URL: https://github.com/apache/flink/pull/6540
 
 
   

This is a PR merged from a forked repository.
As GitHub hides the original diff on merge, it is displayed below for
the sake of provenance:

diff --git a/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java b/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java
index 2e78e4a615b..780f8144029 100644
--- a/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java
+++ b/flink-clients/src/main/java/org/apache/flink/client/cli/CliFrontend.java
@@ -58,6 +58,7 @@
 import org.apache.flink.util.ExceptionUtils;
 import org.apache.flink.util.FlinkException;
 import org.apache.flink.util.Preconditions;
+import org.apache.flink.util.ShutdownHookUtil;
 
 import org.apache.commons.cli.CommandLine;
 import org.apache.commons.cli.Options;
@@ -249,13 +250,22 @@ protected void run(String[] args) throws Exception {
 					LOG.info("Could not properly shut down the client.", e);
 				}
 			} else {
+				final Thread shutdownHook;
 				if (clusterId != null) {
 					client = clusterDescriptor.retrieve(clusterId);
+					shutdownHook = null;
 				} else {
 					// also in job mode we have to deploy a session cluster because the job
 					// might consist of multiple parts (e.g. when using collect)
 					final ClusterSpecification clusterSpecification = customCommandLine.getClusterSpecification(commandLine);
 					client = clusterDescriptor.deploySessionCluster(clusterSpecification);
+					// if not running in detached mode, add a shutdown hook to shut down cluster if client exits
+					// there's a race-condition here if cli is killed before shutdown hook is installed
+					if (!runOptions.getDetachedMode()) {
+						shutdownHook = ShutdownHookUtil.addShutdownHook(client::shutDownCluster, client.getClass().getSimpleName(), LOG);
+					} else {
+						shutdownHook = null;
+					}
 				}
 
 				try {
@@ -278,12 +288,12 @@ protected void run(String[] args) throws Exception {
 
 					executeProgram(program, client, userParallelism);
 				} finally {
-					if (clusterId == null && !client.isDetached()) {
+					if (shutdownHook != null) {
 						// terminate the cluster only if we have started it before and if it's not detached
 						try {
-							client.shutDownCluster();
-						} catch (final Exception e) {
-							LOG.info("Could not properly terminate the Flink cluster.", e);
+							shutdownHook.run();
+						} finally {
+							ShutdownHookUtil.removeShutdownHook(shutdownHook, client.getClass().getSimpleName(), LOG);
 						}
 					}
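
The pattern in the diff above (install a JVM shutdown hook when the client deploys a non-detached cluster, then run and unregister that hook in the `finally` block) can be sketched with plain JDK calls. This is only an illustration under stated assumptions: it uses `Runtime.addShutdownHook` / `Runtime.removeShutdownHook` directly instead of Flink's `ShutdownHookUtil`, and `FakeClusterClient` and `runJob` are hypothetical names that do not exist in Flink.

```java
import java.util.concurrent.atomic.AtomicInteger;

public class ShutdownHookSketch {

    /** Hypothetical stand-in for the cluster client; it only counts shutdown calls. */
    static final class FakeClusterClient {
        final AtomicInteger shutdownCalls = new AtomicInteger();

        void shutDownCluster() {
            shutdownCalls.incrementAndGet();
        }
    }

    /**
     * Runs the job. In attached mode the cluster is shut down either by the
     * finally block (normal return or exception) or by the JVM shutdown hook
     * (for example Ctrl+C while the job is still running).
     */
    static void runJob(FakeClusterClient client, boolean detached, Runnable job) {
        final Thread shutdownHook;
        if (!detached) {
            // Assumption: a plain Thread replaces the hook that ShutdownHookUtil would create.
            shutdownHook = new Thread(client::shutDownCluster, "cluster-shutdown-hook");
            Runtime.getRuntime().addShutdownHook(shutdownHook);
        } else {
            shutdownHook = null;
        }

        try {
            job.run();
        } finally {
            if (shutdownHook != null) {
                try {
                    // run() executes the hook body synchronously on the calling thread;
                    // the hook thread itself is never started here.
                    shutdownHook.run();
                } finally {
                    // Unregister so the cluster is not shut down a second time at JVM exit.
                    Runtime.getRuntime().removeShutdownHook(shutdownHook);
                }
            }
        }
    }

    public static void main(String[] args) {
        FakeClusterClient client = new FakeClusterClient();
        runJob(client, false, () -> System.out.println("job running"));
        System.out.println("shutdown calls: " + client.shutdownCalls.get());
    }
}
```

As the comment in the diff notes, a window remains between deploying the cluster and registering the hook; the sketch does not close that window either.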
 


 

----------------------------------------------------------------
This is an automated message from the Apache Git Service.
To respond to the message, please log on GitHub and use the
URL above to go to the specific comment.
 
For queries about this service, please contact Infrastructure at:
users@infra.apache.org


> Flink cluster is not shutdown in YARN mode when Flink client is stopped
> -----------------------------------------------------------------------
>
>                 Key: FLINK-9891
>                 URL: https://issues.apache.org/jira/browse/FLINK-9891
>             Project: Flink
>          Issue Type: Bug
>          Components: Client, YARN
>    Affects Versions: 1.5.0, 1.5.1
>            Reporter: Sergey Krasovskiy
>            Assignee: Andrey Zagrebin
>            Priority: Major
>              Labels: pull-request-available
>             Fix For: 1.7.0, 1.6.2, 1.5.5
>
>
> We are using neither session mode nor detached mode. The command to run a Flink job on YARN is:
> {code:java}
> <flink-1.5.1>/bin/flink run -m yarn-cluster -yn 1 -yqu flink -yjm 768 -ytm 2048 -j ./flink-quickstart-java-1.0-SNAPSHOT.jar -c org.test.WordCount
> {code}
> Flink CLI logs:
> {code:java}
> Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/flink-streaming/flink-streaming-1.5.1-1.5.1-bin-hadoop27-scala_2.11-1531485329/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.10-1/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2018-07-18 12:47:03,747 INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://hmaster-1.ipbl.rgcloud.net:8188/ws/v1/timeline/
> 2018-07-18 12:47:04,222 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-07-18 12:47:04,222 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.YarnClusterDescriptor to locate the jar
> 2018-07-18 12:47:04,248 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set. The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
> 2018-07-18 12:47:04,409 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=768, taskManagerMemoryMB=2048, numberTaskManagers=1, slotsPerTaskManager=1}
> 2018-07-18 12:47:04,783 WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> 2018-07-18 12:47:04,788 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/opt/flink-streaming/flink-streaming-1.5.1-1.5.1-bin-hadoop27-scala_2.11-1531485329/conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
> 2018-07-18 12:47:07,846 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1531474158783_10814
> 2018-07-18 12:47:08,073 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1531474158783_10814
> 2018-07-18 12:47:08,074 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
> 2018-07-18 12:47:08,076 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
> 2018-07-18 12:47:12,864 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
> {code}
> Job Manager logs:
> {code:java}
> 2018-07-18 12:47:09,913 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - --------------------------------------------------------------------------------
> 2018-07-18 12:47:09,915 INFO org.apache.flink.runtime.entrypoint.ClusterEntrypoint - Starting YarnSessionClusterEntrypoint (Version: 1.5.1, Rev:3488f8b, Date:10.07.2018 @ 11:51:27 GMT)
> ...
> {code}
> Issues:
>  # Flink job is running as a Flink session
>  # Ctrl+C or 'stop' stops neither the job nor the YARN cluster
>  # Cancelling the job via the Job Manager web UI doesn't stop the Flink cluster. To kill the cluster we need to run: yarn application -kill <id>
> We also tried to run a flink job with 'mode: legacy' and we have the same issues:
>  # Add property 'mode: legacy' to ./conf/flink-conf.yaml
>  # Execute the following command:
> {code:java}
> <flink-1.5.1>/bin/flink run -m yarn-cluster -yn 1 -yqu flink -yjm 768 -ytm 2048 -j ./flink-quickstart-java-1.0-SNAPSHOT.jar -c org.test.WordCount
> {code}
> Flink CLI logs:
> {code:java}
> Setting HADOOP_CONF_DIR=/etc/hadoop/conf because no HADOOP_CONF_DIR was set.
> SLF4J: Class path contains multiple SLF4J bindings.
> SLF4J: Found binding in [jar:file:/opt/flink-streaming/flink-streaming-1.5.1-1.5.1-bin-hadoop27-scala_2.11-1531485329/lib/slf4j-log4j12-1.7.7.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: Found binding in [jar:file:/usr/hdp/2.4.2.10-1/hadoop/lib/slf4j-log4j12-1.7.10.jar!/org/slf4j/impl/StaticLoggerBinder.class]
> SLF4J: See http://www.slf4j.org/codes.html#multiple_bindings for an explanation.
> SLF4J: Actual binding is of type [org.slf4j.impl.Log4jLoggerFactory]
> 2018-07-18 16:07:13,820 INFO org.apache.hadoop.yarn.client.api.impl.TimelineClientImpl - Timeline service address: http://hmaster-1.ipbl.rgcloud.net:8188/ws/v1/timeline/
> 2018-07-18 16:07:14,165 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.LegacyYarnClusterDescriptor to locate the jar
> 2018-07-18 16:07:14,165 INFO org.apache.flink.yarn.cli.FlinkYarnSessionCli - No path for the flink jar passed. Using the location of class org.apache.flink.yarn.LegacyYarnClusterDescriptor to locate the jar
> 2018-07-18 16:07:14,182 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - Neither the HADOOP_CONF_DIR nor the YARN_CONF_DIR environment variable is set. The Flink YARN Client needs one of these to be set to properly load the Hadoop configuration for accessing YARN.
> 2018-07-18 16:07:14,356 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Cluster specification: ClusterSpecification{masterMemoryMB=768, taskManagerMemoryMB=2048, numberTaskManagers=1, slotsPerTaskManager=1}
> 2018-07-18 16:07:14,703 WARN org.apache.hadoop.hdfs.shortcircuit.DomainSocketFactory - The short-circuit local reads feature cannot be used because libhadoop cannot be loaded.
> 2018-07-18 16:07:14,708 WARN org.apache.flink.yarn.AbstractYarnClusterDescriptor - The configuration directory ('/home/skrasovs/flink-conf') contains both LOG4J and Logback configuration files. Please delete or rename one of them.
> 2018-07-18 16:07:17,678 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Submitting application master application_1531474158783_10843
> 2018-07-18 16:07:17,717 INFO org.apache.hadoop.yarn.client.api.impl.YarnClientImpl - Submitted application application_1531474158783_10843
> 2018-07-18 16:07:17,717 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Waiting for the cluster to be allocated
> 2018-07-18 16:07:17,720 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - Deploying cluster, current state ACCEPTED
> 2018-07-18 16:07:23,527 INFO org.apache.flink.yarn.AbstractYarnClusterDescriptor - YARN application has been deployed successfully.
> Using the parallelism provided by the remote cluster (1). To use another parallelism, set it at the ./bin/flink client.
> Starting execution of program
> 2018-07-18 16:07:23,551 INFO org.apache.flink.yarn.YarnClusterClient - Starting program in interactive mode (detached: false)
> {code}
> Job Manager logs:
> {code:java}
> 2018-07-18 16:07:19,831 INFO org.apache.flink.yarn.YarnApplicationMasterRunner - --------------------------------------------------------------------------------
> 2018-07-18 16:07:19,833 INFO org.apache.flink.yarn.YarnApplicationMasterRunner - Starting YARN ApplicationMaster / ResourceManager / JobManager (Version: 1.5.1, Rev:3488f8b, Date:10.07.2018 @ 11:51:27 GMT)
> {code}



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)
