Date: Mon, 21 Nov 2016 11:23:58 +0000 (UTC)
From: "Maximilian Michels (JIRA)"
To: issues@flink.apache.org
Reply-To: dev@flink.apache.org
Subject: [jira] [Comment Edited] (FLINK-5081) unable to set yarn.maximum-failed-containers with flink one-time YARN setup

    [ https://issues.apache.org/jira/browse/FLINK-5081?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15683082#comment-15683082 ]

Maximilian Michels edited comment on FLINK-5081 at 11/21/16 11:23 AM:
----------------------------------------------------------------------

I've had a second look.

-The issue is not that the configuration is not loaded. Moreover, your finding reveals at least two other issues with our per-job YARN implementation:-

-1. When executing in non-detached job submission mode, the "Client Shutdown Hook" shuts down the Yarn application in case of job failures (e.g. a TaskManager dies). We should remove the shutdown hook; it should only be active during deployment.-

-2. The per-job Yarn application is supposed to shut down the cluster automatically after job completion. In case of failures (e.g. a TaskManager dies), the shutdown is apparently performed as well, although it shouldn't be.-

edit: 1) is not an issue, since the hook only shuts the application down once it has reached a terminal state. 2) is an issue, but unrelated to this one.

-The actual issue here is that the JobManager informs the client of the failed job and the client shuts down the cluster. We should differentiate between fatal and non-fatal failures in the client.-

edit2: Not an issue either :) You probably forgot to configure the restart strategy:

{noformat}
# defaults to none
restart-strategy: fixed-delay
# defaults to 1
restart-strategy.fixed-delay.attempts: 10
{noformat}

With that in place, we can kill TaskManagers and resume job execution after a container restart without problems.
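The same fixed-delay strategy can also be set per job through the API, in which case it takes precedence over the flink-conf.yaml default. A minimal sketch, assuming the {{RestartStrategies}} API available in these Flink versions; the 10-second delay is an arbitrary choice, not a value from this ticket:

{code:java}
import java.util.concurrent.TimeUnit;

import org.apache.flink.api.common.restartstrategy.RestartStrategies;
import org.apache.flink.api.common.time.Time;
import org.apache.flink.streaming.api.environment.StreamExecutionEnvironment;

public class RestartStrategyExample {

    public static void main(String[] args) throws Exception {
        StreamExecutionEnvironment env =
                StreamExecutionEnvironment.getExecutionEnvironment();

        // Mirrors restart-strategy.fixed-delay.* from flink-conf.yaml:
        // up to 10 restart attempts, waiting 10 seconds between attempts
        // (the delay value here is an assumption for illustration).
        env.setRestartStrategy(
                RestartStrategies.fixedDelayRestart(10, Time.of(10, TimeUnit.SECONDS)));

        // ... define and execute the job as usual ...
    }
}
{code}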
was (Author: mxm):
I've had a second look.

-The issue is not that the configuration is not loaded. Moreover, your finding reveals at least two other issues with our per-job YARN implementation:-

-1. When executing in non-detached job submission mode, the "Client Shutdown Hook" shuts down the Yarn application in case of job failures (e.g. a TaskManager dies). We should remove the shutdown hook; it should only be active during deployment.-

-2. The per-job Yarn application is supposed to shut down the cluster automatically after job completion. In case of failures (e.g. a TaskManager dies), the shutdown is apparently performed as well, although it shouldn't be.-

edit: 1) is not an issue, since the hook only shuts the application down once it has reached a terminal state. 2) is an issue, but unrelated to this one.

The actual issue here is that the JobManager informs the client of the failed job and the client shuts down the cluster. We should differentiate between fatal and non-fatal failures in the client.

> unable to set yarn.maximum-failed-containers with flink one-time YARN setup
> ----------------------------------------------------------------------------
>
>                 Key: FLINK-5081
>                 URL: https://issues.apache.org/jira/browse/FLINK-5081
>             Project: Flink
>          Issue Type: Bug
>          Components: Startup Shell Scripts
>    Affects Versions: 1.2.0, 1.1.4
>            Reporter: Nico Kruber
>            Assignee: Maximilian Michels
>             Fix For: 1.2.0, 1.1.4
>
> When letting Flink set up YARN for a one-time job, it apparently does not deliver the {{yarn.maximum-failed-containers}} parameter to YARN the way the {{yarn-session.sh}} script does. Adding it to conf/flink-conf.yaml, as https://ci.apache.org/projects/flink/flink-docs-master/setup/yarn_setup.html#recovery-behavior-of-flink-on-yarn suggests, does not work either.
> Example:
> {code:none}
> flink run -m yarn-cluster -yn 3 -yjm 1024 -ytm 4096 .jar --parallelism 3 -Dyarn.maximum-failed-containers=100
> {code}
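In the example above, {{-Dyarn.maximum-failed-containers=100}} comes after the jar, where the CLI treats it as a program argument rather than a client option. For comparison, a sketch of passing the setting as a YARN dynamic property instead, assuming the {{-yD}} flag of {{flink run}} is available in the affected versions:

{code:none}
flink run -m yarn-cluster -yn 3 -yjm 1024 -ytm 4096 -yD yarn.maximum-failed-containers=100 .jar --parallelism 3
{code}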