From issues-return-173849-archive-asf-public=cust-asf.ponee.io@flink.apache.org Fri Jun 29 04:13:37 2018
From: Clarkkkkk
To: issues@flink.apache.org
Reply-To: issues@flink.apache.org
Subject: [GitHub] flink pull request #6192: [FLINK-9567][runtime][yarn] Fix the bug that Flink...
Message-Id: <20180629021336.74777E1072@git1-us-west.apache.org>
Date: Fri, 29 Jun 2018 02:13:36 +0000 (UTC)

Github user Clarkkkkk commented on a diff in the pull request:

    https://github.com/apache/flink/pull/6192#discussion_r199036840

    --- Diff: flink-yarn/src/main/java/org/apache/flink/yarn/YarnResourceManager.java ---
    @@ -334,8 +335,11 @@ public void onContainersCompleted(final List<ContainerStatus> list) {
     			if (yarnWorkerNode != null) {
     				// Container completed unexpectedly ~> start a new one
     				final Container container = yarnWorkerNode.getContainer();
    -				requestYarnContainer(container.getResource(), yarnWorkerNode.getContainer().getPriority());
    -				closeTaskManagerConnection(resourceId, new Exception(containerStatus.getDiagnostics()));
    +				// check WorkerRegistration status to avoid requesting containers more than required
    +				if (checkWorkerRegistrationWithResourceId(resourceId)) {
    --- End diff --
    
    Yes, it might happen. The problem is not as simple as I thought. The actual cause is that the resource was released before a full restart, but the onContainersCompleted callback fired after the full restart. Since the full restart re-requests all the containers needed as configured, if onContainersCompleted is called after that point, it will request an extra container and hold on to it even though it is not needed.

---
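(For context, a minimal standalone sketch of the race the comment describes, assuming a simplified registration map and hypothetical helper names; only the registration guard itself mirrors the diff, everything else is illustrative and not the actual YarnResourceManager code.)

    import java.util.Map;
    import java.util.concurrent.ConcurrentHashMap;

    /**
     * Sketch of the guarded container-completed callback. After a full restart
     * has already re-requested every configured container, a late completion
     * callback must not request another one, or the resource manager ends up
     * holding more containers than configured.
     */
    public class ContainerCallbackSketch {

        // Hypothetical stand-in for the RM's worker-registration bookkeeping.
        private final Map<String, Boolean> workerRegistrations = new ConcurrentHashMap<>();

        /** True only while a live registration still exists for this resource. */
        private boolean checkWorkerRegistrationWithResourceId(String resourceId) {
            return workerRegistrations.containsKey(resourceId);
        }

        /** Simplified callback: replace only workers that are still registered. */
        public void onContainerCompleted(String resourceId) {
            if (checkWorkerRegistrationWithResourceId(resourceId)) {
                // Container died while still registered: a replacement is needed.
                requestReplacementContainer(resourceId);
                workerRegistrations.remove(resourceId);
            }
            // Otherwise the callback arrived after a full restart already
            // re-requested everything; requesting again would over-allocate.
        }

        // Illustrative stub for the real container request path.
        private void requestReplacementContainer(String resourceId) {
            System.out.println("requesting replacement for " + resourceId);
        }

        public static void main(String[] args) {
            ContainerCallbackSketch rm = new ContainerCallbackSketch();
            rm.workerRegistrations.put("container_01", true);

            rm.onContainerCompleted("container_01"); // still registered -> replacement requested
            rm.onContainerCompleted("container_02"); // late callback after restart -> ignored
        }
    }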