From issues-return-179530-archive-asf-public=cust-asf.ponee.io@flink.apache.org Mon Jul 23 16:39:04 2018
Mailing-List: contact issues-help@flink.apache.org; run by ezmlm
Reply-To: dev@flink.apache.org
Date: Mon, 23 Jul 2018 14:39:00 +0000 (UTC)
From: "ASF GitHub Bot (JIRA)"
To: issues@flink.apache.org
Subject: [jira] [Commented] (FLINK-9912) Release TaskExecutors from SlotPool if all slots have been removed

[ https://issues.apache.org/jira/browse/FLINK-9912?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16552938#comment-16552938 ]

ASF GitHub Bot commented on FLINK-9912:
---------------------------------------

GitHub user tillrohrmann opened a pull request:

https://github.com/apache/flink/pull/6394

[FLINK-9912][JM] Release TaskExecutors if they have no slots registered at SlotPool

## What is the purpose of the change

This commit extends the SlotPool's behaviour when failing an allocation by sending a notification message to the TaskExecutor about the freed slot. Moreover, it checks whether the affected TaskExecutor still has slots registered. If it does not, the connection to the TaskExecutor is eagerly closed.

This PR is based on #6389.

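To illustrate the behaviour described above, here is a minimal, self-contained sketch. It is not Flink's implementation: `SlotPoolSketch`, `TaskExecutorStub`, `registerSlot` and the string identifiers are hypothetical stand-ins, and only the flow (notify the owner about the freed slot, then close the connection once no slots remain) is taken from the description.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Set;

/** Hypothetical, simplified model; none of these types are Flink's real classes. */
public class SlotPoolSketch {

    /** Stand-in for the gateway to a TaskExecutor; it only records the calls it receives. */
    static final class TaskExecutorStub {
        final List<String> freedSlots = new ArrayList<>();
        boolean connected = true;

        void freeSlot(String allocationId) {
            freedSlots.add(allocationId);
        }

        void disconnect() {
            connected = false;
        }
    }

    // allocation id -> id of the owning TaskExecutor
    private final Map<String, String> slotOwners = new HashMap<>();
    // TaskExecutor id -> allocation ids it currently has registered here
    private final Map<String, Set<String>> slotsByExecutor = new HashMap<>();
    private final Map<String, TaskExecutorStub> executors = new HashMap<>();

    void registerSlot(String executorId, TaskExecutorStub executor, String allocationId) {
        executors.put(executorId, executor);
        slotOwners.put(allocationId, executorId);
        slotsByExecutor.computeIfAbsent(executorId, id -> new HashSet<>()).add(allocationId);
    }

    /** Fails an allocation; returns true if the owner's connection was closed as a result. */
    boolean failAllocation(String allocationId) {
        String executorId = slotOwners.remove(allocationId);
        if (executorId == null) {
            return false; // unknown allocation, nothing to do
        }

        // Step 1: tell the owning TaskExecutor that the slot has been freed.
        TaskExecutorStub executor = executors.get(executorId);
        executor.freeSlot(allocationId);

        // Step 2: if no slots remain for this TaskExecutor, close the connection eagerly.
        Set<String> remaining = slotsByExecutor.get(executorId);
        remaining.remove(allocationId);
        if (remaining.isEmpty()) {
            slotsByExecutor.remove(executorId);
            executors.remove(executorId);
            executor.disconnect();
            return true;
        }
        return false;
    }

    public static void main(String[] args) {
        SlotPoolSketch pool = new SlotPoolSketch();
        TaskExecutorStub taskExecutor = new TaskExecutorStub();
        pool.registerSlot("te-1", taskExecutor, "slot-a");
        pool.registerSlot("te-1", taskExecutor, "slot-b");

        System.out.println("closed after first failure: " + pool.failAllocation("slot-a"));
        System.out.println("closed after second failure: " + pool.failAllocation("slot-b"));
    }
}
```

In this sketch the first failure leaves the connection open because one slot is still registered; only the second failure triggers `disconnect()`. In the real system the interaction would go through RPC gateways and the actor's main thread rather than direct method calls.
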
## Brief change log

- send `freeSlot` message to owning `TaskExecutor` of failed `AllocatedSlot`
- close `TaskExecutor` connection if it no longer has slots registered at the `JobMaster`

## Verifying this change

- Added `SlotPoolTest#testFreeFailedSlots`, `SlotPoolTest#testFailingAllocationFailsPendingSlotRequests` and `JobMasterTest#testReleasingTaskExecutorIfNoMoreSlotsRegistered`

## Does this pull request potentially affect one of the following parts:

- Dependencies (does it add or upgrade a dependency): (no)
- The public API, i.e., is any changed class annotated with `@Public(Evolving)`: (no)
- The serializers: (no)
- The runtime per-record code paths (performance sensitive): (no)
- Anything that affects deployment or recovery: JobManager (and its components), Checkpointing, Yarn/Mesos, ZooKeeper: (yes)
- The S3 file system connector: (no)

## Documentation

- Does this pull request introduce a new feature? (no)
- If yes, how is the feature documented? (not applicable)

You can merge this pull request into a Git repository by running:

    $ git pull https://github.com/tillrohrmann/flink releaseTaskExecutors

Alternatively you can review and apply these changes as the patch at:

    https://github.com/apache/flink/pull/6394.patch

To close this pull request, make a commit to your master/trunk branch
with (at least) the following in the commit message:

    This closes #6394

----

commit 52cd0269951f6ee2c86ca05aa95f6b43dfdd256c
Author: Till Rohrmann
Date: 2018-07-19T11:07:44Z

[FLINK-9838][logging] Don't log slot request failures on the ResourceManager

commit eac64952425fcc9ce51c768ac953523116661ef9
Author: Till Rohrmann
Date: 2018-07-19T11:41:03Z

[hotfix] Improve logging of SlotPool and SlotSharingManager

commit b474cda88812d63d38e8294b4347ecbc554c4597
Author: Till Rohrmann
Date: 2018-07-22T18:05:05Z

[FLINK-9908][scheduling] Do not cancel individual scheduling future

Since the individual scheduling futures contain logic to release the slot if it cannot be assigned to the Execution,
we must not cancel them. Otherwise, slots might not be returned to the SlotPool, leaving it in an inconsistent state.

commit f997860c2a0c479ea4036f0a7174b64f2b3acfc9
Author: Till Rohrmann
Date: 2018-07-22T18:17:11Z

[FLINK-9909][core] ConjunctFuture does not cancel input futures

If a ConjunctFuture is cancelled, then it won't cancel all of its input futures automatically. If users need this behaviour, they have to implement it explicitly. The reason for this change is that an implicit cancellation can have unwanted side effects, because all of the cancelled input futures' producers won't be executed.

commit 30c3eb6bf2e32ea0eb18cc82262966bf716884d6
Author: Till Rohrmann
Date: 2018-07-22T18:20:53Z

[hotfix] Fix checkstyle violations in FutureUtils

commit 4f3ec0f88a2c27cbe7f33a82b09b44124e1b34c3
Author: Till Rohrmann
Date: 2018-07-22T18:34:33Z

[hotfix] Replace check state condition in Execution#tryAssignResource with if check

Instead of risking an IllegalStateException, it is better to check that the taskManagerLocationFuture has not been completed yet. If it has, we also reject the assignment of the LogicalSlot to the Execution. That way, we no longer risk not releasing the slot in case of an exception in Execution#allocateAndAssignSlotForExecution.

commit 0f8208642d3aa561148e9e7b95c736c932e9f034
Author: Till Rohrmann
Date: 2018-07-22T18:43:44Z

[hotfix] Fix checkstyle violations in ExecutionVertex

commit 6ee88195fadea4badfad4a50ad832be5509d78a1
Author: Till Rohrmann
Date: 2018-07-22T18:46:37Z

[hotfix] Fix checkstyle violations in ExecutionJobVertex

commit 8193243d61238c2787b8d9b35ac9681709c07ddb
Author: Till Rohrmann
Date: 2018-07-22T18:48:53Z

[hotfix] Fix checkstyle violations in Execution

commit c7fc51372abe3866a3972e78b590e3791b746c65
Author: Till Rohrmann
Date: 2018-07-22T19:38:42Z

[FLINK-9910][scheduling] Execution#scheduleForExecution does not cancel slot future

In order to properly give back an allocated slot to the SlotPool, one must not complete the result future of Execution#allocateAndAssignSlotForExecution. This commit changes the behaviour in Execution#scheduleForExecution accordingly.

commit a58b755750ada229102d3d18cd89767ad7fe3b6d
Author: Till Rohrmann
Date: 2018-07-22T19:57:59Z

[FLINK-9911][JM] Use SlotPoolGateway to call failAllocation

Since the SlotPool is an actor, we must use the SlotPoolGateway to interact with the SlotPool. Otherwise, we risk an inconsistent state because multiple threads would be modifying the component.

commit 0772a52fe859bf00ec5dada2395e0296202ec469
Author: Till Rohrmann
Date: 2018-07-22T20:11:13Z

[FLINK-9917][JM] Remove superfluous lock from SlotSharingManager

The SlotSharingManager is designed to be used by a single thread. Therefore, it is the responsibility of the caller to make sure that only a single thread accesses this component at any given time. Consequently, the component does not need to be synchronized.

commit 75fb9af3ce4245dea9f704e06a3acd87b8dcd8e0
Author: Till Rohrmann
Date: 2018-07-22T20:58:18Z

[FLINK-9912][JM] Release TaskExecutors if they have no slots registered at SlotPool

This commit extends the SlotPool's behaviour when failing an allocation by sending a notification message to the TaskExecutor about the freed slot.
Moreover, it checks whether the affected TaskExecutor still has slots registered. If it does not, the connection to the TaskExecutor is eagerly closed.

----

> Release TaskExecutors from SlotPool if all slots have been removed
> ------------------------------------------------------------------
>
> Key: FLINK-9912
> URL: https://issues.apache.org/jira/browse/FLINK-9912
> Project: Flink
> Issue Type: Improvement
> Components: Distributed Coordination
> Affects Versions: 1.5.1, 1.6.0
> Reporter: Till Rohrmann
> Assignee: Till Rohrmann
> Priority: Major
> Labels: pull-request-available
>
> Currently, it is possible to fail slot allocations in the {{SlotPool}}. Failing an allocation means that the slot is removed from the {{SlotPool}}. If we have removed all slots from a {{TaskExecutor}}, then we should also release/close the connection to this {{TaskExecutor}}. At the moment, this only happens via the heartbeats if the {{TaskExecutor}} has become unreachable.

--
This message was sent by Atlassian JIRA (v7.6.3#76005)