Return-Path: X-Original-To: archive-asf-public-internal@cust-asf2.ponee.io Delivered-To: archive-asf-public-internal@cust-asf2.ponee.io Received: from cust-asf.ponee.io (cust-asf.ponee.io [163.172.22.183]) by cust-asf2.ponee.io (Postfix) with ESMTP id 9AEB3200B78 for ; Fri, 2 Sep 2016 14:23:22 +0200 (CEST) Received: by cust-asf.ponee.io (Postfix) id 99599160ACB; Fri, 2 Sep 2016 12:23:22 +0000 (UTC) Delivered-To: archive-asf-public@cust-asf.ponee.io Received: from mail.apache.org (hermes.apache.org [140.211.11.3]) by cust-asf.ponee.io (Postfix) with SMTP id E180C160AAB for ; Fri, 2 Sep 2016 14:23:21 +0200 (CEST) Received: (qmail 85478 invoked by uid 500); 2 Sep 2016 12:23:21 -0000 Mailing-List: contact issues-help@flink.apache.org; run by ezmlm Precedence: bulk List-Help: List-Unsubscribe: List-Post: List-Id: Reply-To: dev@flink.apache.org Delivered-To: mailing list issues@flink.apache.org Received: (qmail 85455 invoked by uid 99); 2 Sep 2016 12:23:21 -0000 Received: from arcas.apache.org (HELO arcas) (140.211.11.28) by apache.org (qpsmtpd/0.29) with ESMTP; Fri, 02 Sep 2016 12:23:21 +0000 Received: from arcas.apache.org (localhost [127.0.0.1]) by arcas (Postfix) with ESMTP id EBA292C1B80 for ; Fri, 2 Sep 2016 12:23:20 +0000 (UTC) Date: Fri, 2 Sep 2016 12:23:20 +0000 (UTC) From: "ASF GitHub Bot (JIRA)" To: issues@flink.apache.org Message-ID: In-Reply-To: References: Subject: [jira] [Commented] (FLINK-4535) ResourceManager registration with TaskExecutor MIME-Version: 1.0 Content-Type: text/plain; charset=utf-8 Content-Transfer-Encoding: 7bit X-JIRA-FingerPrint: 30527f35849b9dde25b450d4833f0394 archived-at: Fri, 02 Sep 2016 12:23:22 -0000 [ https://issues.apache.org/jira/browse/FLINK-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458376#comment-15458376 ] ASF GitHub Bot commented on FLINK-4535: --------------------------------------- Github user tillrohrmann commented on a diff in the pull request: https://github.com/apache/flink/pull/2451#discussion_r77337149 --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/rpc/resourcemanager/ResourceManager.java --- @@ -53,14 +56,23 @@ */ public class ResourceManager extends RpcEndpoint { private final Map jobMasterGateways; + + /** ResourceID and TaskExecutorGateway mapping relationship of registered taskExecutors */ + private final Map startedTaskExecutorGateways; + + /** TaskExecutorGateway and InstanceId mapping relationship of registered taskExecutors */ + private final Map taskExecutorGateways; --- End diff -- Wouldn't it make sense to group the `TaskExecutorGateway` and the `InstanceID` into a `TaskExecutorRegistration` class which is stored under the resource ID? Then we would get rid of a lookup when accessing the `InstanceID` given the resource ID. > ResourceManager registration with TaskExecutor > ---------------------------------------------- > > Key: FLINK-4535 > URL: https://issues.apache.org/jira/browse/FLINK-4535 > Project: Flink > Issue Type: Sub-task > Components: Cluster Management > Reporter: zhangjing > Assignee: zhangjing > > When TaskExecutor register at ResourceManager, it takes the following 3 input parameters: > 1. resourceManagerLeaderId: the fencing token for the ResourceManager leader which is kept by taskExecutor who send the registration > 2. taskExecutorAddress: the address of taskExecutor > 3. resourceID: The resource ID of the TaskExecutor that registers > ResourceManager need to process the registration event based on the following steps: > 1. Check whether input resourceManagerLeaderId is as same as the current leadershipSessionId of resourceManager. If not, it means that maybe two or more resourceManager exists at the same time, and current resourceManager is not the proper rm. so it rejects or ignores the registration. > 2. Check whether exists a valid taskExecutor at the giving address by connecting to the address. Reject the registration from invalid address. > 3. Check whether it is a duplicate registration by input resourceId, reject the registration > 4. Keep resourceID and taskExecutorGateway mapping relationships, And optionally keep resourceID and container mapping relationships in yarn mode. > 5. Create the connection between resourceManager and taskExecutor, and ensure its healthy based on heartbeat rpc calls between rm and tm ? > 6. Send registration successful ack to the taskExecutor. > Discussion: > Maybe we need import errorCode or several registration decline subclass to distinguish the different causes of decline registration. -- This message was sent by Atlassian JIRA (v6.3.4#6332)