flink-issues mailing list archives

From "ASF GitHub Bot (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (FLINK-4535) ResourceManager registration with TaskExecutor
Date Fri, 02 Sep 2016 12:23:20 GMT

    [ https://issues.apache.org/jira/browse/FLINK-4535?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15458376#comment-15458376 ]

ASF GitHub Bot commented on FLINK-4535:

Github user tillrohrmann commented on a diff in the pull request:

    --- Diff: flink-runtime/src/main/java/org/apache/flink/runtime/rpc/resourcemanager/ResourceManager.java
    @@ -53,14 +56,23 @@
     public class ResourceManager extends RpcEndpoint<ResourceManagerGateway> {
     	private final Map<JobMasterGateway, InstanceID> jobMasterGateways;
    +	/** ResourceID and TaskExecutorGateway mapping relationship of registered taskExecutors */
    +	private final Map<ResourceID, TaskExecutorGateway>  startedTaskExecutorGateways;
    +	/** TaskExecutorGateway and InstanceId mapping relationship of registered taskExecutors */
    +	private final Map<TaskExecutorGateway, InstanceID> taskExecutorGateways;
    --- End diff --
    Wouldn't it make sense to group the `TaskExecutorGateway` and the `InstanceID` into a
`TaskExecutorRegistration` class which is stored under the resource ID? Then we would get
rid of a lookup when accessing the `InstanceID` given the resource ID.
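
A minimal sketch of the grouping suggested above, not the code from the pull request. `ResourceID`, `InstanceID` and `TaskExecutorGateway` are simplified stand-ins for the Flink types of the same names, and `TaskExecutorRegistry` with its `register`/`get` methods is a hypothetical holder for the single map:

```java
import java.util.HashMap;
import java.util.Map;

final class RegistrationSketch {

    /** Simplified stand-in for Flink's ResourceID (assumption: value equality on a string id). */
    static final class ResourceID {
        private final String id;
        ResourceID(String id) { this.id = id; }
        @Override public boolean equals(Object o) {
            return o instanceof ResourceID && ((ResourceID) o).id.equals(id);
        }
        @Override public int hashCode() { return id.hashCode(); }
    }

    /** Simplified stand-in for Flink's InstanceID. */
    static final class InstanceID {
        private final String id;
        InstanceID(String id) { this.id = id; }
    }

    /** Stand-in for the TaskExecutorGateway RPC interface; methods omitted. */
    interface TaskExecutorGateway {}

    /** Groups the gateway and the instance ID of one registered TaskExecutor. */
    static final class TaskExecutorRegistration {
        private final TaskExecutorGateway gateway;
        private final InstanceID instanceID;

        TaskExecutorRegistration(TaskExecutorGateway gateway, InstanceID instanceID) {
            this.gateway = gateway;
            this.instanceID = instanceID;
        }

        TaskExecutorGateway getGateway() { return gateway; }
        InstanceID getInstanceID() { return instanceID; }
    }

    /** Hypothetical holder: one map keyed by resource ID replaces the two maps in the diff. */
    static final class TaskExecutorRegistry {
        private final Map<ResourceID, TaskExecutorRegistration> taskExecutors = new HashMap<>();

        /** Returns false if the resource ID is already registered. */
        boolean register(ResourceID resourceID, TaskExecutorRegistration registration) {
            return taskExecutors.putIfAbsent(resourceID, registration) == null;
        }

        /** Single lookup yields both the gateway and the instance ID; null if unknown. */
        TaskExecutorRegistration get(ResourceID resourceID) {
            return taskExecutors.get(resourceID);
        }
    }
}
```

With this layout, `registry.get(resourceID).getInstanceID()` needs one map access instead of a lookup in each of the two maps.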

> ResourceManager registration with TaskExecutor
> ----------------------------------------------
>                 Key: FLINK-4535
>                 URL: https://issues.apache.org/jira/browse/FLINK-4535
>             Project: Flink
>          Issue Type: Sub-task
>          Components: Cluster Management
>            Reporter: zhangjing
>            Assignee: zhangjing
> When a TaskExecutor registers at the ResourceManager, it passes the following 3 input parameters:
> 1. resourceManagerLeaderId: the fencing token for the ResourceManager leader, as held by
> the TaskExecutor that sends the registration
> 2. taskExecutorAddress: the address of the TaskExecutor
> 3. resourceID: the resource ID of the TaskExecutor that registers
> The ResourceManager needs to process the registration event in the following steps:
> 1. Check whether the input resourceManagerLeaderId is the same as the current leadershipSessionId
> of the ResourceManager. If not, two or more ResourceManagers may exist at the same time
> and the current ResourceManager is not the proper one, so it rejects or ignores the
> registration.
> 2. Check whether a valid TaskExecutor exists at the given address by connecting to the
> address. Reject the registration if the address is invalid.
> 3. Check by the input resourceID whether this is a duplicate registration; if so, reject
> the registration.
> 4. Keep the resourceID to taskExecutorGateway mapping and, optionally, the resourceID
> to container mapping in YARN mode.
> 5. Create the connection between the ResourceManager and the TaskExecutor, and ensure it is
> healthy, based on heartbeat RPC calls between RM and TM?
> 6. Send a registration-successful ack to the TaskExecutor.
> Discussion:
> Maybe we need to introduce an error code or several registration-decline subclasses to
> distinguish the different causes of a declined registration.
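
The registration steps and the decline causes raised in the discussion can be sketched roughly as follows. This is an illustration under stated assumptions, not Flink's implementation: the call is synchronous rather than an RPC, the heartbeat of step 5 is omitted, and `DeclineReason`, `RegistrationResponse`, `GatewayConnector` and `ResourceManagerSketch` are hypothetical names. An enum is used here in place of the error code or decline subclasses that the issue leaves open.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.UUID;

final class RegistrationFlowSketch {

    /** Hypothetical decline causes, one per check in steps 1 to 3. */
    enum DeclineReason { NOT_LEADER, UNREACHABLE_ADDRESS, DUPLICATE_RESOURCE_ID }

    /** Outcome of a registration attempt; declineReason is null on success. */
    static final class RegistrationResponse {
        private final DeclineReason declineReason;

        private RegistrationResponse(DeclineReason declineReason) {
            this.declineReason = declineReason;
        }

        static RegistrationResponse success() { return new RegistrationResponse(null); }
        static RegistrationResponse decline(DeclineReason reason) { return new RegistrationResponse(reason); }

        boolean isSuccess() { return declineReason == null; }
        DeclineReason getDeclineReason() { return declineReason; }
    }

    /** Stand-in for connecting to a TaskExecutor address (step 2). */
    interface GatewayConnector {
        /** Returns a gateway handle, or null if no TaskExecutor answers at the address. */
        Object connect(String taskExecutorAddress);
    }

    static final class ResourceManagerSketch {
        private final UUID leaderSessionId;
        private final GatewayConnector connector;
        private final Map<String, Object> gatewaysByResourceId = new HashMap<>();

        ResourceManagerSketch(UUID leaderSessionId, GatewayConnector connector) {
            this.leaderSessionId = leaderSessionId;
            this.connector = connector;
        }

        RegistrationResponse registerTaskExecutor(
                UUID resourceManagerLeaderId, String taskExecutorAddress, String resourceId) {
            // Step 1: the fencing token must match the current leader session ID.
            if (!leaderSessionId.equals(resourceManagerLeaderId)) {
                return RegistrationResponse.decline(DeclineReason.NOT_LEADER);
            }
            // Step 2: a TaskExecutor must be reachable at the given address.
            Object gateway = connector.connect(taskExecutorAddress);
            if (gateway == null) {
                return RegistrationResponse.decline(DeclineReason.UNREACHABLE_ADDRESS);
            }
            // Step 3: reject duplicate registrations for the same resource ID.
            if (gatewaysByResourceId.containsKey(resourceId)) {
                return RegistrationResponse.decline(DeclineReason.DUPLICATE_RESOURCE_ID);
            }
            // Step 4: keep the resource ID to gateway mapping.
            gatewaysByResourceId.put(resourceId, gateway);
            // Step 5 (heartbeat monitoring) is left out of this sketch.
            // Step 6: acknowledge the successful registration.
            return RegistrationResponse.success();
        }
    }
}
```

A caller can then branch on `getDeclineReason()` instead of parsing a message, which is the distinction the discussion asks for.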

This message was sent by Atlassian JIRA
