hadoop-yarn-issues mailing list archives

From "Botong Huang (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5531) UnmanagedAM pool manager for federating application across clusters
Date Tue, 16 May 2017 18:58:04 GMT

    [ https://issues.apache.org/jira/browse/YARN-5531?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16012936#comment-16012936 ]

Botong Huang commented on YARN-5531:
------------------------------------

Thanks [~kasha] for the detailed comments! I have addressed most of them in the v11 patch;
explanations for the rest are below:

* 1 & 3.3.3. The reason we put it here is that the Federation Interceptor (YARN-3666 and YARN-6511)
in the NM will be using UAM. Putting it in YARN Client would result in cyclic dependencies for
the NM project.

* 2.1-2 This is generalized from the Federation use case, where for one application we enforce
the same applicationId in all sub-clusters (RMs in different sub-clusters use different epochs,
so that their app IDs won't overlap). uamID (really the sub-cluster ID) is used to identify the
UAMs. In the v11 patch, I made the input attemptId optional. If it is not supplied, the UAM
will ask the RM for an appID first. In general, the attempt ID can be used as the uamID.

* 2.5.1 Parallel kill is necessary for performance reasons. In federation, the service stop
of the UAM pool is in the code path of the Federation Interceptor shutdown, potentially blocking the
application finish event in the NM where the AM is running. Furthermore, when we try to kill the
UAMs, the RM in some sub-clusters might be failing over, which takes several minutes to come back.
With sequential kill, those delays would add up.

* 2.5.5 For the reason above, I prefer not to retry here. One option is to throw the
exception past this stop call, so that the user can handle the exception and retry if needed. In the
Federation Interceptor's case, we can simply catch it, log it as a warning and move on. What do you think?

* 2.8.2 & 3.1 & 3.6.2 As discussed with [~subru] earlier, the UAM pool and UAM are
more of a library for the actual UAM. The interface the UAM pool exposes to the user is similar to
{{ApplicationMasterProtocol}} (registerAM, allocate and finishAM); the user is supposed to act
like an AM and heartbeat to us. So for {{finishApplicationMaster}}, we abide by the protocol:
if the UAM is still registered after the finishAM call, the user should retry.

* 3.3.1 & 3.3.4 The launch UAM code was indeed a bit messy; I've cleaned it up in
v11. I merged the two monitor methods, which might look a bit complex; I can revert if needed.

* 3.5.1 AsyncCallback works nicely here. I think a dispatcher can work as well, but I'd prefer
to do that in another JIRA if needed.

* 3.7.2-3 This is a corner use case for Federation. In the Federation Interceptor, we handle the
UAMs asynchronously. A UAM is created the first time the AM asks for resources from a certain
sub-cluster. The register, allocate and finish calls for the UAM are all triggered by heartbeats
from the AM. This means that all three calls are triggered asynchronously. For instance, while
the register call for the UAM is still pending (say because the UAM's RM is failing over and the
register call is blocked for five minutes), we need to allow the allocate calls to come in
without exception and buffer them. Once the register call eventually succeeds, we should be able to
move on from there.
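
To illustrate the reasoning in 2.5.1 and 2.5.5, here is a minimal sketch of a parallel kill that does not retry and instead reports failures back to the caller. It is not the code in the v11 patch: {{ParallelKillSketch}}, {{UamKiller}} and {{killAll}} are hypothetical names, and plain strings stand in for the uamIDs.

```java
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.concurrent.Callable;
import java.util.concurrent.ExecutionException;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;
import java.util.concurrent.Future;

final class ParallelKillSketch {

  /** Hypothetical stand-in for the call that kills one UAM in one sub-cluster. */
  interface UamKiller {
    void kill(String uamId) throws Exception;
  }

  /**
   * Submits one kill per UAM so that a slow sub-cluster (for example an RM
   * that is failing over) does not delay the others. No retry is attempted;
   * the returned map holds uamId -> exception for every kill that failed.
   */
  static Map<String, Exception> killAll(List<String> uamIds, UamKiller killer) {
    Map<String, Exception> failures = new LinkedHashMap<>();
    if (uamIds.isEmpty()) {
      return failures;
    }
    ExecutorService pool = Executors.newFixedThreadPool(uamIds.size());
    try {
      Map<String, Future<Void>> pending = new LinkedHashMap<>();
      for (String id : uamIds) {
        Callable<Void> task = () -> {
          killer.kill(id);
          return null;
        };
        pending.put(id, pool.submit(task));
      }
      // Wait for every kill to finish and record the ones that threw.
      for (Map.Entry<String, Future<Void>> entry : pending.entrySet()) {
        try {
          entry.getValue().get();
        } catch (ExecutionException e) {
          Throwable cause = e.getCause();
          failures.put(entry.getKey(), cause instanceof Exception ? (Exception) cause : e);
        } catch (InterruptedException e) {
          Thread.currentThread().interrupt();
          failures.put(entry.getKey(), e);
        }
      }
    } finally {
      pool.shutdownNow();
    }
    return failures;
  }
}
```

A caller such as the Federation Interceptor could log each entry of the returned map as a warning and move on, while another caller could choose to retry only the failed uamIDs.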


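The buffering behavior described in 3.7.2-3 could look roughly like the sketch below. It assumes a single UAM whose register call completes at some later point; {{BufferedUamSketch}} and its methods are hypothetical names, strings stand in for allocate requests, and the {{sent}} list stands in for forwarding a request to the sub-cluster RM.

```java
import java.util.ArrayDeque;
import java.util.ArrayList;
import java.util.List;
import java.util.Queue;

final class BufferedUamSketch {
  private final Queue<String> buffered = new ArrayDeque<>();
  private final List<String> sent = new ArrayList<>();
  private boolean registered = false;

  /** Accepts an allocate request at any time; it never fails just because register is pending. */
  synchronized void allocate(String request) {
    if (registered) {
      sent.add(request);      // stand-in for forwarding to the sub-cluster RM
    } else {
      buffered.add(request);  // hold the request until register completes
    }
  }

  /** Called once the (possibly long-delayed) register call has succeeded. */
  synchronized void onRegisterSuccess() {
    registered = true;
    // Flush buffered requests in arrival order before any new request is sent.
    while (!buffered.isEmpty()) {
      sent.add(buffered.poll());
    }
  }

  synchronized int bufferedCount() {
    return buffered.size();
  }

  synchronized List<String> sentRequests() {
    return new ArrayList<>(sent);
  }
}
```

Because both methods are synchronized, a request that arrives while the buffer is being flushed is sent only after the buffered ones, so arrival order is preserved.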

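For the {{finishApplicationMaster}} point in 2.8.2 & 3.1 & 3.6.2, a caller-side retry loop might be sketched as follows. The {{BooleanSupplier}} is an assumed stand-in for one finishAM call that returns true once the UAM is unregistered (in YARN this information is carried by {{FinishApplicationMasterResponse#getIsUnregistered()}}); {{FinishRetrySketch}}, the attempt limit and the sleep interval are hypothetical.

```java
import java.util.function.BooleanSupplier;

final class FinishRetrySketch {

  /**
   * Repeats the finish call until it reports that the UAM is unregistered.
   * Returns the number of attempts used, or -1 if the limit was reached or
   * the thread was interrupted while waiting.
   */
  static int finishWithRetry(BooleanSupplier finishCall, int maxAttempts, long sleepMillis) {
    for (int attempt = 1; attempt <= maxAttempts; attempt++) {
      if (finishCall.getAsBoolean()) {
        return attempt;
      }
      try {
        Thread.sleep(sleepMillis);  // wait before the next finish attempt
      } catch (InterruptedException e) {
        Thread.currentThread().interrupt();
        return -1;
      }
    }
    return -1;
  }
}
```

The retry lives in the caller, not in the UAM pool, which matches the idea that the user acts like an AM and drives the protocol.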


> UnmanagedAM pool manager for federating application across clusters
> -------------------------------------------------------------------
>
>                 Key: YARN-5531
>                 URL: https://issues.apache.org/jira/browse/YARN-5531
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Subru Krishnan
>            Assignee: Botong Huang
>         Attachments: YARN-5531-YARN-2915.v10.patch, YARN-5531-YARN-2915.v11.patch, YARN-5531-YARN-2915.v1.patch,
YARN-5531-YARN-2915.v2.patch, YARN-5531-YARN-2915.v3.patch, YARN-5531-YARN-2915.v4.patch,
YARN-5531-YARN-2915.v5.patch, YARN-5531-YARN-2915.v6.patch, YARN-5531-YARN-2915.v7.patch,
YARN-5531-YARN-2915.v8.patch, YARN-5531-YARN-2915.v9.patch
>
>
> One of the main tenets of YARN Federation is to *transparently* scale applications across
multiple clusters. This is achieved by running UAMs on behalf of the application on other
clusters. This JIRA tracks the addition of an UnmanagedAM pool manager for federating applications
across clusters, which will be used by the FederationInterceptor (YARN-3666), which is part of
the AMRMProxy pipeline introduced in YARN-2884.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

