hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Subru Krishnan (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3673) Create a FailoverProxy for Federation services
Date Thu, 18 Aug 2016 18:59:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-3673?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15426990#comment-15426990

Subru Krishnan commented on YARN-3673:

Thanks [~jianhe] for reviewing the patch and your detailed feedback.

 bq. I think for this patch the FEDERATION_FAILOVER_ENABLED and FEDERATION_ENABLED flag are
not needed

There is a nuance here that unlike the other failover providers which are only initialized
when RM HA is enabled,{{FederationRMFailoverProxyProvider}} is used with both standalone RM
and RM HA as the main purpose is to bypass _yarn-site_ and obtain the RM connection address
from {{FederationStateStore}}. 
The FEDERATION_ENABLED flag indicates that irrespective of whether RM has HA enabled or not,
we should configure the {{FederationRMFailoverProxyProvider}}.
When we have both Federation & RM HA, the FEDERATION_FAILOVER_ENABLED indicates that we
should configure failover {{RetryPolicies}} as otherwise normal ones (like _retryForeverWithFixedSleep_)
would be initialized. This is needed because we override HA enabled in conf as otherwise that
triggers the code path of looking up RMs from xml using rmId index.

bq. Why do we need to store the tokens and then re-add the tokens?

This is to cover a corner case scenario which we hit during testing. IIRC, it was required
for the UnmanagedAMs (used for transparently spanning a job across clusters) to reconnect
to RM post failover. Without this, I was getting authentication error with missing token from
the RM primary RM.
Do you think I should leave it or add it when we reach the e2e testing phase?

bq. It appears to me the FederationStateStoreFacade is not needed for this patch

Since we are using in-memory store for the test case, you are right. The issue is with the
SQL based store (YARN-3663) which is what we want to deploy in production. We did some simulated
scale testing with 10s of thousands of nodes and there was performance degradation without
connection pooling. For pooling, you need Singleton as otherwise every proxy instance would
create its own private pool and that's what {{FederationStateStoreFacade}} provides.

bq. Test: IIUC, I don't actually see FederationRMFailoverProxyProvider configured in the test,
is it tested?

Yes! In the test, I create a proxy using {{FederationProxyProviderUtil}}:
ApplicationClientProtocol client = FederationProxyProviderUtil
	        .createRMProxy(conf, ApplicationClientProtocol.class, subClusterId,

And {{FederationProxyProviderUtil}} always sets {{FederationRMFailoverProxyProvider}} as the
failover provider:
// updating the conf with the refreshed RM addresses as proxy creations
	  // are based out of conf
	  private static void updateConf(Configuration conf,
	      SubClusterId subClusterId) {
	    conf.set(YarnConfiguration.FEDERATION_SUBCLUSTER_ID, subClusterId.getId());
	    conf.setBoolean(YarnConfiguration.FEDERATION_ENABLED, true);
	        FederationRMFailoverProxyProvider.class, RMFailoverProxyProvider.class);
	    // we will failover using FederationStateStore so set that and skip
	    // traditional HA
	    if (HAUtil.isHAEnabled(conf)) {
	      conf.setBoolean(YarnConfiguration.FEDERATION_FAILOVER_ENABLED, true);
	      conf.setBoolean(YarnConfiguration.RM_HA_ENABLED, false);


And I had to do multiple rounds of debugging and step through {{FederationRMFailoverProxyProvider}}
so many times while writing the tests, so most definitely it's being exercised :).


> Create a FailoverProxy for Federation services
> ----------------------------------------------
>                 Key: YARN-3673
>                 URL: https://issues.apache.org/jira/browse/YARN-3673
>             Project: Hadoop YARN
>          Issue Type: Sub-task
>          Components: nodemanager, resourcemanager
>            Reporter: Subru Krishnan
>            Assignee: Subru Krishnan
>         Attachments: YARN-3673-YARN-2915-v1.patch
> This JIRA proposes creating a failover proxy for Federation based on the cluster membership
information in the StateStore that can be used by both Router & AMRMProxy

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

View raw message