hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Jun Gong (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-5333) Some recovered apps are put into default queue when RM HA
Date Tue, 02 Aug 2016 10:34:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15403741#comment-15403741

Jun Gong commented on YARN-5333:

Thanks [~rohithsharma], [~jianhe] for the review and comments!

bq. 1. Should private boolean isTransitingToActive = false; is volatile?
Yes, it needs be volatile. I'll update it.

2. Since none of the refreshXXX methods are synchronized, patch introduces a concurrency issue.
If there is an explicit admin call for refreshing at the time of transitionToActive, then
checkRMStatus will be executed for other admin calls. Until RM transition-to-active completely,
explicit admin commands should not allowed to refresh. I think, we should incorporate similar
to refreshAdminAcl method.
How about adding {{synchronized}} to each refresh functions? It avoids adding more logic.
When admin command comes, we could just call corresponding refresh functions. I think it does
not matter to call refresh function many times.

bq. 3. I think flag checkRMHAState can be passed to method checkRMStatus.
I was thinking it. If adding checkRMHAState to checkRMStatus, we need add this parameter(checkRMHAState)
to all refresh functions too(which is similar to refreshAdminAcl), there are a lot of places
that call refresh functions. It might be better to just add a check before checkRMStatus?

bq. I think if you can simulate test for generally instead of specific to fair scheduler,
this test can be moved to class TestRMHA. There is already test TestRMHA#testTransitionedToActiveRefreshFail,
probable the same test can be changed?
Thanks. I'll update the test case.

Instead of reusing the existing refreshAll method, I checked each refresh method, it should
be cleaner to just create a new method which includes all necessary reconfig steps. This also
avoids unnecessary audit logs, acl checks.
Yes, it will be more clear to add a new method to include all reconfig steps. My doubt is
that there will be two places that do similar reconfig things(the one is in refresh functions,
the other is in the new added method). Then we need to modify both places if there is some
change for one of them.

> Some recovered apps are put into default queue when RM HA
> ---------------------------------------------------------
>                 Key: YARN-5333
>                 URL: https://issues.apache.org/jira/browse/YARN-5333
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-5333.01.patch, YARN-5333.02.patch, YARN-5333.03.patch, YARN-5333.04.patch,
> Enable RM HA and use FairScheduler, {{yarn.scheduler.fair.allow-undeclared-pools}} is
set to false, {{yarn.scheduler.fair.user-as-default-queue}} is set to false.
> Reproduce steps:
> 1. Start two RMs.
> 2. After RMs are running, change both RM's file {{etc/hadoop/fair-scheduler.xml}}, then
add some queues.
> 3. Submit some apps to the new added queues.
> 4. Stop the active RM, then the standby RM will transit to active and recover apps.
> However the new active RM will put recovered apps into default queue because it might
have not loaded the new {{fair-scheduler.xml}}. We need call {{initScheduler}} before start
active services or bring {{refreshAll()}} in front of {{rm.transitionToActive()}}. *It seems
it is also important for other scheduler*.

This message was sent by Atlassian JIRA

To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

View raw message