hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-5333) Some recovered apps are put into default queue when RM HA
Date Thu, 21 Jul 2016 11:47:20 GMT

    [ https://issues.apache.org/jira/browse/YARN-5333?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15387552#comment-15387552 ]

Sunil G edited comment on YARN-5333 at 7/21/16 11:47 AM:
---------------------------------------------------------

Thanks [~hex108]
Yes, we recover apps first (by calling startActiveServices) and only then call
refreshQueues from {{AdminService#transitionToActive}}. So apps in a newly added queue will
fail during recovery.

bq.when capacity-scheduler.xml is corrupted, running {{refreshQueues }} will just fail
With your patch, if {{refreshQueues}} raises an exception, perhaps because of a corrupted conf file,
then the RMs will toggle. YARN-3893 fixed this, and there I made a suggestion similar to this patch
(I suggested refreshAll). Please refer to my [comment|https://issues.apache.org/jira/browse/YARN-3893?focusedCommentId=14703329&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14703329].
[~rohithsharma] helped to point out a possible [problem|https://issues.apache.org/jira/browse/YARN-3893?focusedCommentId=14708470&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14708470]
with this approach. 

I agree that it's a problem in CS, given that we are using a normal conf file. So if we could handle
the exception from {{refreshQueues}}, which can be called prior to {{rm.transitionToActive()}},
and *fail fast directly*, then we can manage both issues. [~rohithsharma], [~jianhe]
Thoughts?
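
To make the ordering proposed above concrete, here is a minimal sketch. It is only a toy model: {{FakeRm}} and all of its methods are hypothetical stand-ins written for illustration, not the real ResourceManager or AdminService code. It assumes that a corrupted conf file surfaces as an exception from the refresh step, and it shows the refresh running before the transition so that a failure stops the transition before any apps are recovered.

```java
import java.util.ArrayList;
import java.util.List;

public class TransitionOrderSketch {

    /** Hypothetical stand-in for the RM; this is not the real ResourceManager API. */
    static class FakeRm {
        private final boolean confCorrupted;
        final List<String> events = new ArrayList<>();

        FakeRm(boolean confCorrupted) {
            this.confCorrupted = confCorrupted;
        }

        /** Reloads queue definitions; throws if the conf file cannot be parsed. */
        void refreshQueues() {
            if (confCorrupted) {
                throw new IllegalStateException("corrupted scheduler conf");
            }
            events.add("refreshQueues");
        }

        /** Starts active services, which is where app recovery happens. */
        void transitionToActive() {
            events.add("recoverApps");
        }
    }

    /**
     * Refresh first, then transition. Returns false (fail fast) when the
     * refresh throws, so recovery never runs against a stale or broken conf.
     */
    static boolean transitionToActive(FakeRm rm) {
        try {
            rm.refreshQueues();
        } catch (IllegalStateException e) {
            rm.events.add("failFast");
            return false;
        }
        rm.transitionToActive();
        return true;
    }

    public static void main(String[] args) {
        FakeRm healthy = new FakeRm(false);
        System.out.println(transitionToActive(healthy) + " " + healthy.events);

        FakeRm broken = new FakeRm(true);
        System.out.println(transitionToActive(broken) + " " + broken.events);
    }
}
```

In this sketch the healthy RM refreshes its queues before recovering apps, while the RM with a corrupted conf records a fail-fast event and never reaches recovery. How the real RM should react after failing fast is exactly the open question in this thread.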


was (Author: sunilg):
Thanks [~hex108]
Yes, we recover apps first (by calling startActiveServices) and only then call
refreshQueues from {{AdminService#transitionToActive}}. So apps in a newly added queue will
fail during recovery.

bq.when capacity-scheduler.xml is corrupted, running {{refreshQueues }} will just fail
If {{refreshQueues}} is not called, the RMs will toggle. YARN-3893 fixed this, and there I
made a suggestion similar to this patch (I suggested refreshAll). Please refer to
my [comment|https://issues.apache.org/jira/browse/YARN-3893?focusedCommentId=14703329&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14703329].
[~rohithsharma] helped to point out a possible [problem|https://issues.apache.org/jira/browse/YARN-3893?focusedCommentId=14708470&page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel#comment-14708470]
with this approach. 

I agree that it's a problem in CS, given that we are using a normal conf file. So if we could handle
the exception from {{refreshQueues}}, which can be called prior to {{rm.transitionToActive()}},
and *fail fast directly*, then we can manage both issues. [~rohithsharma], [~jianhe]
Thoughts?

> Some recovered apps are put into default queue when RM HA
> ---------------------------------------------------------
>
>                 Key: YARN-5333
>                 URL: https://issues.apache.org/jira/browse/YARN-5333
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Jun Gong
>            Assignee: Jun Gong
>         Attachments: YARN-5333.01.patch, YARN-5333.02.patch, YARN-5333.03.patch
>
>
> Enable RM HA and use FairScheduler, with {{yarn.scheduler.fair.allow-undeclared-pools}}
> set to false and {{yarn.scheduler.fair.user-as-default-queue}} set to false.
> Reproduce steps:
> 1. Start two RMs.
> 2. After the RMs are running, change {{etc/hadoop/fair-scheduler.xml}} on both RMs to
> add some queues.
> 3. Submit some apps to the newly added queues.
> 4. Stop the active RM; the standby RM will then transition to active and recover apps.
> However, the new active RM will put recovered apps into the default queue because it might
> not have loaded the new {{fair-scheduler.xml}}. We need to call {{initScheduler}} before starting
> active services, or bring {{refreshAll()}} in front of {{rm.transitionToActive()}}. *It seems
> this is also important for other schedulers*.
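
The effect described in the report can be illustrated with a small toy model. This is a sketch under stated assumptions, not FairScheduler code: {{place}} and {{recover}} are hypothetical helpers, and the fallback to {{default}} for an unknown queue simply mirrors the behaviour reported above.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.Set;

public class StaleQueueRecoverySketch {

    /** Hypothetical placement rule: an unknown queue falls back to "default". */
    static String place(String requestedQueue, Set<String> knownQueues) {
        return knownQueues.contains(requestedQueue) ? requestedQueue : "default";
    }

    /** Recovers each saved app (appId -> requested queue) against the queues the RM has loaded. */
    static Map<String, String> recover(Map<String, String> savedApps, Set<String> knownQueues) {
        Map<String, String> placed = new LinkedHashMap<>();
        for (Map.Entry<String, String> app : savedApps.entrySet()) {
            placed.put(app.getKey(), place(app.getValue(), knownQueues));
        }
        return placed;
    }

    public static void main(String[] args) {
        // App submitted to a queue that was added after both RMs started.
        Map<String, String> savedApps = new LinkedHashMap<>();
        savedApps.put("app_1", "newQueue");

        // Standby RM that never reloaded the allocation file.
        Set<String> stale = new HashSet<>(Arrays.asList("default"));
        // Same RM after a refresh that runs before recovery.
        Set<String> refreshed = new HashSet<>(Arrays.asList("default", "newQueue"));

        System.out.println("recovered with stale queues:     " + recover(savedApps, stale));
        System.out.println("recovered with refreshed queues: " + recover(savedApps, refreshed));
    }
}
```

With the stale queue set the app lands in {{default}}; with the refreshed set it stays in the queue it was submitted to, which is why the order of refresh and recovery matters.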



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

