hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Daniel Templeton (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6031) Application recovery failed after disabling node label
Date Wed, 28 Dec 2016 16:36:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15783232#comment-15783232
] 

Daniel Templeton commented on YARN-6031:
----------------------------------------

I agree that {{-force-recovery}} could cause a significant information loss, but it's something
that the admin has to do explicitly, and it's only application information, so it's not the
end of the world.  With a {{-dump-application-information}} option, the admin has the choice
to either 1) look at each app that fails the recovery and decide whether to purge it or do
something else (like turn node labels back on), or 2) do a bulk purge with {{-force-recovery}}.

It might also be good to have another option, something like {{-dry-run-recovery}}, that would
tell the admin the IDs of all the applications that will fail during recovery so that she
doesn't have to keep doing them one at a time.  In fact, I could even see making that the
default behavior before failing the resource manager.

In any case, I don't think the approach proposed in this JIRA, to just ignore the failed app,
is going to work out.


> Application recovery failed after disabling node label
> ------------------------------------------------------
>
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>         Attachments: YARN-6031.001.patch
>
>
> Here is the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid
resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed due to that application had node label
expression specified while node label has been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org


Mime
View raw message