hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6031) Application recovery has failed when node label feature is turned off during RM recovery
Date Tue, 25 Jul 2017 06:57:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099594#comment-16099594 ]

Sunil G commented on YARN-6031:
-------------------------------

Hi [~jianhe],
A few doubts here:

bq. 1. Below code catches InvalidLabelResourceRequestException and assumes that the error is because node-label becomes disabled
This code snippet catches InvalidLabelResourceRequestException and suppresses it only
in the recovery case. If the AMResourceRequest was stored in the state store, then {{validateAndCreateResourceRequest}}
must have succeeded when the app was submitted. During recovery, the same call will throw this error only when
node labels are disabled by configuration. If the request is in the store, we can assume that the AM request is sane
enough. Could you please give more context on which other scenario could throw the same exception
during recovery?
On another note, when it is not recovery we {{throw e;}}, so the same exception is thrown back.
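
The behaviour described above can be sketched in isolation as follows. This is only an illustration, not the patch itself: the class, the {{Outcome}} enum, the {{validate}} helper and the unchecked exception type are hypothetical stand-ins for the RMAppManager code and for InvalidLabelResourceRequestException.

```java
class RecoveryValidationSketch {
    // Hypothetical stand-in for InvalidLabelResourceRequestException
    // (unchecked here only to keep the sketch short).
    static class InvalidLabelRequestException extends RuntimeException {
        InvalidLabelRequestException(String message) {
            super(message);
        }
    }

    enum Outcome { ACCEPTED, REJECTED_ON_RECOVERY }

    // Hypothetical validator: rejects a label expression when node labels are disabled.
    static void validate(String labelExpression, boolean nodeLabelsEnabled) {
        boolean hasLabel = labelExpression != null && !labelExpression.isEmpty();
        if (hasLabel && !nodeLabelsEnabled) {
            throw new InvalidLabelRequestException(
                "node label not enabled but request contains label expression");
        }
    }

    // Suppress the exception only during recovery; on submission rethrow it unchanged.
    static Outcome checkAmRequest(String labelExpression,
                                  boolean nodeLabelsEnabled,
                                  boolean isRecovery) {
        try {
            validate(labelExpression, nodeLabelsEnabled);
            return Outcome.ACCEPTED;
        } catch (InvalidLabelRequestException e) {
            if (isRecovery) {
                return Outcome.REJECTED_ON_RECOVERY;
            }
            throw e;
        }
    }

    public static void main(String[] args) {
        // Labels disabled after the app was stored: recovery does not propagate the error.
        System.out.println(checkAmRequest("gpu", false, true));
        // Labels still enabled: the stored request validates as it did at submission.
        System.out.println(checkAmRequest("gpu", true, true));
    }
}
```

In this sketch a fresh submission with the same request and labels disabled still fails with the original exception, which matches the {{throw e;}} path.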

bq. 2. Below code directly transitions app to failed by using a Rejected event. The attempt state is not moved to failed
In RMAppManager#createAndPopulateNewRMApp, the app is just created, whether it is in submission or recovery
mode. The attempt is not yet created. Hence I think this won't be a problem.

bq.3. Is it ok to let the app continue in this scenario, it's less disruptive to the apps.
Currently the exception was thrown and the RM was losing the context of such an app. To record and
track such an app, we create the app and move it to the failed state. Hence recovery of the other apps
will also continue, and we will have the context of this app as well.
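
A rough sketch of that idea, assuming a simplified recovery loop: an app whose stored request carries a label expression is still recorded, but in a failed state, and the loop moves on to the next app. All names here ({{RecoveryLoopSketch}}, {{StoredApp}}, {{AppState}}, {{recoverAll}}) are hypothetical and do not correspond to the real RM classes.

```java
import java.util.Arrays;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

class RecoveryLoopSketch {
    enum AppState { RECOVERED, FAILED }

    // Hypothetical stored app: an id plus the label expression saved at submission.
    static final class StoredApp {
        final String id;
        final String labelExpression;

        StoredApp(String id, String labelExpression) {
            this.id = id;
            this.labelExpression = labelExpression;
        }
    }

    // Every stored app ends up tracked; an invalid one is marked FAILED
    // instead of aborting recovery for the apps that follow it.
    static Map<String, AppState> recoverAll(List<StoredApp> stored,
                                            boolean nodeLabelsEnabled) {
        Map<String, AppState> apps = new LinkedHashMap<>();
        for (StoredApp app : stored) {
            boolean hasLabel = app.labelExpression != null
                && !app.labelExpression.isEmpty();
            if (hasLabel && !nodeLabelsEnabled) {
                apps.put(app.id, AppState.FAILED);
            } else {
                apps.put(app.id, AppState.RECOVERED);
            }
        }
        return apps;
    }

    public static void main(String[] args) {
        List<StoredApp> stored = Arrays.asList(
            new StoredApp("app_1", "gpu"),
            new StoredApp("app_2", ""));
        System.out.println(recoverAll(stored, false));
    }
}
```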

> Application recovery has failed when node label feature is turned off during RM recovery
> ----------------------------------------------------------------------------------------
>
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>             Fix For: 2.9.0, 3.0.0-alpha4, 2.8.2
>
>         Attachments: YARN-6031.001.patch, YARN-6031.002.patch, YARN-6031.003.patch, YARN-6031.004.patch, YARN-6031.005.patch, YARN-6031.006.patch, YARN-6031.007.patch, YARN-6031-branch-2.8.001.patch
>
>
> Here are the repro steps:
> Enable node labels, restart the RM, configure CS properly, and run some jobs;
> Disable node labels, restart the RM, and the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a node label expression specified while node labels had been disabled.



--
This message was sent by Atlassian JIRA
(v6.4.14#64029)


