hadoop-yarn-issues mailing list archives

From "Jian He (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-6031) Application recovery has failed when node label feature is turned off during RM recovery
Date Mon, 24 Jul 2017 22:08:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16099180#comment-16099180 ]

Jian He commented on YARN-6031:

I ran into this patch while debugging and have a few questions:
cc [~sunilg], [~Ying Zhang]
1. The code below catches InvalidLabelResourceRequestException and assumes the error occurred because
node labels have been disabled, but the same InvalidLabelResourceRequestException can be thrown
for other reasons too, right? In that case, the following logic becomes invalid.

    try {
      amReqs = validateAndCreateResourceRequest(submissionContext, isRecovery);
    } catch (InvalidLabelResourceRequestException e) {
      // This can happen if the application had been submitted and run
      // with Node Label enabled but recover with Node Label disabled.
      // Thus there might be node label expression in the application's
      // resource requests. If this is the case, create RmAppImpl with
      // null amReq and reject the application later with clear error
      // message. So that the application can still be tracked by RM
      // after recovery and user can see what's going on and react accordingly.
      if (isRecovery &&
          !YarnConfiguration.areNodeLabelsEnabled(this.conf)) {
        if (LOG.isDebugEnabled()) {
          LOG.debug("AMResourceRequest is not created for " + applicationId
              + ". NodeLabel is not enabled in cluster, but AM resource "
              + "request contains a label expression.");
        }
      } else {
        throw e;
      }
    }
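
To make the first question concrete, here is a minimal sketch of a stricter guard. It only illustrates the idea of confirming the cause instead of inferring it from the exception type. Everything in it (`LabelCheckSketch`, `Request`, `hasLabelExpression`, `canDeferRejection`) is a hypothetical stand-in, not YARN code, and `NO_LABEL` is assumed to be the empty string.

```java
import java.util.Arrays;
import java.util.List;

public class LabelCheckSketch {
    // Assumption: mirrors RMNodeLabelsManager.NO_LABEL, taken here to be "".
    static final String NO_LABEL = "";

    /** Hypothetical stand-in for a resource request with a label expression. */
    static final class Request {
        final String labelExpression;

        Request(String labelExpression) {
            this.labelExpression = labelExpression;
        }
    }

    /** True if at least one request carries a non-null, non-empty label expression. */
    static boolean hasLabelExpression(List<Request> requests) {
        for (Request r : requests) {
            if (r.labelExpression != null && !r.labelExpression.equals(NO_LABEL)) {
                return true;
            }
        }
        return false;
    }

    /**
     * Decides whether a label-related validation failure may be swallowed.
     * It returns true only when the app is being recovered, node labels are
     * disabled, and a request really contains a label expression. In every
     * other case the caller would rethrow the original exception.
     */
    static boolean canDeferRejection(boolean isRecovery,
                                     boolean nodeLabelsEnabled,
                                     List<Request> amRequests) {
        return isRecovery && !nodeLabelsEnabled && hasLabelExpression(amRequests);
    }

    public static void main(String[] args) {
        List<Request> labeled = Arrays.asList(new Request("gpu"));
        List<Request> unlabeled = Arrays.asList(new Request(NO_LABEL));
        System.out.println(canDeferRejection(true, false, labeled));
        System.out.println(canDeferRejection(true, false, unlabeled));
    }
}
```

With a guard of this shape, an exception raised for some other reason during recovery would still propagate rather than being treated as the "labels were turned off" case.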

2. The code below transitions the app directly to failed by using a Rejected event. The attempt state
is not moved to failed, so will it be stuck there?
      if (labelExp != null &&
          !labelExp.equals(RMNodeLabelsManager.NO_LABEL)) {
        String message = "Failed to recover application " + appId
            + ". NodeLabel is not enabled in cluster, but AM resource request "
            + "contains a label expression.";
            new RMAppEvent(appId, RMAppEventType.APP_REJECTED, message));

3. Is it OK to let the app continue in this scenario? That would be less disruptive to the apps. What's
the disadvantage if we let the app continue?

> Application recovery has failed when node label feature is turned off during RM recovery
> ----------------------------------------------------------------------------------------
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>             Fix For: 2.9.0, 3.0.0-alpha4, 2.8.2
>         Attachments: YARN-6031.001.patch, YARN-6031.002.patch, YARN-6031.003.patch, YARN-6031.004.patch, YARN-6031.005.patch, YARN-6031.006.patch, YARN-6031.007.patch, YARN-6031-branch-2.8.001.patch
> Here are the repro steps:
> Enable node labels, restart RM, configure CS properly, and run some jobs;
> Disable node labels, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a node label expression specified while node labels had been disabled.
