hadoop-yarn-issues mailing list archives

From "Ying Zhang (JIRA)" <j...@apache.org>
Subject [jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label
Date Wed, 28 Dec 2016 10:22:58 GMT

    [ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782566#comment-15782566 ]

Ying Zhang edited comment on YARN-6031 at 12/28/16 10:22 AM:
-------------------------------------------------------------

Uploaded a patch based on [~leftnoteasy]'s comment on YARN-4465: swallow the
InvalidResourceRequestException when recovering, fail the recovery only for this application
and print an error message, then let the rest of the recovery continue.

[~sunilg], your suggestion also makes sense to me. The code change for your approach would
be made in the same place as in this patch, with a small modification: in recover(), inside
the for loop, if the conditions are met, skip calling "recoverApplication" and log a message
like "skip recover application ..." instead. The difference is that with your approach we
always check for these conditions, even though this is not the normal case, while with the
approach in the patch we only need to react when the exception happens. I'm OK with either
approach since the overhead is small.
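
To make the comparison concrete, here is a minimal, self-contained sketch of the two loop
shapes. It is not the patch itself: StoredApp, SketchInvalidRequestException and both
recover* methods are hypothetical stand-ins, and recoverApplication() only simulates the
failure reported in this issue.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-in for InvalidResourceRequestException (a checked exception).
class SketchInvalidRequestException extends Exception {
  SketchInvalidRequestException(String message) {
    super(message);
  }
}

// Hypothetical stand-in for an application loaded from the RM state store.
class StoredApp {
  final String id;
  final String labelExpression; // null: no node label expression was specified

  StoredApp(String id, String labelExpression) {
    this.id = id;
    this.labelExpression = labelExpression;
  }
}

class RecoverySketch {
  // Simulates recoverApplication(): fails for a labelled app when labels are disabled.
  static void recoverApplication(StoredApp app, boolean labelsEnabled)
      throws SketchInvalidRequestException {
    if (!labelsEnabled && app.labelExpression != null) {
      throw new SketchInvalidRequestException(
          "node label not enabled but request contains label expression");
    }
  }

  // Shape of the patch: react only when the exception happens, then keep going.
  static List<String> recoverCatching(List<StoredApp> apps, boolean labelsEnabled) {
    List<String> recovered = new ArrayList<>();
    for (StoredApp app : apps) {
      try {
        recoverApplication(app, labelsEnabled);
        recovered.add(app.id);
      } catch (SketchInvalidRequestException e) {
        System.err.println("Failed to recover application " + app.id + ": " + e.getMessage());
      }
    }
    return recovered;
  }

  // Shape of the alternative: always check the conditions first and skip the app.
  static List<String> recoverSkipping(List<StoredApp> apps, boolean labelsEnabled) {
    List<String> recovered = new ArrayList<>();
    for (StoredApp app : apps) {
      if (!labelsEnabled && app.labelExpression != null) {
        System.err.println("skip recover application " + app.id);
        continue;
      }
      try {
        recoverApplication(app, labelsEnabled);
        recovered.add(app.id);
      } catch (SketchInvalidRequestException e) {
        // Any other invalid request still aborts the whole recovery in this variant.
        throw new IllegalStateException("Recovery failed for " + app.id, e);
      }
    }
    return recovered;
  }

  public static void main(String[] args) {
    List<StoredApp> apps = Arrays.asList(
        new StoredApp("app_1", null),
        new StoredApp("app_2", "labelA"),
        new StoredApp("app_3", null));
    System.out.println(recoverCatching(apps, false));
    System.out.println(recoverSkipping(apps, false));
  }
}
```

In this sketch both variants recover the same set of applications when node labels are
disabled; they differ only in whether the check runs for every application or the loop
reacts to the exception.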

Let's see what others think:-) [~leftnoteasy], [~bibinchundatt]

Just want to clarify the current behavior (with or without this fix): an application submitted
with a node label expression explicitly specified will fail during recovery, while an application
submitted without a node label expression will succeed, whether or not there is a default
node label expression for the target queue. This is because, in the following code snippet,
the call to "checkQueueLabelInLabelManager", which checks whether the node label exists in
the node label manager (the node label manager has no labels at all when node labels are
disabled), is skipped during recovery:
{code:title=SchedulerUtils.java|borderStyle=solid}
  public static void normalizeAndValidateRequest(ResourceRequest resReq,
      Resource maximumResource, String queueName, YarnScheduler scheduler,
      boolean isRecovery, RMContext rmContext, QueueInfo queueInfo)
      throws InvalidResourceRequestException {
    ... ...

    SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo);
    if (!isRecovery) {
      // calls checkQueueLabelInLabelManager
      validateResourceRequest(resReq, maximumResource, queueInfo, rmContext);
    }
{code}

This is not exactly the same as what happens when a job is submitted in the normal case (i.e.,
not during recovery). In the normal case, when a default node label expression is defined
for the queue and node labels are disabled, the application is also rejected due to an invalid
resource request, even if it doesn't specify a node label expression. I believe this will be
fixed once YARN-4652 is addressed.
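
The cases described above can be modelled with a small, simplified sketch. This is not the
real SchedulerUtils code: ValidationSketch and all of its members are hypothetical, null
stands for "no label expression", and the first check is an assumption about the elided
part of the method, inferred from the exception message in the issue description.

```java
import java.util.Collections;
import java.util.Set;

// Hypothetical, simplified model of the flow discussed above; not the real SchedulerUtils.
class ValidationSketch {
  static class InvalidRequest extends Exception {
    InvalidRequest(String message) {
      super(message);
    }
  }

  // Returns the effective label expression, or throws if the request is rejected.
  static String normalizeAndValidate(String requestLabel, String queueDefaultLabel,
      boolean labelsEnabled, Set<String> knownLabels, boolean isRecovery)
      throws InvalidRequest {
    // Assumed check in the elided part: it runs whether or not we are recovering.
    if (!labelsEnabled && requestLabel != null) {
      throw new InvalidRequest(
          "node label not enabled but request contains label expression");
    }
    // Normalization: fall back to the queue's default label expression.
    String effective = (requestLabel != null) ? requestLabel : queueDefaultLabel;
    // The label-manager check is skipped during recovery.
    if (!isRecovery && effective != null && !knownLabels.contains(effective)) {
      throw new InvalidRequest("label " + effective + " is not known to the label manager");
    }
    return effective;
  }

  // Convenience wrapper: true if the request is accepted, false if it is rejected.
  static boolean accepted(String requestLabel, String queueDefaultLabel,
      boolean labelsEnabled, Set<String> knownLabels, boolean isRecovery) {
    try {
      normalizeAndValidate(requestLabel, queueDefaultLabel, labelsEnabled, knownLabels,
          isRecovery);
      return true;
    } catch (InvalidRequest e) {
      return false;
    }
  }

  public static void main(String[] args) {
    Set<String> none = Collections.emptySet(); // labels disabled: manager knows no labels
    System.out.println("explicit label, recovery: " + accepted("labelA", null, false, none, true));
    System.out.println("queue default, recovery:  " + accepted(null, "labelA", false, none, true));
    System.out.println("queue default, normal:    " + accepted(null, "labelA", false, none, false));
  }
}
```

Under these assumptions, only the second case is accepted: the same application that is
rejected on normal submission goes through during recovery, which is the inconsistency
noted above.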
 



> Application recovery failed after disabling node label
> ------------------------------------------------------
>
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>         Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node labels, restart RM, configure CS properly, and run some jobs;
> Disable node labels, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a node label
> expression specified while node labels had been disabled.



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

