Date: Wed, 28 Dec 2016 10:22:58 +0000 (UTC)
From: "Ying Zhang (JIRA)"
To: yarn-issues@hadoop.apache.org
Subject: [jira] [Comment Edited] (YARN-6031) Application recovery failed after disabling node label

    [ https://issues.apache.org/jira/browse/YARN-6031?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15782566#comment-15782566 ]

Ying Zhang edited comment on YARN-6031 at 12/28/16 10:22 AM:
-------------------------------------------------------------

Uploaded a patch, which is based on
[~leftnoteasy]'s comment on YARN-4465: swallow the InvalidResourceRequestException when recovering, fail the recovery only for this application and print an error message, then let the rest of the recovery continue.

[~sunilg], your suggestion also makes sense to me. With your approach, the code change would be made in the same place as in this patch, with a small modification: in recover(), inside the for loop, if the conditions are met, skip calling "recoverApplication" and log a message like "skip recover application ..." instead. The difference is that with this approach we always check for these conditions, even though it might not be a normal case, while with the approach in the patch we only need to react when the exception happens. I'm OK with either approach since the overhead is not that big. Let's see what others think :-)

[~leftnoteasy], [~bibinchundatt]
Just want to clarify. The current behavior (with or without this fix) is: an application submitted with a node label expression explicitly specified will fail during recovery, while an application submitted without a node label expression will succeed, regardless of whether there is a default node label expression for the target queue. This is due to the following code snippet: the call to "checkQueueLabelInLabelManager", which checks whether the node label exists in the node label manager (the node label manager has no labels at all when node labels are disabled), is skipped during recovery:
{code:title=SchedulerUtils.java|borderStyle=solid}
  public static void normalizeAndValidateRequest(ResourceRequest resReq,
      Resource maximumResource, String queueName, YarnScheduler scheduler,
      boolean isRecovery, RMContext rmContext, QueueInfo queueInfo)
      throws InvalidResourceRequestException {
    ... ...
    SchedulerUtils.normalizeNodeLabelExpressionInRequest(resReq, queueInfo);
    if (!isRecovery) {
      validateResourceRequest(resReq, maximumResource, queueInfo, rmContext);  // calling checkQueueLabelInLabelManager
    }
{code}
This is not exactly the same as what happens when submitting a job in the normal case (i.e., not during recovery). In the normal case, when a default node label expression is defined for the queue and node labels are disabled, the application will also get rejected due to an invalid resource request, even if it doesn't specify a node label expression. I believe this will get fixed once YARN-4652 is addressed.


> Application recovery failed after disabling node label
> ------------------------------------------------------
>
>                 Key: YARN-6031
>                 URL: https://issues.apache.org/jira/browse/YARN-6031
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.8.0
>            Reporter: Ying Zhang
>            Assignee: Ying Zhang
>            Priority: Minor
>         Attachments: YARN-6031.001.patch
>
>
> Here are the repro steps:
> Enable node label, restart RM, configure CS properly, and run some jobs;
> Disable node label, restart RM, and the following exception is thrown:
> {noformat}
> Caused by: org.apache.hadoop.yarn.exceptions.InvalidLabelResourceRequestException: Invalid resource request, node label not enabled but request contains label expression
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:225)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerUtils.normalizeAndValidateRequest(SchedulerUtils.java:248)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.validateAndCreateResourceRequest(RMAppManager.java:394)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.createAndPopulateNewRMApp(RMAppManager.java:339)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recoverApplication(RMAppManager.java:319)
>         at org.apache.hadoop.yarn.server.resourcemanager.RMAppManager.recover(RMAppManager.java:436)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager.recover(ResourceManager.java:1165)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$RMActiveServices.serviceStart(ResourceManager.java:574)
>         at org.apache.hadoop.service.AbstractService.start(AbstractService.java:193)
>         ... 10 more
> {noformat}
> During RM restart, application recovery failed because the application had a node label expression specified while node labels had been disabled.
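
The two approaches discussed in the comment (react to the exception inside the recovery loop, or check the conditions up front and skip the application) can be compared with a small, self-contained sketch. Everything in it is a hypothetical stand-in for illustration: RecoverySketch, StoredApp, InvalidRequestException and both recover methods are assumptions, not the actual RMAppManager code or the contents of YARN-6031.001.patch. In particular, the sketch only logs the failed application, whereas the patch is described as also failing the recovery of that application.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Hypothetical stand-ins for illustration only; these are NOT the real YARN classes.
class RecoverySketch {

    /** Stand-in for InvalidResourceRequestException. */
    static class InvalidRequestException extends Exception {
        InvalidRequestException(String message) { super(message); }
    }

    /** Stand-in for a stored application: an id plus an optional node label expression. */
    static class StoredApp {
        final String id;
        final String labelExpression; // null or empty means no label was requested

        StoredApp(String id, String labelExpression) {
            this.id = id;
            this.labelExpression = labelExpression;
        }

        boolean hasLabel() {
            return labelExpression != null && !labelExpression.isEmpty();
        }
    }

    /** Mimics the reported failure: a labelled request is rejected when node labels are off. */
    static void recoverApplication(StoredApp app, boolean nodeLabelsEnabled)
            throws InvalidRequestException {
        if (!nodeLabelsEnabled && app.hasLabel()) {
            throw new InvalidRequestException(
                "node label not enabled but request contains label expression");
        }
    }

    /** Approach 1: react only when the exception happens, then continue with the rest. */
    static List<String> recoverSwallowing(List<StoredApp> apps, boolean nodeLabelsEnabled) {
        List<String> recovered = new ArrayList<>();
        for (StoredApp app : apps) {
            try {
                recoverApplication(app, nodeLabelsEnabled);
                recovered.add(app.id);
            } catch (InvalidRequestException e) {
                System.err.println("Failed to recover application " + app.id
                    + ": " + e.getMessage());
            }
        }
        return recovered;
    }

    /** Approach 2: check the conditions for every application and skip before recovering. */
    static List<String> recoverWithPreCheck(List<StoredApp> apps, boolean nodeLabelsEnabled) {
        List<String> recovered = new ArrayList<>();
        for (StoredApp app : apps) {
            if (!nodeLabelsEnabled && app.hasLabel()) {
                System.err.println("Skip recovering application " + app.id);
                continue;
            }
            try {
                recoverApplication(app, nodeLabelsEnabled);
            } catch (InvalidRequestException e) {
                // Not expected here, because the same condition was checked above.
                throw new IllegalStateException("unexpected failure after pre-check", e);
            }
            recovered.add(app.id);
        }
        return recovered;
    }

    public static void main(String[] args) {
        List<StoredApp> apps = Arrays.asList(
            new StoredApp("app_1", null),
            new StoredApp("app_2", "gpu"),
            new StoredApp("app_3", ""));
        System.out.println(recoverSwallowing(apps, false));
        System.out.println(recoverWithPreCheck(apps, false));
    }
}
```

With node labels disabled, both variants recover app_1 and app_3 and leave out app_2, the only application with a label expression. The difference is the one noted in the comment: the pre-check runs for every application, while the try/catch variant only does extra work when the exception is actually thrown.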