From: "Szilard Nemeth (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Wed, 3 Apr 2019 09:56:01 +0000 (UTC)
Subject: [jira] [Assigned] (YARN-9430) Recovering containers does not check available resources on node

     [ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szilard Nemeth reassigned YARN-9430:
------------------------------------

    Assignee:     (was: Szilard Nemeth)

> Recovering containers does not check available resources on node
> -----------------------------------------------------------------
>
>                 Key: YARN-9430
>                 URL: https://issues.apache.org/jira/browse/YARN-9430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Priority: Critical
>
> I have a testcase that checks that, when some GPU devices go offline and recovery happens, only the containers that still fit into the node's resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:*
> 1. There are 2 nodes running NodeManagers.
> 2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices per node. This means 4 GPU devices in the cluster altogether.
> 3. RM / NM recovery is enabled.
> 4. The test starts a sleep job, requesting 4 containers with 1 GPU device each (the AM does not request GPUs).
> 5. Before restart, the fake bash script is adjusted to report 1 GPU device per node (2 in the cluster) after restart.
> 6. Restart is initiated.
>
> *Expected behavior:*
> After restart, only the AM and 2 normal containers should be started, as there are only 2 GPU devices in the cluster.
>
> *Actual behavior:*
> The AM + 4 containers are allocated, i.e. all the containers started originally in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_000001 of type RECOVER
> 2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler from user: systest
> 2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_000001 is recovering. Skipping notifying ATTEMPT_ADDED
> 2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on event = RECOVER
> 2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000001, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000004, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000004 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, used and available after allocation
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000005, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to RUNNING
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000005 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, used and available after allocation
> 2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000003, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
> 2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to RUNNING
> 2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
> 2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000003 of capacity on host snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, used and available after allocation
> 2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering container container_e84_1553977186701_0001_01_000003
> {code}
>
> There are multiple log entries like this one:
> {code:java}
> Assigned container container_e84_1553977186701_0001_01_000005 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, used and available after allocation{code}
> *Note the -1 value for the yarn.io/gpu resource!*
> The issue lies in this method: [https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179]
> The problem is that deductUnallocatedResource does not check whether the unallocated resource stays at or above zero after the container's resource is subtracted from it, so it can go negative.
> Here is the ResourceManager call hierarchy for the method (from top to bottom):
> {code:java}
> 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
> 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
> 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
> 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
> 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
> 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer, boolean)
> deductUnallocatedResource is called here!{code}
> *Testcase that reproduces the issue:*
> *Add this testcase to TestFSSchedulerNode:*
>
> {code:java}
> @Test
> public void testRecovery() {
>   RMNode node = createNode();
>   FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
>   // Two containers that together fill the node's resources completely.
>   RMContainer container1 = createContainer(Resource.newInstance(4096, 4),
>       null);
>   RMContainer container2 = createContainer(Resource.newInstance(4096, 4),
>       null);
>
>   schedulerNode.allocateContainer(container1);
>   schedulerNode.containerStarted(container1.getContainerId());
>   schedulerNode.allocateContainer(container2);
>   schedulerNode.containerStarted(container2.getContainerId());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   // A third container that no longer fits is recovered onto the full node.
>   RMContainer container3 = createContainer(Resource.newInstance(1000, 1),
>       null);
>   when(container3.getState()).thenReturn(RMContainerState.NEW);
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   schedulerNode.recoverContainer(container3);
>   assertEquals("No resource should have been unallocated",
>       Resources.none(), schedulerNode.getUnallocatedResource());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
> }
> {code}
>
> *Result of testcase:*
> {code:java}
> java.lang.AssertionError: No resource should have been unallocated
> Expected :
> Actual   :{code}
> *It's immediately clear that not only GPUs (or other resource types) but all resources are affected by this issue!*
>
> *Possible fix:*
> 1. A condition needs to be introduced that checks whether there are enough resources on the node; the container's recovery should only proceed if this is true (see the sketch after this list).
> 2. An error log should be added. At first glance this seems sufficient, so no exception is required, but this needs a more thorough investigation and a manual test on a cluster!
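> For illustration only, here is a rough, untested sketch of what the guard from point 1 could look like inside SchedulerNode#recoverContainer. The member and helper names below (LOG, getUnallocatedResource, getNodeID, allocateContainer, Resources.fitsIn) are taken from the call hierarchy above and the existing scheduler code as I remember it; the exact placement of the check, and what should happen to a container that no longer fits, still need the investigation mentioned in point 2.
> {code:java}
> // Sketch only, not a patch. Assumes the existing SchedulerNode members and
> // the Resources.fitsIn() helper from org.apache.hadoop.yarn.util.resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
>   if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
>     return;
>   }
>   Resource required = rmContainer.getContainer().getResource();
>   // Proposed guard: only recover the container if it still fits into the
>   // node's unallocated resources; otherwise log an error and skip it.
>   if (!Resources.fitsIn(required, getUnallocatedResource())) {
>     LOG.error("Cannot recover container " + rmContainer.getContainerId()
>         + " on node " + getNodeID() + ": it requires " + required
>         + " but only " + getUnallocatedResource() + " is unallocated");
>     return;
>   }
>   allocateContainer(rmContainer, true);
> }
> {code}
> An alternative to silently skipping the container would be to route it through the normal completed/released-container path so the NM kills it; deciding between the two is part of the investigation in point 2.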