From: "Szilard Nemeth (JIRA)"
To: yarn-issues@hadoop.apache.org
Date: Wed, 3 Apr 2019 09:56:01 +0000 (UTC)
Subject: [jira] [Assigned] (YARN-9430) Recovering containers does not check available resources on node

     [ https://issues.apache.org/jira/browse/YARN-9430?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Szilard Nemeth reassigned YARN-9430:
------------------------------------

    Assignee:     (was: Szilard Nemeth)

> Recovering containers does not check available resources on node
> -----------------------------------------------------------------
>
>                 Key: YARN-9430
>                 URL: https://issues.apache.org/jira/browse/YARN-9430
>             Project: Hadoop YARN
>          Issue Type: Bug
>            Reporter: Szilard Nemeth
>            Priority: Critical
>
> I have a testcase that checks that, when some GPU devices go offline and recovery happens, only the containers that still fit into the node's resources are recovered. Unfortunately, this is not the case: the RM does not check the available resources on the node during recovery.
> *Detailed explanation:*
> *Testcase:*
> 1. There are 2 nodes running NodeManagers.
> 2. nvidia-smi is replaced with a fake bash script that initially reports 2 GPU devices per node. This means 4 GPU devices in the cluster altogether.
> 3. RM / NM recovery is enabled.
> 4. The test starts a sleep job, requesting 4 containers with 1 GPU device each (the AM does not request GPUs).
> 5. Before restart, the fake bash script is adjusted to report 1 GPU device per node (2 in the cluster) after restart.
> 6. Restart is initiated.
>
> *Expected behavior:*
> After restart, only the AM and 2 normal containers should be started, as there are only 2 GPU devices in the cluster.
>
> *Actual behavior:*
> The AM + 4 containers are allocated, i.e. all the containers started originally in step 4.
> App id was: 1553977186701_0001
> *Logs*:
>
> {code:java}
> 2019-03-30 13:22:30,299 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: Processing event for appattempt_1553977186701_0001_000001 of type RECOVER
> 2019-03-30 13:22:30,366 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: Added Application Attempt appattempt_1553977186701_0001_000001 to scheduler from user: systest
> 2019-03-30 13:22:30,366 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler: appattempt_1553977186701_0001_000001 is recovering. Skipping notifying ATTEMPT_ADDED
> 2019-03-30 13:22:30,367 INFO org.apache.hadoop.yarn.server.resourcemanager.rmapp.attempt.RMAppAttemptImpl: appattempt_1553977186701_0001_000001 State change from NEW to LAUNCHED on event = RECOVER
> 2019-03-30 13:22:33,257 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000001, CreateTime: 1553977260732, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000004, CreateTime: 1553977272802, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,275 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000004 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 2 containers, used and available after allocation
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000005, CreateTime: 1553977272803, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000005 of type RECOVER
> 2019-03-30 13:22:33,276 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000005 Container Transitioned from NEW to RUNNING
> 2019-03-30 13:22:33,276 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000005 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, used and available after allocation
> 2019-03-30 13:22:33,279 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler: Recovering container [container_e84_1553977186701_0001_01_000003, CreateTime: 1553977272166, Version: 0, State: RUNNING, Capability: , Diagnostics: , ExitStatus: -1000, NodeLabelExpression: Priority: 0]
> 2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: Processing container_e84_1553977186701_0001_01_000003 of type RECOVER
> 2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl: container_e84_1553977186701_0001_01_000003 Container Transitioned from NEW to RUNNING
> 2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.rmapp.RMAppImpl: Processing event for application_1553977186701_0001 of type APP_RUNNING_ON_NODE
> 2019-03-30 13:22:33,280 DEBUG org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode: Assigned container container_e84_1553977186701_0001_01_000003 of capacity on host snemeth-gpu-3.vpc.cloudera.com:8041, which has 2 containers, used and available after allocation
> 2019-03-30 13:22:33,280 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerApplicationAttempt: SchedulerAttempt appattempt_1553977186701_0001_000001 is recovering container container_e84_1553977186701_0001_01_000003
> {code}
>
> There are multiple log entries like this one:
> {code:java}
> Assigned container container_e84_1553977186701_0001_01_000005 of capacity on host snemeth-gpu-2.vpc.cloudera.com:8041, which has 3 containers, used and available after allocation{code}
> *Note the -1 value for the yarn.io/gpu resource!*
> The issue lies in this method: [https://github.com/apache/hadoop/blob/e40e2d6ad5cbe782c3a067229270738b501ed27e/hadoop-yarn-project/hadoop-yarn/hadoop-yarn-server/hadoop-yarn-server-resourcemanager/src/main/java/org/apache/hadoop/yarn/server/resourcemanager/scheduler/SchedulerNode.java#L179]
> The problem is that deductUnallocatedResource does not check whether the unallocated resource stays at or above zero after the container's resource is subtracted from it, so it can go negative.
> Here is the ResourceManager call hierarchy for the method (from top to bottom):
> {code:java}
> 1. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#handle
> 2. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler#addNode
> 3. org.apache.hadoop.yarn.server.resourcemanager.scheduler.AbstractYarnScheduler#recoverContainersOnNode
> 4. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#recoverContainer
> 5. org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSSchedulerNode#allocateContainer
> 6. org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode#allocateContainer(org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainer, boolean)
> deductUnallocatedResource is called here!{code}
> *Testcase that reproduces the issue:*
> *Add this testcase to TestFSSchedulerNode:*
>
> {code:java}
> @Test
> public void testRecovery() {
>   RMNode node = createNode();
>   FSSchedulerNode schedulerNode = new FSSchedulerNode(node, false);
>   // Two containers that together fill the node's resources completely.
>   RMContainer container1 = createContainer(Resource.newInstance(4096, 4),
>       null);
>   RMContainer container2 = createContainer(Resource.newInstance(4096, 4),
>       null);
>
>   schedulerNode.allocateContainer(container1);
>   schedulerNode.containerStarted(container1.getContainerId());
>   schedulerNode.allocateContainer(container2);
>   schedulerNode.containerStarted(container2.getContainerId());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   // A third container that no longer fits is recovered onto the full node.
>   RMContainer container3 = createContainer(Resource.newInstance(1000, 1),
>       null);
>   when(container3.getState()).thenReturn(RMContainerState.NEW);
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
>
>   schedulerNode.recoverContainer(container3);
>   assertEquals("No resource should have been unallocated",
>       Resources.none(), schedulerNode.getUnallocatedResource());
>   assertEquals("All resources of node should have been allocated",
>       nodeResource, schedulerNode.getAllocatedResource());
> }
> {code}
>
> *Result of testcase:*
> {code:java}
> java.lang.AssertionError: No resource should have been unallocated
> Expected :
> Actual   :{code}
> *It's immediately clear that not only GPUs (or other resource types) but all resources are affected by this issue!*
>
> *Possible fix:*
> 1. A condition needs to be introduced that checks whether there are enough resources on the node; the container's recovery should only proceed if this is true (see the sketch after this list).
> 2. An error log should be added. At first glance this seems sufficient, so no exception is required, but this needs a more thorough investigation and a manual test on a cluster!
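> For illustration only, here is a rough, untested sketch of what the guard from point 1 could look like inside SchedulerNode#recoverContainer. The member and helper names below (LOG, getUnallocatedResource, getNodeID, allocateContainer, Resources.fitsIn) are taken from the call hierarchy above and the existing scheduler code as I remember it; the exact placement of the check, and what should happen to a container that no longer fits, still need the investigation mentioned in point 2.
> {code:java}
> // Sketch only, not a patch. Assumes the existing SchedulerNode members and
> // the Resources.fitsIn() helper from org.apache.hadoop.yarn.util.resource.
> public synchronized void recoverContainer(RMContainer rmContainer) {
>   if (rmContainer.getState().equals(RMContainerState.COMPLETED)) {
>     return;
>   }
>   Resource required = rmContainer.getContainer().getResource();
>   // Proposed guard: only recover the container if it still fits into the
>   // node's unallocated resources; otherwise log an error and skip it.
>   if (!Resources.fitsIn(required, getUnallocatedResource())) {
>     LOG.error("Cannot recover container " + rmContainer.getContainerId()
>         + " on node " + getNodeID() + ": it requires " + required
>         + " but only " + getUnallocatedResource() + " is unallocated");
>     return;
>   }
>   allocateContainer(rmContainer, true);
> }
> {code}
> An alternative to silently skipping the container would be to route it through the normal completed/released-container path so the NM kills it; deciding between the two is part of the investigation in point 2.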