From "kyungwan nam (JIRA)" <j...@apache.org>
Subject [jira] [Created] (YARN-7177) AvailableMB, AvailableVCores in the QueueMetrics is not correct when there are nodes whose node-label is not default
Date Fri, 08 Sep 2017 03:30:02 GMT
kyungwan nam created YARN-7177:
----------------------------------

             Summary: AvailableMB, AvailableVCores in the QueueMetrics is not correct when
there are nodes whose node-label is not default
                 Key: YARN-7177
                 URL: https://issues.apache.org/jira/browse/YARN-7177
             Project: Hadoop YARN
          Issue Type: Bug
    Affects Versions: 2.7.1
            Reporter: kyungwan nam


- the default node-label (partition) has total resource <memory:248832, vCores:144>
- the 'label1' node-label has total resource <memory:248832, vCores:144>
- the 'large' and 'small' queues each have 50% of the default node-label capacity.
- the 'label1' queue has 100% of the 'label1' node-label capacity.
- an application using <memory:48128, vCores:13> is submitted to the 'small' queue
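
For reference, a minimal capacity-scheduler.xml sketch matching this setup might look like the following. The property names are the standard CapacityScheduler ones; the exact values here are assumed from the description above, not taken from the reporter's cluster.

{code}
<!-- sketch of a capacity-scheduler.xml matching the setup above (assumed values) -->
<property>
  <name>yarn.scheduler.capacity.root.queues</name>
  <value>large,small,label1</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.large.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.small.capacity</name>
  <value>50</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.label1.capacity</name>
  <value>0</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.label1.accessible-node-labels</name>
  <value>label1</value>
</property>
<property>
  <name>yarn.scheduler.capacity.root.label1.accessible-node-labels.label1.capacity</name>
  <value>100</value>
</property>
{code}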

We can see that AvailableMB and AvailableVCores are incorrect, as shown below.

{code}
{
name: "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=small",
modelerType: "QueueMetrics,q0=root,q1=small",
tag.Queue: "root.small",
tag.Context: "yarn",
tag.Hostname: "host1.com",
running_0: 1,
running_60: 0,
running_300: 0,
running_1440: 0,
AppsSubmitted: 1,
AppsRunning: 1,
AppsPending: 0,
AppsCompleted: 0,
AppsKilled: 0,
AppsFailed: 0,
AllocatedMB: 48128,
AllocatedVCores: 13,
AllocatedContainers: 13,
AggregateContainersAllocated: 17,
AggregateContainersReleased: 4,
AvailableMB: 200704,
AvailableVCores: 131,
PendingMB: 0,
PendingVCores: 0,
PendingContainers: 0,
ReservedMB: 0,
ReservedVCores: 0,
ReservedContainers: 0,
ActiveUsers: 0,
ActiveApplications: 0
},
{code}

I think they should be calculated based on the default node-label, as follows.
* AvailableMB = ( 248832 <default node-label total memory> - 48128 <used memory> ) * 0.5 <small queue capacity> = 100352
* AvailableVCores = ( 144 <default node-label total vCores> - 13 <used vCores> ) * 0.5 <small queue capacity> = 65.5
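
As a sanity check, the proposed calculation can be sketched in Java. The class and method names here are hypothetical, purely for illustration; this is not the actual QueueMetrics code.

```java
// Hypothetical helper illustrating the proposed partition-aware calculation;
// class and method names are invented for illustration only.
public class AvailableResourceCalc {

    // availableMB = (partition total MB - used MB) * queue capacity
    public static long availableMB(long partitionTotalMB, long usedMB, double queueCapacity) {
        return Math.round((partitionTotalMB - usedMB) * queueCapacity);
    }

    // availableVCores = (partition total vCores - used vCores) * queue capacity
    public static double availableVCores(int partitionTotalVCores, int usedVCores, double queueCapacity) {
        return (partitionTotalVCores - usedVCores) * queueCapacity;
    }

    public static void main(String[] args) {
        // values from the report: default partition <memory:248832, vCores:144>,
        // used <memory:48128, vCores:13>, 'small' queue capacity 0.5
        System.out.println(availableMB(248832, 48128, 0.5));   // 100352
        System.out.println(availableVCores(144, 13, 0.5));     // 65.5
    }
}
```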

There is another problem: usedCapacity and absoluteUsedCapacity in the log are also incorrect.

{code}
2017-09-07 16:21:06,058 INFO  capacity.LeafQueue (LeafQueue.java:releaseResource(1762)) -
small used=<memory:48128, vCores:13> numContainers=13 user=test user-resources=<memory:48128,
vCores:13>
2017-09-07 16:21:06,058 INFO  capacity.LeafQueue (LeafQueue.java:completedContainer(1713))
- completedContainer container=Container: [ContainerId: container_e15_1504768325902_0001_01_000017,
NodeId: host2.com:45454, NodeHttpAddress: host2.com:8042, Resource: <memory:4096, vCores:1>,
Priority: 1073741826, Token: Token { kind: ContainerToken, service: 10.10.10.1:45454 }, ]
queue=small: capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:48128, vCores:13>,
usedCapacity=0.19341564, absoluteUsedCapacity=0.09670782, numApps=1, numContainers=13 cluster=<memory:497664,
vCores:288>
{code}

Those values are calculated based on the total resources across all node-labels.
Likewise, they should be based on the default node-label, as follows.
* usedCapacity = 48128 <used memory> / ( 248832 <default node-label total memory> * 0.5 <small queue capacity> ) = 0.38683127
* absoluteUsedCapacity = 48128 <used memory> / 248832 <default node-label total memory> = 0.19341563
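
The same ratios can be checked with a short Java sketch. Again, the class and method names are invented for illustration; only the arithmetic follows the formulas above.

```java
// Hypothetical helper showing the proposed default-partition-based ratios;
// names invented for illustration only.
public class CapacityRatioCalc {

    // usedCapacity = used / (partition total * queue capacity)
    public static double usedCapacity(long usedMB, long partitionTotalMB, double queueCapacity) {
        return usedMB / (partitionTotalMB * queueCapacity);
    }

    // absoluteUsedCapacity = used / partition total
    public static double absoluteUsedCapacity(long usedMB, long partitionTotalMB) {
        return (double) usedMB / partitionTotalMB;
    }

    public static void main(String[] args) {
        System.out.println(usedCapacity(48128, 248832, 0.5));      // ~0.38683127
        System.out.println(absoluteUsedCapacity(48128, 248832));   // ~0.19341563
    }
}
```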

This is confusing, but that is not all: because absoluteUsedCapacity is used by ProportionalCapacityPreemptionPolicy, the wrong value can cause problems with preemption.

{code}
  private TempQueue cloneQueues(CSQueue root, Resource clusterResources) {
    TempQueue ret;
    synchronized (root) {
      String queueName = root.getQueueName();
      float absUsed = root.getAbsoluteUsedCapacity();
      float absCap = root.getAbsoluteCapacity();
      float absMaxCap = root.getAbsoluteMaximumCapacity();
      boolean preemptionDisabled = root.getPreemptionDisabled();
{code}

This problem does not seem to occur in Hadoop 2.8 or later, but it needs to be fixed for Hadoop 2.7.




--
This message was sent by Atlassian JIRA
(v6.4.14#64029)
