hadoop-yarn-issues mailing list archives

From "kyungwan nam (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-7177) AvailableMB, AvailableVCores in the QueueMetrics is not correct when there are nodes whose node-label is not default
Date Fri, 08 Sep 2017 09:37:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-7177?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16158390#comment-16158390 ]

kyungwan nam commented on YARN-7177:
------------------------------------

I have checked that, as of YARN-3849 (which is included in hadoop-2.7.3),
ProportionalCapacityPreemptionPolicy no longer uses absoluteUsedCapacity.
So there is no preemption problem in hadoop-2.7.3 or higher.

> AvailableMB, AvailableVCores in the QueueMetrics is not correct when there are nodes whose node-label is not default
> --------------------------------------------------------------------------------------------------------------------
>
>                 Key: YARN-7177
>                 URL: https://issues.apache.org/jira/browse/YARN-7177
>             Project: Hadoop YARN
>          Issue Type: Bug
>    Affects Versions: 2.7.1
>            Reporter: kyungwan nam
>         Attachments: YARN-7177-branch-2.7.001.patch
>
>
> - default-node-label has total resource <memory:248832, vCores:144>
> - 'label1' node-label has total resource <memory:248832, vCores:144>
> - 'large' and 'small' queues each have 50% of the default-node-label capacity.
> - 'label1' queue has 100% of the 'label1' node-label capacity.
> - An application using <memory:48128, vCores:13> is submitted to the 'small' queue.
> With this setup, AvailableMB and AvailableVCores are reported incorrectly, as follows.
> {code}
> {
> name: "Hadoop:service=ResourceManager,name=QueueMetrics,q0=root,q1=small",
> modelerType: "QueueMetrics,q0=root,q1=small",
> tag.Queue: "root.small",
> tag.Context: "yarn",
> tag.Hostname: "host1.com",
> running_0: 1,
> running_60: 0,
> running_300: 0,
> running_1440: 0,
> AppsSubmitted: 1,
> AppsRunning: 1,
> AppsPending: 0,
> AppsCompleted: 0,
> AppsKilled: 0,
> AppsFailed: 0,
> AllocatedMB: 48128,
> AllocatedVCores: 13,
> AllocatedContainers: 13,
> AggregateContainersAllocated: 17,
> AggregateContainersReleased: 4,
> AvailableMB: 200704,
> AvailableVCores: 131,
> PendingMB: 0,
> PendingVCores: 0,
> PendingContainers: 0,
> ReservedMB: 0,
> ReservedVCores: 0,
> ReservedContainers: 0,
> ActiveUsers: 0,
> ActiveApplications: 0
> },
> {code}
> I think they should be calculated based on the default-node-label, as follows.
> * AvailableMB = ( 248832 <default-node-label total resource> - 48128 <used resource> ) * 0.5 <small queue capacity>
> * AvailableVCores = ( 144 <default-node-label total resource> - 13 <used resource> ) * 0.5 <small queue capacity>
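
The arithmetic proposed above can be checked with a small standalone sketch. It is not Hadoop code: the class and method names are hypothetical, and it simply evaluates the formula from the description with the numbers given there.

```java
// Standalone sketch (hypothetical names, not part of Hadoop).
// Evaluates the formula proposed in the issue description:
//   available = (partition total - used) * queue capacity
public class AvailableResourceSketch {

    static double proposedAvailable(long partitionTotal, long used, double queueCapacity) {
        return (partitionTotal - used) * queueCapacity;
    }

    public static void main(String[] args) {
        long defaultLabelMemoryMb = 248832; // default-node-label total memory
        long defaultLabelVCores = 144;      // default-node-label total vCores
        long usedMemoryMb = 48128;          // used by the application in 'small'
        long usedVCores = 13;
        double smallQueueCapacity = 0.5;    // 'small' is 50% of default-node-label

        // prints 100352.0 (the metrics above report AvailableMB: 200704)
        System.out.println(proposedAvailable(defaultLabelMemoryMb, usedMemoryMb, smallQueueCapacity));
        // prints 65.5 (the metrics above report AvailableVCores: 131)
        System.out.println(proposedAvailable(defaultLabelVCores, usedVCores, smallQueueCapacity));
    }
}
```

The reported values are exactly twice the proposed ones for both memory and vCores.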
> The log shows another problem: absoluteUsedCapacity and usedCapacity are also incorrect.
> {code}
> 2017-09-07 16:21:06,058 INFO  capacity.LeafQueue (LeafQueue.java:releaseResource(1762))
- small used=<memory:48128, vCores:13> numContainers=13 user=test user-resources=<memory:48128,
vCores:13>
> 2017-09-07 16:21:06,058 INFO  capacity.LeafQueue (LeafQueue.java:completedContainer(1713))
- completedContainer container=Container: [ContainerId: container_e15_1504768325902_0001_01_000017,
NodeId: host2.com:45454, NodeHttpAddress: host2.com:8042, Resource: <memory:4096, vCores:1>,
Priority: 1073741826, Token: Token { kind: ContainerToken, service: 10.10.10.1:45454 }, ]
queue=small: capacity=0.5, absoluteCapacity=0.5, usedResources=<memory:48128, vCores:13>,
usedCapacity=0.19341564, absoluteUsedCapacity=0.09670782, numApps=1, numContainers=13 cluster=<memory:497664,
vCores:288>
> {code}
> Those values are calculated based on the total resources of all node-labels.
> Likewise, they should be based on the default-node-label, as follows.
> * usedCapacity = 48128 <used resource> / ( 248832 <default-node-label total resource> * 0.5 <small queue capacity> ) = 0.38683127
> * absoluteUsedCapacity = 48128 <used resource> / 248832 <default-node-label total resource> = 0.19341563
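
To make the difference concrete, the following standalone sketch evaluates both ratios twice: once against the cluster-wide total from the log (memory:497664) and once against the default-node-label total. It is not Hadoop code; the class and method names are hypothetical, and it considers memory only.

```java
// Standalone sketch (hypothetical names, not part of Hadoop); memory only.
public class UsedCapacitySketch {

    // used / (total * configured queue capacity)
    static double usedCapacity(long usedMb, long totalMb, double queueCapacity) {
        return usedMb / (totalMb * queueCapacity);
    }

    // used / total
    static double absoluteUsedCapacity(long usedMb, long totalMb) {
        return (double) usedMb / totalMb;
    }

    public static void main(String[] args) {
        long usedMb = 48128;           // usedResources of 'small' in the log
        long clusterTotalMb = 497664;  // cluster=<memory:497664, ...> in the log
        long defaultLabelMb = 248832;  // default-node-label total
        double smallCapacity = 0.5;

        // Against all node-labels: matches the logged 0.19341564 and 0.09670782
        System.out.println(usedCapacity(usedMb, clusterTotalMb, smallCapacity));
        System.out.println(absoluteUsedCapacity(usedMb, clusterTotalMb));

        // Against default-node-label only: the values proposed above
        System.out.println(usedCapacity(usedMb, defaultLabelMb, smallCapacity));
        System.out.println(absoluteUsedCapacity(usedMb, defaultLabelMb));
    }
}
```

With the cluster-wide total as the denominator, both ratios come out at half of the default-node-label based values, which agrees with the figures in the log.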
> This is confusing, but that is not all. Because absoluteUsedCapacity is used in
ProportionalCapacityPreemptionPolicy, a wrong value can cause problems with preemption.
> {code}
>   private TempQueue cloneQueues(CSQueue root, Resource clusterResources) {
>     TempQueue ret;
>     synchronized (root) {
>       String queueName = root.getQueueName();
>       // The preemption policy reads the queue's absoluteUsedCapacity here.
>       float absUsed = root.getAbsoluteUsedCapacity();
>       float absCap = root.getAbsoluteCapacity();
>       float absMaxCap = root.getAbsoluteMaximumCapacity();
>       boolean preemptionDisabled = root.getPreemptionDisabled();
>       // ... (excerpt; the rest of the method is not shown)
> {code}
> This problem does not seem to occur in hadoop-2.8 or higher,
> but we need to fix it for hadoop-2.7.


