hadoop-yarn-issues mailing list archives

From "Varun Saxena (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3884) App History status not updated when RMContainer transitions from RESERVED to KILLED
Date Thu, 23 Feb 2017 00:05:44 GMT

    [ https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15879519#comment-15879519
] 

Varun Saxena commented on YARN-3884:
------------------------------------

[~jlowe], [~leftnoteasy]
bq. So ATSv2 will not publish any metric for a container that was allocated to an app but
never launched?
Actually, we can enable publishing container information from the RM too in ATSv2. It is
just not recommended because it does not scale: a single process, the RM, would emit all the
entities, so if the volume and velocity of writes is high, it would increase the load on the RM.
Publishing entities via the NM is scalable because all the writes for an app go through a
single collector running on some node. Hence writes for containers across multiple apps are
distributed across multiple nodes (NMs).

You are right that if container publishing is not enabled in the RM, a container which has
been allocated but gets preempted/killed before it is acquired/launched by an AM will not be
reflected in the ATSv2 backend, because the NM never learns about it.
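For reference, RM-side container-event publishing can be switched on alongside ATSv2. A minimal yarn-site.xml sketch follows; the property names are assumptions based on recent Hadoop releases (in particular, yarn.rm.system-metrics-publisher.emit-container-events may differ or be absent in your version, so verify against your release's yarn-default.xml):

```xml
<!-- Sketch only: enable ATSv2 and let the RM publish container events.
     Property names are assumptions; check them against your Hadoop release. -->
<property>
  <name>yarn.timeline-service.enabled</name>
  <value>true</value>
</property>
<property>
  <name>yarn.timeline-service.version</name>
  <value>2.0f</value>
</property>
<property>
  <name>yarn.system-metrics-publisher.enabled</name>
  <value>true</value>
</property>
<property>
  <!-- Off by default, matching the RM load concern discussed above. -->
  <name>yarn.rm.system-metrics-publisher.emit-container-events</name>
  <value>true</value>
</property>
```

Turning this on trades completeness (events for containers that were allocated but never launched) against extra write load on the single RM process.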

> App History status not updated when RMContainer transitions from RESERVED to KILLED
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-3884
>                 URL: https://issues.apache.org/jira/browse/YARN-3884
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>         Environment: Suse11 Sp3
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>              Labels: oct16-easy
>         Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg, Elapsed Time.jpg,
Test Result-Container status.jpg, YARN-3884.0002.patch, YARN-3884.0003.patch, YARN-3884.0004.patch,
YARN-3884.0005.patch, YARN-3884.0006.patch, YARN-3884.0007.patch, YARN-3884.0008.patch
>
>
> Setup
> ===============
> 1 NM, 3072 MB and 16 cores each
> Steps to reproduce
> ===============
> 1. Submit apps to Queue 1 with 512 MB and 1 core
> 2. Submit apps to Queue 2 with 512 MB and 5 cores
> Lots of containers get reserved and unreserved in this case.
> {code}
> 2015-07-02 20:45:31,169 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0002_01_000013 Container Transitioned from NEW to RESERVED
> 2015-07-02 20:45:31,170 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Reserved container  application=application_1435849994778_0002 resource=<memory:512, vCores:5>
queue=QueueA: capacity=0.4, absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>,
usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, numContainers=5 usedCapacity=1.6410257
absoluteUsedCapacity=0.65625 used=<memory:2560, vCores:21> cluster=<memory:6144,
vCores:32>
> 2015-07-02 20:45:31,170 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4, absoluteCapacity=0.4,
usedResources=<memory:3072, vCores:26>, usedCapacity=2.0317461, absoluteUsedCapacity=0.8125,
numApps=1, numContainers=6
> 2015-07-02 20:45:31,170 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=0.96875 absoluteUsedCapacity=0.96875 used=<memory:5632,
vCores:31> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0001_01_000014 Container Transitioned from NEW to ALLOCATED
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=dsperf   OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS
 APPID=application_1435849994778_0001    CONTAINERID=container_e24_1435849994778_0001_01_000014
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Assigned container container_e24_1435849994778_0001_01_000014 of capacity <memory:512,
vCores:1> on host host-10-19-92-117:64318, which has 6 containers, <memory:3072, vCores:14>
used and <memory:0, vCores:2> available after allocation
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
assignedContainer application attempt=appattempt_1435849994778_0001_000001 container=Container:
[ContainerId: container_e24_1435849994778_0001_01_000014, NodeId: host-10-19-92-117:64318,
NodeHttpAddress: host-10-19-92-117:65321, Resource: <memory:512, vCores:1>, Priority:
20, Token: null, ] queue=default: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:2560,
vCores:5>, usedCapacity=2.0846906, absoluteUsedCapacity=0.41666666, numApps=1, numContainers=5
clusterResource=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting assigned queue: root.default stats: default: capacity=0.2, absoluteCapacity=0.2,
usedResources=<memory:3072, vCores:6>, usedCapacity=2.5016286, absoluteUsedCapacity=0.5,
numApps=1, numContainers=6
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=1.0 absoluteUsedCapacity=1.0 used=<memory:6144,
vCores:32> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,143 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0001_01_000014 Container Transitioned from ALLOCATED to ACQUIRED
> 2015-07-02 20:45:32,174 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Trying to fulfill reservation for application application_1435849994778_0002 on node: host-10-19-92-143:64318
> 2015-07-02 20:45:32,174 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Reserved container  application=application_1435849994778_0002 resource=<memory:512, vCores:5>
queue=QueueA: capacity=0.4, absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>,
usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, numContainers=6 usedCapacity=2.0317461
absoluteUsedCapacity=0.8125 used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,174 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Skipping scheduling since node host-10-19-92-143:64318 is reserved by application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:32,213 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0001_01_000014 Container Transitioned from ACQUIRED to RUNNING
> 2015-07-02 20:45:32,213 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Null container completed...
> 2015-07-02 20:45:33,178 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Trying to fulfill reservation for application application_1435849994778_0002 on node: host-10-19-92-143:64318
> 2015-07-02 20:45:33,178 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Reserved container  application=application_1435849994778_0002 resource=<memory:512, vCores:5>
queue=QueueA: capacity=0.4, absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>,
usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, numContainers=6 usedCapacity=2.0317461
absoluteUsedCapacity=0.8125 used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,178 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Skipping scheduling since node host-10-19-92-143:64318 is reserved by application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:33,704 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
Application application_1435849994778_0002 unreserved  on node host: host-10-19-92-143:64318
#containers=5 available=<memory:512, vCores:3> used=<memory:2560, vCores:13>,
currently has 0 at priority 20; currentReservation <memory:0, vCores:0>
> 2015-07-02 20:45:33,704 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
QueueA used=<memory:2560, vCores:21> numContainers=5 user=dsperf user-resources=<memory:2560,
vCores:21>
> 2015-07-02 20:45:33,710 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId: container_e24_1435849994778_0002_01_000013,
NodeId: host-10-19-92-143:64318, NodeHttpAddress: host-10-19-92-143:65321, Resource: <memory:512,
vCores:5>, Priority: 20, Token: null, ] queue=QueueA: capacity=0.4, absoluteCapacity=0.4,
usedResources=<memory:2560, vCores:21>, usedCapacity=1.6410257, absoluteUsedCapacity=0.65625,
numApps=1, numContainers=5 cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,710 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
completedContainer queue=root usedCapacity=0.9166667 absoluteUsedCapacity=0.9166667 used=<memory:5632,
vCores:27> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,711 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting completed queue: root.QueueA stats: QueueA: capacity=0.4, absoluteCapacity=0.4,
usedResources=<memory:2560, vCores:21>, usedCapacity=1.6410257, absoluteUsedCapacity=0.65625,
numApps=1, numContainers=5
> 2015-07-02 20:45:33,711 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application attempt appattempt_1435849994778_0002_000001 released container container_e24_1435849994778_0002_01_000013
on node: host: host-10-19-92-143:64318 #containers=5 available=<memory:512, vCores:3>
used=<memory:2560, vCores:13> with event: KILL
> {code}
> *Impact:*
> In the application history server the status gets updated to -1000 (INVALID),
> but the end time is not updated, so the Elapsed Time keeps changing.
> Please check the attached snapshot.



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)
