hadoop-yarn-issues mailing list archives

From "Sunil G (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-3884) App History status not updated when RMContainer transitions from RESERVED to KILLED
Date Tue, 21 Feb 2017 09:26:44 GMT

    [ https://issues.apache.org/jira/browse/YARN-3884?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15875662#comment-15875662
] 

Sunil G commented on YARN-3884:
-------------------------------

A few points to discuss:
1. The RMContainer is never moved to a final state when the container is in RESERVED state and
the scheduler no longer needs it (i.e., the container will not move to RUNNING, or the reservation
did not succeed for this container). As per the code, I think the container state stays stuck in
RESERVED in this scenario. This has not been an issue so far because the scheduler cleanly removes
this container from its own data structures.
2. Building on point 1, ideally we want closure for such containers. In brief, the
scheduler has to fire an event to indicate that a reserved container will no longer be used,
and the RMContainer has to move to the appropriate final state.

Now, coming to the patch: I think {{FiCaSchedulerApp.unreserve}} is a better place to raise
an event to the container. With this change, any container event could reach an RMContainer in
RESERVED state, so there is a chance of invalid state transitions. At first
glance, though, the basic events appear to be handled in RESERVED state. Maybe you could take a look
to check whether I missed any.
Could we use the FinishedTransition of RMContainerImpl, which already handles updating the finish time
etc.? The only extra thing it does is send an event to RMAppAttempt, which could be skipped if the
transition is coming from RESERVED. Would that be better? I discussed this with [~rohithsharma]; please
add anything I missed.
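
To make the proposal concrete, here is a minimal, self-contained Java sketch of the transition under discussion. It is not the actual RMContainerImpl code: the class, enum, and field names (ReservedContainerSketch, attemptNotified, and so on) are hypothetical, and the real state machine has many more states and events. It only illustrates a KILL arriving in RESERVED state recording a finish time while skipping the notification to the app attempt.

```java
// Hypothetical sketch; none of these names come from the Hadoop code base.
final class ReservedContainerSketch {
    enum State { NEW, RESERVED, ALLOCATED, RUNNING, KILLED }
    enum Event { RESERVE, START, LAUNCH, KILL }

    static final class Container {
        State state = State.NEW;
        long finishTime = -1L;            // -1 means "not finished yet" in this sketch
        boolean attemptNotified = false;  // stands in for the event sent to the app attempt

        void handle(Event event, long now) {
            State previous = state;
            switch (event) {
                case RESERVE:
                    // A reservation can be made, or renewed, before allocation.
                    require(previous == State.NEW || previous == State.RESERVED, event);
                    state = State.RESERVED;
                    break;
                case START:
                    require(previous == State.NEW || previous == State.RESERVED, event);
                    state = State.ALLOCATED;
                    break;
                case LAUNCH:
                    require(previous == State.ALLOCATED, event);
                    state = State.RUNNING;
                    break;
                case KILL:
                    require(previous != State.NEW && previous != State.KILLED, event);
                    // Always record the finish time so history shows a fixed end time.
                    finishTime = now;
                    // A reserved container never ran, so the attempt is not notified.
                    attemptNotified = previous != State.RESERVED;
                    state = State.KILLED;
                    break;
            }
        }

        private void require(boolean valid, Event event) {
            if (!valid) {
                throw new IllegalStateException(
                    "Invalid event " + event + " in state " + state);
            }
        }
    }

    public static void main(String[] args) {
        Container reserved = new Container();
        reserved.handle(Event.RESERVE, 100L);
        reserved.handle(Event.KILL, 200L);  // e.g. raised from an unreserve path
        System.out.println(reserved.state + " finishTime=" + reserved.finishTime
            + " attemptNotified=" + reserved.attemptNotified);
    }
}
```

Reusing one finishing path for every source state keeps the finish-time bookkeeping in a single place; the only branch is whether the attempt needs to hear about it.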

> App History status not updated when RMContainer transitions from RESERVED to KILLED
> -----------------------------------------------------------------------------------
>
>                 Key: YARN-3884
>                 URL: https://issues.apache.org/jira/browse/YARN-3884
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: resourcemanager
>         Environment: Suse11 Sp3
>            Reporter: Bibin A Chundatt
>            Assignee: Bibin A Chundatt
>              Labels: oct16-easy
>         Attachments: 0001-YARN-3884.patch, Apphistory Container Status.jpg, Elapsed Time.jpg,
Test Result-Container status.jpg, YARN-3884.0002.patch, YARN-3884.0003.patch, YARN-3884.0004.patch,
YARN-3884.0005.patch, YARN-3884.0006.patch, YARN-3884.0007.patch, YARN-3884.0008.patch
>
>
> Setup
> ===============
> 1 NM 3072 16 cores each
> Steps to reproduce
> ===============
> 1. Submit apps to Queue 1 with 512 MB and 1 core
> 2. Submit apps to Queue 2 with 512 MB and 5 cores
> Lots of containers get reserved and unreserved in this case
> {code}
> 2015-07-02 20:45:31,169 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0002_01_000013 Container Transitioned from NEW to RESERVED
> 2015-07-02 20:45:31,170 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Reserved container  application=application_1435849994778_0002 resource=<memory:512, vCores:5>
queue=QueueA: capacity=0.4, absoluteCapacity=0.4, usedResources=<memory:2560, vCores:21>,
usedCapacity=1.6410257, absoluteUsedCapacity=0.65625, numApps=1, numContainers=5 usedCapacity=1.6410257
absoluteUsedCapacity=0.65625 used=<memory:2560, vCores:21> cluster=<memory:6144,
vCores:32>
> 2015-07-02 20:45:31,170 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting assigned queue: root.QueueA stats: QueueA: capacity=0.4, absoluteCapacity=0.4,
usedResources=<memory:3072, vCores:26>, usedCapacity=2.0317461, absoluteUsedCapacity=0.8125,
numApps=1, numContainers=6
> 2015-07-02 20:45:31,170 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=0.96875 absoluteUsedCapacity=0.96875 used=<memory:5632,
vCores:31> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0001_01_000014 Container Transitioned from NEW to ALLOCATED
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.RMAuditLogger:
USER=dsperf   OPERATION=AM Allocated Container        TARGET=SchedulerApp     RESULT=SUCCESS
 APPID=application_1435849994778_0001    CONTAINERID=container_e24_1435849994778_0001_01_000014
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.SchedulerNode:
Assigned container container_e24_1435849994778_0001_01_000014 of capacity <memory:512,
vCores:1> on host host-10-19-92-117:64318, which has 6 containers, <memory:3072, vCores:14>
used and <memory:0, vCores:2> available after allocation
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
assignedContainer application attempt=appattempt_1435849994778_0001_000001 container=Container:
[ContainerId: container_e24_1435849994778_0001_01_000014, NodeId: host-10-19-92-117:64318,
NodeHttpAddress: host-10-19-92-117:65321, Resource: <memory:512, vCores:1>, Priority:
20, Token: null, ] queue=default: capacity=0.2, absoluteCapacity=0.2, usedResources=<memory:2560,
vCores:5>, usedCapacity=2.0846906, absoluteUsedCapacity=0.41666666, numApps=1, numContainers=5
clusterResource=<memory:6144, vCores:32>
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting assigned queue: root.default stats: default: capacity=0.2, absoluteCapacity=0.2,
usedResources=<memory:3072, vCores:6>, usedCapacity=2.5016286, absoluteUsedCapacity=0.5,
numApps=1, numContainers=6
> 2015-07-02 20:45:31,191 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=1.0 absoluteUsedCapacity=1.0 used=<memory:6144,
vCores:32> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,143 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0001_01_000014 Container Transitioned from ALLOCATED to ACQUIRED
> 2015-07-02 20:45:32,174 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Trying to fulfill reservation for application application_1435849994778_0002 on node: host-10-19-92-143:64318
> 2015-07-02 20:45:32,174 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Reserved container  application=application_1435849994778_0002 resource=<memory:512, vCores:5>
queue=QueueA: capacity=0.4, absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>,
usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, numContainers=6 usedCapacity=2.0317461
absoluteUsedCapacity=0.8125 used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:32,174 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Skipping scheduling since node host-10-19-92-143:64318 is reserved by application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:32,213 INFO org.apache.hadoop.yarn.server.resourcemanager.rmcontainer.RMContainerImpl:
container_e24_1435849994778_0001_01_000014 Container Transitioned from ACQUIRED to RUNNING
> 2015-07-02 20:45:32,213 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Null container completed...
> 2015-07-02 20:45:33,178 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Trying to fulfill reservation for application application_1435849994778_0002 on node: host-10-19-92-143:64318
> 2015-07-02 20:45:33,178 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
Reserved container  application=application_1435849994778_0002 resource=<memory:512, vCores:5>
queue=QueueA: capacity=0.4, absoluteCapacity=0.4, usedResources=<memory:3072, vCores:26>,
usedCapacity=2.0317461, absoluteUsedCapacity=0.8125, numApps=1, numContainers=6 usedCapacity=2.0317461
absoluteUsedCapacity=0.8125 used=<memory:3072, vCores:26> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,178 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Skipping scheduling since node host-10-19-92-143:64318 is reserved by application appattempt_1435849994778_0002_000001
> 2015-07-02 20:45:33,704 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.common.fica.FiCaSchedulerApp:
Application application_1435849994778_0002 unreserved  on node host: host-10-19-92-143:64318
#containers=5 available=<memory:512, vCores:3> used=<memory:2560, vCores:13>,
currently has 0 at priority 20; currentReservation <memory:0, vCores:0>
> 2015-07-02 20:45:33,704 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
QueueA used=<memory:2560, vCores:21> numContainers=5 user=dsperf user-resources=<memory:2560,
vCores:21>
> 2015-07-02 20:45:33,710 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.LeafQueue:
completedContainer container=Container: [ContainerId: container_e24_1435849994778_0002_01_000013,
NodeId: host-10-19-92-143:64318, NodeHttpAddress: host-10-19-92-143:65321, Resource: <memory:512,
vCores:5>, Priority: 20, Token: null, ] queue=QueueA: capacity=0.4, absoluteCapacity=0.4,
usedResources=<memory:2560, vCores:21>, usedCapacity=1.6410257, absoluteUsedCapacity=0.65625,
numApps=1, numContainers=5 cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,710 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
completedContainer queue=root usedCapacity=0.9166667 absoluteUsedCapacity=0.9166667 used=<memory:5632,
vCores:27> cluster=<memory:6144, vCores:32>
> 2015-07-02 20:45:33,711 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
Re-sorting completed queue: root.QueueA stats: QueueA: capacity=0.4, absoluteCapacity=0.4,
usedResources=<memory:2560, vCores:21>, usedCapacity=1.6410257, absoluteUsedCapacity=0.65625,
numApps=1, numContainers=5
> 2015-07-02 20:45:33,711 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Application attempt appattempt_1435849994778_0002_000001 released container container_e24_1435849994778_0002_01_000013
on node: host: host-10-19-92-143:64318 #containers=5 available=<memory:512, vCores:3>
used=<memory:2560, vCores:13> with event: KILL
> {code}
> *Impact:*
> In the application history server, the status is updated to -1000 (INVALID),
> but the end time is not updated, so the Elapsed Time keeps changing.
> Please check the attached snapshot.
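
The changing Elapsed Time described above is what one would expect if elapsed time falls back to the current clock whenever the finish time is unset. The Java sketch below illustrates that behaviour under this assumption; ElapsedTimeSketch and its elapsed method are hypothetical names and are not taken from the history server code.

```java
// Hypothetical sketch; assumes a finish time <= 0 means "never recorded".
final class ElapsedTimeSketch {
    static long elapsed(long startTime, long finishTime, long now) {
        // Without a recorded finish time, measure against the current clock,
        // so the value grows every time the page is refreshed.
        long end = finishTime > 0 ? finishTime : now;
        return end - startTime;
    }

    public static void main(String[] args) {
        long start = 1_000L;
        // Finish time never set: the result depends on when it is computed.
        System.out.println(elapsed(start, 0L, 5_000L));
        System.out.println(elapsed(start, 0L, 9_000L));
        // Finish time set: the result is stable regardless of "now".
        System.out.println(elapsed(start, 3_000L, 9_000L));
    }
}
```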



--
This message was sent by Atlassian JIRA
(v6.3.15#6346)

---------------------------------------------------------------------
To unsubscribe, e-mail: yarn-issues-unsubscribe@hadoop.apache.org
For additional commands, e-mail: yarn-issues-help@hadoop.apache.org

