hadoop-yarn-issues mailing list archives

From "Chen Yufei (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-8513) CapacityScheduler infinite loop when queue is near fully utilized
Date Sat, 18 Aug 2018 14:30:00 GMT

    [ https://issues.apache.org/jira/browse/YARN-8513?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=16584787#comment-16584787 ]

Chen Yufei commented on YARN-8513:
----------------------------------

New jstack, top, and RM logs have been uploaded, prefixed with yarn3. We upgraded to Hadoop 3.1.1
yesterday and have encountered this problem several times since.

The problem seems reproducible when one queue is nearly fully utilized. Killing the currently
active RM does not solve the problem. We have to kill some jobs in the fully utilized queue
before we can submit new jobs.

The debug log shows CapacityScheduler repeatedly trying to schedule on a specific node, but
because the queue's resource usage has exceeded its resource limit, the allocation proposal is
never accepted. The top command shows only one thread at nearly 100% CPU usage, and strace shows
that this is the thread trying to do the allocation and flushing out logs.
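
For anyone triaging a similar RM log, the following is a minimal sketch (not part of the original report) that counts how many "Failed to accept allocation proposal" lines appear per second. It assumes the timestamp layout seen in the log lines quoted in the issue description and that each log message occupies a single line in the actual log file. The function names and the `threshold` cutoff are hypothetical.

```python
import re
from collections import Counter

# Timestamp prefix as it appears in the quoted RM log lines,
# e.g. "2018-07-10 17:16:29,227 INFO ...". The milliseconds are
# dropped so that lines are bucketed per second.
TS = re.compile(r"^(\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}),\d{3} ")

# Message text taken from the quoted CapacityScheduler log line.
MARKER = "Failed to accept allocation proposal"


def rejected_per_second(lines):
    """Count lines containing MARKER, grouped by the second they were logged."""
    counts = Counter()
    for line in lines:
        match = TS.match(line)
        if match and MARKER in line:
            counts[match.group(1)] += 1
    return counts


def busiest_seconds(lines, threshold=1000):
    """Return sorted (second, count) pairs whose count is at least threshold.

    The default threshold is an arbitrary illustrative value.
    """
    counts = rejected_per_second(lines)
    return sorted((sec, n) for sec, n in counts.items() if n >= threshold)
```

Passing an open file object (for example, one of the attached RM logs) to `busiest_seconds` would list the seconds in which the scheduler rejected an unusually large number of proposals, which is one way to confirm the tight loop described above.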

I've tried digging into the source code, but can't figure out why the RM repeatedly tries to
schedule on a specific node.

Some notes about our setup:

* 3 partitions: default, sim, gpu
* 4 queues: dev & mkt (10% capacity, max 90%), dev-daily & mkt-daily (40% capacity, max 100%)
* preemption disabled
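
As a rough illustration of the queue settings listed above, the sketch below renders them as CapacityScheduler property names and computes how far each queue may grow beyond its guaranteed share. It assumes all four queues are direct children of root, which the notes do not state, and it ignores per-partition capacities, which are not given. The helper names are hypothetical.

```python
# Queue settings from the notes: name -> (capacity %, maximum-capacity %).
QUEUES = {
    "dev": (10, 90),
    "mkt": (10, 90),
    "dev-daily": (40, 100),
    "mkt-daily": (40, 100),
}


def capacity_scheduler_properties(queues, parent="root"):
    """Render queue settings as capacity-scheduler.xml style name/value pairs.

    Assumption: every queue is a direct child of `parent`.
    """
    props = {f"yarn.scheduler.capacity.{parent}.queues": ",".join(queues)}
    for name, (capacity, maximum) in queues.items():
        prefix = f"yarn.scheduler.capacity.{parent}.{name}"
        props[f"{prefix}.capacity"] = str(capacity)
        props[f"{prefix}.maximum-capacity"] = str(maximum)
    return props


def headroom(queues):
    """Percentage points each queue may use beyond its guaranteed capacity."""
    return {name: maximum - capacity for name, (capacity, maximum) in queues.items()}


# Sibling capacities under one parent are expected to add up to 100.
assert sum(capacity for capacity, _ in QUEUES.values()) == 100
```

With these numbers every queue can grow well past its guarantee (80 points for dev and mkt, 60 for the daily queues), so a queue sitting at or near its maximum is a plausible state for this cluster.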

> CapacityScheduler infinite loop when queue is near fully utilized
> -----------------------------------------------------------------
>
>                 Key: YARN-8513
>                 URL: https://issues.apache.org/jira/browse/YARN-8513
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: capacity scheduler, yarn
>    Affects Versions: 3.1.0, 2.9.1
>         Environment: Ubuntu 14.04.5 and 16.04.4
> YARN is configured with one label and 5 queues.
>            Reporter: Chen Yufei
>            Priority: Major
>         Attachments: jstack-1.log, jstack-2.log, jstack-3.log, jstack-4.log, jstack-5.log,
top-during-lock.log, top-when-normal.log, yarn3-jstack1.log, yarn3-jstack2.log, yarn3-jstack3.log,
yarn3-jstack4.log, yarn3-jstack5.log, yarn3-resourcemanager.log, yarn3-top
>
>
> ResourceManager sometimes stops responding to any request when a queue is nearly fully utilized.
Sending SIGTERM won't stop the RM; only SIGKILL can. After the RM restarts, it can recover running
jobs and start accepting new ones.
>  
> It seems that CapacityScheduler is in an infinite loop, printing out the following log messages
(more than 25,000 lines per second):
>  
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.ParentQueue:
assignedContainer queue=root usedCapacity=0.99816763 absoluteUsedCapacity=0.99816763 used=<memory:16170624,
vCores:1577> cluster=<memory:29441544, vCores:5792>}}
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.CapacityScheduler:
Failed to accept allocation proposal}}
> {{2018-07-10 17:16:29,227 INFO org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.AbstractContainerAllocator:
assignedContainer application attempt=appattempt_1530619767030_1652_000001 container=null
queue=org.apache.hadoop.yarn.server.resourcemanager.scheduler.capacity.allocator.RegularContainerAllocator@14420943
clusterResource=<memory:29441544, vCores:5792> type=NODE_LOCAL requestedPartition=}}
>  
> I have encountered this problem several times since upgrading to YARN 2.9.1, while the same
configuration worked fine under version 2.7.3.
>  
> YARN-4477 is an infinite loop bug in FairScheduler; I'm not sure whether this is a similar problem.
>  



--
This message was sent by Atlassian JIRA
(v7.6.3#76005)


