hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "Sandy Ryza (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1458) hadoop2.2.0 fairscheduler ResourceManager Event Processor thread blocked
Date Sun, 01 Dec 2013 20:06:35 GMT

    [ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13836103#comment-13836103
] 

Sandy Ryza commented on YARN-1458:
----------------------------------

Thanks for tracking this down.  That makes a lot of sense.  While the change you propose may
ultimately be what we choose fix this, I'd like to understand why this is happening.  I.e.
resourceUsedWithWeightToResourceRatio should never return 0, so we should find out what's
causing it to return that.

For resourceUsedWithWeightToResourceRatio to return 0, computeShare must be returning 0 for
the schedulable.  Here is computeShare:
{code}
    double share = sched.getWeights().getWeight(type) * w2rRatio;
    share = Math.max(share, getResourceValue(sched.getMinShare(), type));
    share = Math.min(share, getResourceValue(sched.getMaxShare(), type));
    return (int) share;
{code}
For computeShare to return 0, either the AppSchedulable's weight must be 0 or the max share
must be 0.  An app's weight is 1, unless size based weight is turned on or you've provided
a custom weight adjuster.  An app's max share is always Integer.MAX_VALUE. Have you turned
on size based weight or provided a custom weight adjuster?

> hadoop2.2.0 fairscheduler ResourceManager Event Processor thread blocked
> ------------------------------------------------------------------------
>
>                 Key: YARN-1458
>                 URL: https://issues.apache.org/jira/browse/YARN-1458
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.2.0
>         Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>            Reporter: qingwu.fu
>              Labels: patch
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit
lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The
output of  jstack command on resourcemanager pid:
>  "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 waiting
for monitor entry [0x0000000043aa9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
>         - waiting to lock <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>         at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 runnable
[0x00000000433a2000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
>         - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
>         - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
>         at java.lang.Thread.run(Thread.java:744)



--
This message was sent by Atlassian JIRA
(v6.1#6144)

Mime
View raw message