hadoop-yarn-issues mailing list archives

Site index · List index
Message view « Date » · « Thread »
Top « Date » · « Thread »
From "David Watzke (JIRA)" <j...@apache.org>
Subject [jira] [Commented] (YARN-1458) FairScheduler: Zero weight can lead to livelock
Date Thu, 24 Mar 2016 11:24:25 GMT

    [ https://issues.apache.org/jira/browse/YARN-1458?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15210097#comment-15210097
] 

David Watzke commented on YARN-1458:
------------------------------------

Seems to me that this is not fixed in 2.6.0. We've hit this bug with CDH 5.4.4 which is shipped
with patched hadoop 2.6.0

{noformat}
"FairSchedulerUpdateThread" daemon prio=10 tid=0x00007f550c0f5800 nid=0x1155 runnable [0x00007f54fdf4d000]
  java.lang.Thread.State: RUNNABLE
       at java.lang.StrictMath.log1p(Native Method)
       at java.lang.Math.log1p(Math.java:1236)
       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:510)
       - locked <0x00000007a58d9430> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSAppAttempt.getWeights(FSAppAttempt.java:749)
       at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:191)
{noformat}

disabling size-based weights helped immediately (apps were running again)

> FairScheduler: Zero weight can lead to livelock
> -----------------------------------------------
>
>                 Key: YARN-1458
>                 URL: https://issues.apache.org/jira/browse/YARN-1458
>             Project: Hadoop YARN
>          Issue Type: Bug
>          Components: scheduler
>    Affects Versions: 2.2.0
>         Environment: Centos 2.6.18-238.19.1.el5 X86_64
> hadoop2.2.0
>            Reporter: qingwu.fu
>            Assignee: zhihai xu
>              Labels: patch
>         Attachments: YARN-1458.001.patch, YARN-1458.002.patch, YARN-1458.003.patch, YARN-1458.004.patch,
YARN-1458.006.patch, YARN-1458.alternative0.patch, YARN-1458.alternative1.patch, YARN-1458.alternative2.patch,
YARN-1458.patch, yarn-1458-5.patch, yarn-1458-7.patch, yarn-1458-8.patch
>
>   Original Estimate: 408h
>  Remaining Estimate: 408h
>
> The ResourceManager$SchedulerEventDispatcher$EventProcessor blocked when clients submit
lots jobs, it is not easy to reapear. We run the test cluster for days to reapear it. The
output of  jstack command on resourcemanager pid:
> {code}
>  "ResourceManager Event Processor" prio=10 tid=0x00002aaab0c5f000 nid=0x5dd3 waiting
for monitor entry [0x0000000043aa9000]
>    java.lang.Thread.State: BLOCKED (on object monitor)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.removeApplication(FairScheduler.java:671)
>         - waiting to lock <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:1023)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.handle(FairScheduler.java:112)
>         at org.apache.hadoop.yarn.server.resourcemanager.ResourceManager$SchedulerEventDispatcher$EventProcessor.run(ResourceManager.java:440)
>         at java.lang.Thread.run(Thread.java:744)
> ……
> "FairSchedulerUpdateThread" daemon prio=10 tid=0x00002aaab0a2c800 nid=0x5dc8 runnable
[0x00000000433a2000]
>    java.lang.Thread.State: RUNNABLE
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.getAppWeight(FairScheduler.java:545)
>         - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.AppSchedulable.getWeights(AppSchedulable.java:129)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShare(ComputeFairShares.java:143)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.resourceUsedWithWeightToResourceRatio(ComputeFairShares.java:131)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.ComputeFairShares.computeShares(ComputeFairShares.java:102)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.policies.FairSharePolicy.computeShares(FairSharePolicy.java:119)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSLeafQueue.recomputeShares(FSLeafQueue.java:100)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FSParentQueue.recomputeShares(FSParentQueue.java:62)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler.update(FairScheduler.java:282)
>         - locked <0x000000070026b6e0> (a org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler)
>         at org.apache.hadoop.yarn.server.resourcemanager.scheduler.fair.FairScheduler$UpdateThread.run(FairScheduler.java:255)
>         at java.lang.Thread.run(Thread.java:744)
> {code}



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

Mime
View raw message